
Asset Analytics Best Practices - Part 2: Data Density and Data Pattern

Blog Post created by sraposo on Mar 28, 2019

If you're looking for Part 1: Use Variables, it's right here.

 

Asset Analytics Best Practices Blog Posts:

In the upcoming months I will be publishing several blog posts on different Asset Analytics best practices.

 

The goal of these blog posts is to help prevent customers from falling into common pitfalls that lead to poor performance of the PI Analysis Service. This will be done by:

  1. Increasing the visibility of the PI Analysis Service Best Practices knowledge base article.
  2. Providing concrete examples that show why it's important to follow our best practices.

 

To show the performance of various setups, I will be using the logs of the PI Analysis Processor (pianalysisprocessor.exe) with the Performance Evaluation logger set to Trace. If you are not familiar with the logs, loggers and logging levels, more information can be found in our documentation here. Alternatively, there is also a full troubleshooting walkthrough video on YouTube here. The video shows a troubleshooting approach using the PI Analysis Processor logs; it does not, however, go over best practices.

 

Asset Analytics Best Practices - Part 2: Data Density and Data Pattern

To explain the best practices in this post, I'll start with a problem statement and work backwards. I'll first identify the symptoms of the issue, show the troubleshooting steps, identify the root cause and then fix it by applying some best practices.

 

Consider the following scenario: One morning you come into work and notice that your Analysis Service is skipping a ton of calculations and the skip count is continuously increasing:

 

This data comes from the PI Analysis Service performance counter named Skipped Count. If you are not currently storing the values of the PI Analysis Service performance counters in PI Points using the PI Interface for Performance Monitoring, I would strongly recommend that you start doing so. The data in these performance counters is extremely valuable for both monitoring and troubleshooting.
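If you just want a quick ad-hoc look at the counter before setting up the interface, Windows' built-in typeperf tool can poll it. This is only a peek, not a replacement for storing the counters in PI Points, and the counter path below is an assumption; list the real names on your system with typeperf -q and adjust:

```python
import subprocess

# Assumed counter path -- verify the exact category/counter names on your
# system with `typeperf -q` and adjust accordingly.
COUNTER = r"\PI Analysis Service\Skipped Count"

# Take 10 samples, one second apart (-sc = sample count, -si = interval in s).
subprocess.run(["typeperf", COUNTER, "-sc", "10", "-si", "1"])
```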

 

A ton of users in your company configure analyses, and you have no idea what changed since yesterday, when the system was performing well with no skipping at all.

 

Before jumping into the troubleshooting, you first research why the analysis service skips calculations. You find some hints in this piece of documentation:

 

 

The hints are:

  1. A group of calculations consists of all analyses that are based on the same analysis template and have the same scheduling.*
  2. If an entire group (not a specific analysis, but the analysis template as a whole) takes too long to execute and more than 50 evaluations are queued, the service will start skipping the oldest evaluation requests (there's a toy model of this behavior below).
  3. There is a configuration parameter that can be set to False to avoid skipping.

 

* Analyses that aren't based on a template are grouped together solely by scheduling. Dependent analyses are grouped together based on scheduling.
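To make that skipping behavior concrete, here is a toy model in Python. This is not the actual service code, just the queue-and-skip logic from hint #2, with the 50-evaluation limit standing in for EvaluationsToQueueBeforeSkipping:

```python
from collections import deque

MAX_QUEUE = 50        # stands in for EvaluationsToQueueBeforeSkipping

queue = deque()       # pending evaluation requests for ONE group
skipped_count = 0     # what the Skipped Count performance counter tracks

def on_trigger(timestamp):
    """A new evaluation request arrives for the group."""
    global skipped_count
    queue.append(timestamp)
    # With load shedding enabled, once the backlog exceeds the limit
    # the OLDEST requests are dropped and counted as skipped.
    while len(queue) > MAX_QUEUE:
        queue.popleft()
        skipped_count += 1

# If triggers arrive faster than evaluations complete, the backlog stays
# full and skipped_count climbs continuously -- the exact symptom above.
for t in range(100):   # 100 triggers arrive before any evaluation finishes
    on_trigger(t)
print(skipped_count)   # -> 50
```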

 

Now you may be thinking: ok, let's increase EvaluationsToQueueBeforeSkipping from 50 to something higher, or let's set IsLoadSheddingEnabled to false. At this point in the troubleshooting, neither approach is the correct one. Skipping indicates that something may not be working well.** It's a much better idea to figure out what's not working well and fix it than to simply get the service to cope better by increasing the EvaluationsToQueueBeforeSkipping parameter (and using more resources) and hoping it works out. If you were to set IsLoadSheddingEnabled to false, you would turn your skipping problem into a lagging problem.

 

** It's possible that you may need to change either parameter, but first you should always identify the cause and determine whether it can be fixed at the source!

 

Ok so what's the next step? We need to figure out which group is skipping!

 

We could set the Performance Evaluation logger to Trace and look for analyses that take a long time to evaluate. This may be a good approach in smaller systems, but it's more painful in larger systems, and for this setup it wouldn't help at all. Instead, let's look at the PI Analysis Service Statistics. You can get to them in PI System Explorer: on the Management tab, under Operations, right-click:

 

 

In the statistics, we can look for the SkippedEvaluationPercentage displayed for each group. I cheated a bit for this blog post: there is only one group in my system, which makes the screenshots a little nicer.

 

Ok, so now we've found the group that is skipping. It's an analysis template named Add_One on the element template High_Frequency_Example in the AF database called PISquare_Blog. Let's look at the configuration to see if this analysis template is doing anything that seems to be very expensive. We'll keep that best practices KB article in mind in case we're not sure!

 

The configuration is:

 

Ok, this is a very simple analysis; judging by the name, Add_One does little more than add one to its input. It can't really get any simpler than that. Maybe the attribute 'Input' has an expensive data reference (a big linked table, an expensive formula, etc.)?

 

It's a simple PI Point data reference.

Ok, this is very strange: this very simple analysis with a very simple input isn't performing well. Let's go back to the statistics to see if we can pull out another hint. Let's look at the EvaluationSummary:

 

This analysis template group is triggered on average every 10 ms!

That's a very high triggering frequency. At this point we should figure out why the frequency is so high and fix it. Ideally, a group shouldn't trigger more often than once per second. Let's keep digging a little more, because the AverageElapsedMilliSeconds is pretty low at 0.1 ms. Depending on the number of analyses in this group, it could still perform well enough. We can figure out the number of analyses by searching in the Management tab using the template as a filter, or we can look directly in the statistics:

 

 

Ok, so there are almost 1000 analyses in this analysis group. It makes sense now that, with such a high triggering frequency, this group would perform badly even though the analysis template has a simple configuration.
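A quick back-of-the-envelope check with the numbers from the statistics shows why (a sketch that ignores any parallelism within the group):

```python
# Approximate numbers pulled from the statistics above.
analyses_in_group = 1000    # analyses built from the Add_One template
avg_eval_ms = 0.1           # AverageElapsedMilliSeconds per evaluation
trigger_period_ms = 10      # the group triggers on average every 10 ms

work_per_trigger_ms = analyses_in_group * avg_eval_ms
print(f"~{work_per_trigger_ms:.0f} ms of work queued every {trigger_period_ms} ms")
# ~100 ms of evaluation work arrives every 10 ms: work arrives about 10x
# faster than it can complete, so the backlog grows past the 50-evaluation
# limit and the service has to skip.
```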

 

Why is the triggering frequency so high? If I look at one of the triggering PI Points, it updates every second:

 

 

So why isn't the analysis group triggered every second? Let's look at another triggering PI Point on another element:

 

 

Ok, so even though the PI Points update every 1 s, they don't update at exactly the same sub-second offset. Given that my analysis template is configured as event-triggered, these analyses get triggered multiple times within the same second!
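A toy simulation of this effect (plain Python, not service internals; the 10 ms coalescing window is my assumption, chosen to mirror the ~10 ms average trigger period observed above):

```python
import random

random.seed(1)
TICK_MS = 10  # assumed window in which near-simultaneous events merge into one trigger

# Each of 1000 input points updates once per second, at its own sub-second offset.
offsets_ms = [random.randrange(1000) for _ in range(1000)]

# Count how many distinct windows per second contain at least one event.
triggers_per_second = len({off // TICK_MS for off in offsets_ms})
print(f"group triggers per second: ~{triggers_per_second}")
# With 1000 offsets spread over 100 windows, virtually every window is hit:
# the group fires ~100 times per second instead of once per second.
```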

 

Now we've found the exact root cause: the input points update at staggered sub-second offsets, so the analysis group triggers at sub-second intervals, and there are almost 1000 analyses in this group.

 

What are some possible solutions then?

 

1 - We could set the scheduling to periodic, every 1 s. The only impact this would have is that the value of the input would be interpolated. This might not be a problem, or we could work around it using PrevVal() (sketched a bit further down). Does this resolve the issue? From my system:

 

 

Ok, so this does resolve the issue. We can see that the skipped count flatlines after the change. But is this the best solution?
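About that PrevVal() workaround: with a periodic schedule, the analysis sees the interpolated value of 'Input' at the trigger time, whereas PrevVal('Input', '*') returns the most recent recorded value instead. A toy illustration in plain Python (not analysis syntax):

```python
# Two recorded events on the input point: (timestamp in seconds, value).
events = [(0.3, 10.0), (1.3, 12.0)]

def interpolated(t):
    """What a periodic evaluation sees by default at time t."""
    (t0, v0), (t1, v1) = events
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def prev_val(t):
    """Roughly what PrevVal('Input', '*') returns: the most recent recorded value."""
    return [v for ts, v in events if ts <= t][-1]

print(interpolated(1.0))  # 11.4 -- a value that was never actually recorded
print(prev_val(1.0))      # 10.0 -- the latest real event
```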

 

2 - Now that I think about it, I don't actually need data every 1 s in those PI Points; I only need data every 30 s. Why would I store data that I don't need?

I can:

  • Change the scan rate on the interface/connector that is writing to these PI Points.
  • Change the Exception filtering settings so the PI Data Archive doesn't receive meaningless noise.
  • Change the Compression filtering settings so data is compressed out when moving from the snapshot to the archive. This wouldn't help with the triggering issue in this example, but it would help with the performance of any summary functions (TagAvg(), EventCount(), etc.). A hedged sketch of the point attributes involved follows this list.
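As an illustration of the last two items, these are the PI point attributes that control exception and compression filtering. The helper and tag name below are made-up placeholders; in practice you would edit these attributes with piconfig, PI Builder, or PI SMT:

```python
def update_pi_point(tag: str, attrs: dict) -> None:
    """Placeholder only -- in practice you'd edit these attributes with
    piconfig, PI Builder, or PI SMT."""
    print(f"would set {attrs} on {tag}")

# The attribute names are real PI point attributes; the values are purely
# illustrative, not tuning recommendations, and the tag name is made up.
update_pi_point("High_Frequency_Tag_001", {
    "excdev": 0.5,      # exception deviation: drop snapshot updates smaller than this
    "excmax": 30,       # but always send an event after 30 s without one
    "compressing": 1,   # enable compression between snapshot and archive
    "compdev": 1.0,     # compression deviation: archive only meaningful changes
    "compmax": 300,     # but always archive an event after 300 s without one
})
```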

 

I know this was a long one, so here are some key takeaways.

 

Takeaways:

  1. When you're running into an issue, make sure you really understand what the issue is before starting to troubleshoot. Knowing what the issue is will help steer you in the right direction for troubleshooting.
  2. Don't jump the gun and change a configuration parameter without understanding why it absolutely needs to be changed and what the implications are.
  3. Once you've identified the root cause, there may be more than one solution. The best solution is typically the one closest to the source of the issue, the one that will prevent similar issues from coming up again.

(In this example, if we had changed the analysis to periodic 1 s, someone else might have fallen into the same trap with different analyses using the same PI Points!)

 

Best Practices:

  1. Data that isn't needed shouldn't be received by the PI Data Archive (Interface Scan Rate & Exception filtering).
  2. Data that doesn't need to be stored shouldn't be stored in the PI Data Archive (Compression filtering).
  3. Be mindful of the scheduling on analyses. Don't run an analysis more frequently than it needs to be run.

 

Other Notes:

  • The data reference in this case was a PI Point. In terms of data density, be especially mindful of table lookups: the larger the table, the slower the lookup, and therefore the slower the evaluation time in Asset Analytics.
