If you're looking for Part 1: Use Variables, it's right here.
If you're looking for Part 2: Data Density and Data Pattern, it's right here.
If you're looking for Part 3: Input Attributes, it's right here.
If you're looking for Part 4: Analyses in warning or error, it's right here.
Asset Analytics Best Practices Blog Posts:
Over the coming months I will be publishing several blog posts on different Asset Analytics best practices.
The goal of these blog posts is to help prevent customers from falling into common pitfalls that lead to poor performance of the PI Analysis Service. This will be done by:
- Increasing the visibility of the PI Analysis Service Best Practices knowledge base article.
- Providing concrete examples that show why it's important to follow our best practices.
To show the performance of various setups, I may be using the logs of the PI Analysis Processor (PIAnalysisProcessor.exe) with the Performance Evaluation logger set to Trace. If you are not familiar with the logs, loggers and logging levels, more information can be found in our documentation here. Alternatively, there is also a full troubleshooting walkthrough video on YouTube here. The video shows a troubleshooting approach using the PI Analysis Processor logs; it does not, however, go over best practices.
Asset Analytics Best Practices - Part 5: Scheduling
This one will be short and similar to Nitin's blog post published back in 2017. If you haven't read Nitin's blog post, I'd highly recommend you do so. It's a great starting point for learning how to troubleshoot performance issues with Asset Analytics. You can find it here.
Let's consider the following analysis template, with this configuration:
and this scheduling:
The 'Input' attribute at the element level will be mapped to some of our random PI Points (sinusoid, cdt158, etc.).
Of course, calculating a moving 3-year average every second doesn't make much sense for most use cases. Unfortunately, however, we see this type of setup very frequently in customer systems. The point I'm trying to get across with this post is that some thought needs to be put into the scheduling of analyses, as there is a strong correlation between scheduling and performance.
IsLoadSheddingEnabled is set to False in this system. This is an important point to consider!
The Maximum Latency performance counter shows this alarming trend over the last 6 hours:
This tells us that there is at least one group of analyses (analyses based on the same template with the same scheduling) that is perpetually falling behind after each evaluation.
To find which group(s) is/are falling behind, we can have a look at the service statistics. These are accessible through PI System Explorer, on the Management tab, under Operations: right click > View Analysis Service Statistics:
A lot of the run time statistics are also available programmatically via the AF SDK as of version 2018 SP2 (reference).
In the statistics we can quickly find the groups by looking at the CurrentLag statistic. In recent versions, the groups are ordered in descending order of CurrentLag. The group with the highest CurrentLag should be at the top of the list. This is the case in my system (2018 SP2). The only group lagging by much more than the default 5s wait time is this:
As we can see, this group is 4:45:13 behind, and based on the trend of the Maximum Latency performance counter, this value will continue to increase.
If we drill down and look at the EvaluationSummary for the group, we will see WHY the group is perpetually lagging:
This group of analyses is composed of 17 individual analyses. It takes about 1.4s (AverageElapsedMilliSeconds) to evaluate the group, and the group is triggered every 1s (AverageTriggerMilliSeconds; since the analysis is periodic, this is simply the scheduling period). Consequently, every time this group of analyses is evaluated, the system falls behind by about 0.4s. Since it always takes longer to evaluate the group than the rate at which evaluations are queued, the group falls perpetually behind and will never catch up.
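To make the arithmetic concrete, here is a small Python sketch. It is illustrative only: the numbers mirror the statistics above, and it is in no way how the PI Analysis Service scheduler is actually implemented.

```python
# Illustrative sketch: how lag accumulates when the time to evaluate a
# group exceeds its trigger period. Numbers mirror the statistics above.
trigger_period_s = 1.0  # AverageTriggerMilliSeconds (periodic schedule)
elapsed_s = 1.4         # AverageElapsedMilliSeconds (time to evaluate group)

lag_s = 0.0
for evaluation in range(10):
    # Each trigger queues a new evaluation, but finishing one takes 1.4s,
    # so the backlog grows by (elapsed - period) on every cycle.
    lag_s += max(0.0, elapsed_s - trigger_period_s)

print(f"Lag after 10 evaluations: {lag_s:.1f}s")  # Lag after 10 evaluations: 4.0s
```

The deficit never shrinks, so the lag grows without bound, which is exactly the trend the Maximum Latency performance counter shows.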
Some may argue that it's odd for a group of only 17 analyses to fall behind. However, if you think about it, it takes only 1.4s to calculate 3-year averages for 17 PI Points, some of which update every 30s. The performance is very good. The issue here is that with a 1s schedule we're simply not giving the analysis service enough time to perform the calculations.
What's the fix here? We need to change the scheduling in order to give the analysis service enough time to perform the calculations. We could in theory change the scheduling from 1s to 2s, and this would resolve this performance issue. However, a better solution would be to base the schedule on what is actually needed for our use case. This will help ensure our system is as scalable as it can be.
How often does it make sense to calculate a moving 3-year average? It really depends on the use case. It could be once a day, once a week or once a month. For this example, let's set it to once a week, on Monday.
You'll notice that in the scheduling options, the longest available period is one day:
However, with the addition of the Exit() function in 2018 SP2, we can tweak the logic to only calculate the average once a week.
With the above daily schedule, the new configuration for a weekly average is:
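As a sketch of what such a configuration could look like in analysis expression syntax (this assumes the attribute is named 'Input', as in the template above, and uses the Weekday() function, which returns 1 for Sunday through 7 for Saturday, so 2 is Monday):

```
If Weekday('*') <> 2 Then Exit()
Else TagAvg('Input', '*-3y', '*')
```

With the analysis scheduled once a day, Exit() skips the evaluation on every day except Monday, so the expensive 3-year average is only computed once a week.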
I hope you trust me that this fixed the performance issue. I'll have to wait until next Monday to confirm!
Key takeaways:
- It's important to schedule analyses on an as-needed basis. There is no value in calculating analyses more frequently than needed, and doing so adds a considerable load on the service.
- Be very mindful when configuring event-triggered analyses. If there are many triggering inputs, the analysis may be evaluated at a much higher frequency than expected. For example, two triggering inputs that each update every 1s may have timestamps that are a few milliseconds apart. This would cause the analysis service to perform the calculation twice within the same second.
- We added an Exit() function in 2018 SP2. It's very useful for avoiding unnecessary evaluations. If you can upgrade, it's worth it!
- I mentioned that IsLoadSheddingEnabled was set to false in this system. What would have happened if it had been set to true? Would we still have a performance issue?
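The event-triggered caveat above is easy to underestimate, so here is a small Python sketch (illustrative only, not how the service is implemented) of why millisecond-skewed inputs can double the trigger rate:

```python
# Illustrative sketch: two event-triggered inputs that both update every
# 1 second, but whose timestamps are a few milliseconds apart. Each
# distinct timestamp triggers its own evaluation, so the analysis runs
# twice per second instead of once.
input_a = [round(t * 1.0, 3) for t in range(5)]          # 0.000, 1.000, ...
input_b = [round(t * 1.0 + 0.003, 3) for t in range(5)]  # 0.003, 1.003, ...

# One evaluation is queued per unique trigger timestamp.
trigger_times = sorted(set(input_a) | set(input_b))
print(len(input_a), "updates per input, but", len(trigger_times), "evaluations")
# 5 updates per input, but 10 evaluations
```

This is why aligning timestamps at the data source, or choosing only the inputs that should truly trigger the analysis, can cut the evaluation load dramatically.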
If you have any questions or comments, please post below!