
Troubleshooting PI Analysis Service Performance Issues: High Latency

Blog Post created by nagarwal Employee on Apr 24, 2017

Introduction

 

Asset Analytics is a feature of PI Asset Framework (PI AF) that allows you to perform calculations on your PI data using a set of built-in analysis types – Performance Equations, Statistical Quality Control (SQC), and Aggregation (Rollup). Some of the features that make Asset Analytics extremely useful are:

  • Ability to create calculations using a configuration-driven environment (PI System Explorer) – you don’t need to be a programming wizard to implement your business logic
  • AF templates that allow you to write calculation logic once and instantly apply it to many assets.

While Asset Analytics makes it easy to create calculations and apply them to thousands (or millions) of assets with a few clicks, you may still find yourself in a situation where the system becomes hard to manage and keep running smoothly. One recommendation is to pay attention to best practices and tips & tricks when building your calculations (as discussed here).

 

Assuming you already did that, you may occasionally still need to troubleshoot issues like:

  • High latency: excessive delays in calculations
  • Skipping evaluations
  • Excessive CPU or memory consumption, etc.

In such situations, it is useful to know what tools are at your disposal for troubleshooting performance-related issues.

 

In this blog post, we will walk through a scenario where the results of some calculations are being generated with significant delay (high latency), and we would like to understand the root cause. Along the way, we will look at some new features added to the PI Analysis Service for the upcoming PI Server 2017 release, which give the service additional capabilities to keep track of the amount of time spent evaluating each analysis and performing other expensive tasks, such as retrieving input data using AF data pipes or publishing results (writing values to AF attributes or checking in new Event Frames). We can use these detailed evaluation statistics to track down the worst-performing analyses, optimize them, and get the system back to running smoothly.

 

Setup

 

I imported the Asset Based PI Example Kit for Mobile Asset Performance Monitoring and, for the purpose of this demonstration, added some analyses that would cause the system to exhibit high lag. We’ll talk about them later, as the whole point of this blog post is to find these bad actors.

 

Tip: If you have not already done so, you should check out the great example kits available here. I recently used some of these for testing and demos, and found them very useful.

 

When troubleshooting performance-related issues with Asset Analytics, you will find the following useful:

  • Performance Counters – PI Analysis Service creates several Windows performance counters that can be very useful in monitoring service health. I would highly recommend that you historize these counters (and save the data in the PI System). If you ever run into a performance issue, you can then look at the historical data for these counters and identify the time periods where performance degradation occurred. In addition to the counters described here, I would also recommend adding:
    • Process\% Processor Time
    • Process\Private Bytes
    • Process\Virtual Bytes

These counters should be added for PIAnalysisProcessor, PIAnalysisManager and PIRecalculationProcessor instances. For this discussion, I created the counters as shown here.
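If you want to grab a quick, ad-hoc sample of these counters while reproducing an issue (historizing them with a PI interface remains the better long-term option), a minimal sketch such as the one below can do it from the command line. It shells out to the standard Windows typeperf utility; the counter paths, sample interval, and output file name are assumptions you should adapt to your environment.

```python
# Minimal sketch (assumptions noted above): sample PI Analysis Service
# performance counters with the built-in Windows typeperf utility.
import subprocess

COUNTERS = [
    r"\Process(PIAnalysisProcessor)\% Processor Time",
    r"\Process(PIAnalysisProcessor)\Private Bytes",
    r"\Process(PIAnalysisProcessor)\Virtual Bytes",
    r"\Process(PIAnalysisManager)\% Processor Time",
    r"\Process(PIRecalculationProcessor)\% Processor Time",
]

# Take 12 samples, 5 seconds apart, and write them to a CSV file for later review.
subprocess.run(
    ["typeperf", *COUNTERS, "-si", "5", "-sc", "12", "-f", "CSV", "-o", "counters.csv", "-y"],
    check=True,
)
```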

 

 

  • Analysis Service Statistics – PI Analysis Service exposes detailed service statistics in the Management plugin in PI System Explorer. Select Management from the navigation pane (on the left) and, once in the Management UI, right-click anywhere under Pending Operations and select View Analysis Service Statistics. Make sure that the PI Analysis Service is running and that PI System Explorer is able to connect to the service.

 

Troubleshooting

 

After running the PI Analysis Service for a few hours, notice from the trend below that the maximum latency value keeps increasing and the current value is around 4903 seconds (the maximum latency performance counter reports the value in milliseconds).

 

 

So, what is this maximum latency anyway?

Maximum latency is the maximum among the latest latency values reported for all analyses. Latency is calculated as the amount of time (in milliseconds) from receiving a trigger event to evaluating and publishing the analysis results for that trigger event. A large and growing maximum latency is a common symptom of performance issues. This value also includes the time that the PI Analysis Service waits before evaluating calculations, in order to make sure that data for all inputs is available. This wait time can be configured by editing the CalculationWaitTimeInSeconds parameter; the default value is 5 seconds. For more information, see the PI System Explorer topic Analysis service configuration (or the latest version).
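To make the definition concrete, here is a small illustrative calculation of latency for a single trigger event. The timestamps are made up for the example, and the 5-second figure is the default CalculationWaitTimeInSeconds mentioned above.

```python
# Illustrative only: latency for one trigger event, as defined above.
from datetime import datetime

trigger_time = datetime(2017, 4, 24, 10, 0, 0)  # trigger event received
publish_time = datetime(2017, 4, 24, 10, 0, 7)  # results published

latency_ms = (publish_time - trigger_time).total_seconds() * 1000
print(f"Latency: {latency_ms:.0f} ms")  # 7000 ms

# About 5000 ms of this is the configured CalculationWaitTimeInSeconds,
# leaving ~2000 ms for queuing, evaluation and publishing.
print(f"Excluding wait time: {latency_ms - 5000:.0f} ms")
```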

 

Note that this is the maximum among the latest values reported for all analyses. This means it is possible that most analyses in our system are working fine (i.e. producing timely results), and perhaps only a few analyses are causing the system to be unstable.

 

In order to understand the analysis service statistics, one needs to know a few things about how the calculation engine evaluates analyses at scale. Note that this is an oversimplification of some of the implementation details; the emphasis here is just to outline the key concepts required for the discussion at hand:

  • The calculation engine groups analyses together based on their schedule, analysis template and dependency information (i.e. whether the output of an analysis is used as an input for one or more other analyses).
  • Analyses within a group are evaluated together – for the same trigger event – before moving on to the next trigger event (generated based on the configured schedule). This ensures that analyses are evaluated in the correct time order.
  • The calculation engine is highly optimized to evaluate many groups in parallel or, wherever possible, to split a group across multiple evaluation threads.
  • The calculation engine automatically handles dependencies, which means that a calculation group further down the dependency chain must wait for the upstream groups to finish evaluating. This can introduce delays in analyses that are not themselves expensive to evaluate, but have to wait for parent analyses to finish (see the sketch after this list).
  • The calculation engine periodically retrieves data for analysis inputs using bulk calls and caches it, so that analyses do not have to make expensive remote calls (RPCs) to fetch data during each evaluation. In cases where the data required for an evaluation is not available in the data cache (e.g. when performing a summary calculation over a long time range), the service needs to retrieve it from the source (e.g. a remote PI Data Archive), which can be expensive.
  • In most cases, the majority of the time is spent fetching input data and publishing results (data I/O), and this is often the factor limiting scalability. In other words, the calculations are usually not CPU bound but I/O bound.
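To make these ideas concrete, here is a highly simplified, hypothetical sketch of the evaluation loop described above. It is not the actual PI Analysis Service implementation; it only illustrates evaluating all analyses in a group for the same trigger event, processing trigger events in time order, and making dependent groups wait for their parents. All names in it are made up for the illustration.

```python
# Conceptual sketch only - not the PI Analysis Service implementation.
# A group holds analyses that share a schedule/template; 'parents' models
# upstream groups whose outputs feed this group's inputs.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CalculationGroup:
    name: str
    analyses: List[Callable[[float], None]]  # each analysis is evaluate(trigger_time)
    parents: List["CalculationGroup"] = field(default_factory=list)
    last_evaluated: float = float("-inf")

    def evaluate(self, trigger_time: float) -> None:
        # A dependent group waits until its parents have evaluated this trigger event.
        for parent in self.parents:
            if parent.last_evaluated < trigger_time:
                parent.evaluate(trigger_time)
        # All analyses in the group are evaluated for the same trigger event
        # before the group moves on to the next one.
        for analysis in self.analyses:
            analysis(trigger_time)
        self.last_evaluated = trigger_time


def run(groups: List[CalculationGroup], trigger_times: List[float]) -> None:
    # Trigger events are processed in time order.
    for t in sorted(trigger_times):
        for group in groups:
            if group.last_evaluated < t:
                group.evaluate(t)
```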

 

 

With that understanding, let’s get back to the high latency problem. From looking at the performance counter data, we know that the maximum latency keeps increasing, and we would like to know what’s causing it. Let’s see what information is available in the Analysis Service Statistics:

  • From the Management plugin, we bring up the Analysis Service Statistics window.
  • Then we navigate to the EvaluationStatistics node. In the bottom pane, you can see system-level statistics, which are very useful for getting a good idea of the overall scale of the system – in terms of data I/O rate and calculation throughput. In addition to these, we have the following two important metrics:
    • CurrentLag: If this duration is high and keeps increasing with time, it usually indicates that the PI Analysis Service is not able to keep up and is falling behind. If that’s the case, you will want to dig deeper to find out which analyses are causing it. In this case, we clearly have a high lag (1 hour 25 minutes).
    • EvaluationSkippedCount: If this number is non-zero, it implies that the PI Analysis Service has skipped some evaluations. We can look at the statistics for individual groups to see which groups have skipped evaluations. Note: PI Analysis Service skips evaluations when necessary if the parameter IsLoadSheddingEnabled is set to true (enabled by default); a conceptual sketch of load shedding follows below.
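For intuition only, here is a purely conceptual sketch of load shedding: when the backlog of pending trigger events grows beyond what can be evaluated in time, the oldest triggers are dropped and counted as skipped so the group can catch up. This illustrates the general idea, not the service’s actual algorithm; the function name and threshold below are made up.

```python
# Conceptual illustration of load shedding - not the actual service logic.
from collections import deque

def shed_load(pending_triggers: deque, max_backlog: int) -> int:
    """Drop the oldest pending trigger events until the backlog fits.

    Returns the number of skipped evaluations (compare EvaluationSkippedCount).
    """
    skipped = 0
    while len(pending_triggers) > max_backlog:
        pending_triggers.popleft()  # oldest trigger event is skipped
        skipped += 1
    return skipped

# Example: 10 trigger events queued, but we only allow a backlog of 4.
backlog = deque(range(10))
print(shed_load(backlog, max_backlog=4))  # 6 evaluations skipped
```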

 

  • Following this, we expand the EvaluationStatistics node and navigate to the EvaluationStatisticsForGroups node, which lists statistics for each calculation group that has triggered at least once. The groups are listed in descending order of lag. Recall that I described earlier how the calculation engine evaluates analyses in groups, so it should not be a surprise that the evaluation statistics are also reported per group. As you navigate the statistics for each group, you will note that:
    • Most group nodes are listed in the format: Template=[analysis template path] Schedule=[schedule for the analysis template].
    • Some groups are listed in the format Template=Multiple Schedule=[schedule]. These groups represent analyses that have dependencies – i.e. the output of one analysis is used as an input for some other analysis. You can think of these special groups as groups of groups: groups of analyses that may come from different analysis templates, but all share the same schedule. All analyses belonging to such a group are evaluated together (for a specific trigger event), such that analyses from templates that are higher in the dependency graph are evaluated first, followed by analyses that depend on them, and so on. Once the entire group is evaluated for a specific trigger event, the calculation engine moves on to the next trigger event.

 

(Figure: Analysis Service Statistics – EvaluationStatisticsForGroups node)

 

  • In this example, the group with title Template=Multiple Schedule=Periodic[Hertz=1] has a very high lag (1 hour 25 minutes), while all other groups have a lag of less than 12 seconds (not visible in the picture, but trust me, it’s true) – which is reasonable, keeping in mind that this value includes the default wait time of 5 seconds. From this observation, we know immediately that the analyses belonging to the group Template=Multiple Schedule=Periodic[Hertz=1] are the ones causing the high latency issue.
  • We can get more information about any group by expanding the corresponding node. In this case, we expand the Template=Multiple Schedule=Periodic[Hertz=1] node and select the EvaluationSummary[Template=Multiple] node, which shows useful information for this specific group (in the bottom pane). Note that these statistics have been accumulated since the service was started (2 hours 19 minutes in this case, as shown by the Duration property). We have the following information available:
    • Count: Number of times this group has evaluated
    • FirstTriggerTime: First trigger time evaluated
    • LastTriggerTime: Latest trigger time evaluated
    • AverageLagMilliSeconds: Average lag in milliseconds since service started
    • AverageElapsedMilliSeconds: Average amount of time spent (in milliseconds) in evaluating this group since service started
    • AverageTriggerMilliSeconds: Average time between two triggering events. We already know this value for analyses with a periodic schedule (e.g. 1000 ms for analyses scheduled to trigger every second), but it can be interesting to note for event-triggered analyses.
    • AverageAnalysisCount: Average number of analyses evaluated for the same trigger event. This is the same as the number of analyses from the template(s) for a periodic schedule, but it could be different for event-triggered analyses, since not all analyses may trigger at the same time.

In this case, we can see that the group triggers every 1000 ms, but takes ~2594 ms to evaluate on average. This is obviously a problem.
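A quick back-of-the-envelope check, using the numbers above, shows why the lag keeps growing and by roughly how much: triggers arrive every second, but each one takes about 2.6 seconds to evaluate, so the group falls a little further behind on every trigger.

```python
# Rough estimate of lag growth from the statistics above (illustrative arithmetic).
trigger_interval_s = 1.0        # AverageTriggerMilliSeconds = 1000 ms
elapsed_per_trigger_s = 2.594   # AverageElapsedMilliSeconds ~ 2594 ms
running_time_s = 2 * 3600 + 19 * 60  # Duration ~ 2 hours 19 minutes

# In running_time_s seconds, ~running_time_s triggers arrive, but only
# running_time_s / elapsed_per_trigger_s of them can be evaluated.
evaluated = running_time_s / elapsed_per_trigger_s
backlog_triggers = running_time_s / trigger_interval_s - evaluated
lag_s = backlog_triggers * trigger_interval_s

print(f"Estimated lag: {lag_s / 60:.0f} minutes")  # ~85 minutes, matching CurrentLag
```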

 

 

  • For groups with dependencies (Template=Multiple), the ChildrenSummary node further lists information about each template, in descending order of average elapsed time. In this case, we have two analysis templates that each trigger every 1000 ms, but take about 1413 ms and 1178 ms respectively (a quick arithmetic check follows the list below):
    • Template: Air Inlet Pressure 90th Percentile

    • Template: Air Inlet Pressure Percentile
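As a quick sanity check on these numbers, the per-template averages roughly add up to the group’s total elapsed time, which is what you would expect when the dependent template has to wait for its parent to finish before it can evaluate the same trigger event.

```python
# The two templates in the dependency chain evaluate one after the other for
# each trigger event, so their average elapsed times roughly sum to the
# group's average elapsed time (numbers taken from the statistics above).
percentile_90_ms = 1413  # Template: Air Inlet Pressure 90th Percentile
percentile_ms = 1178     # Template: Air Inlet Pressure Percentile

group_total_ms = percentile_90_ms + percentile_ms
print(group_total_ms)  # ~2591 ms, close to the group's ~2594 ms average
```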

 

  • For each analysis template, we also show evaluation statistics for the 5 most expensive analyses (ordered by average elapsed time), which can be valuable in finding the worst-performing analyses in that group.
  • Now that we know the two analysis templates that are responsible for the high lag, we can look at their configuration to see what they are doing. In this case, as you would have guessed, these happen to be the ones that I created at the beginning to cause high latency. We don’t really need to understand the calculation logic here; it is enough to know that they perform an expensive 7-day summary (on 10-second input data) and are scheduled to evaluate every second. This is obviously far too aggressive for a weekly summary and, in most cases, would be the result of a misconfiguration (a rough estimate of the data volume involved is shown below).
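To get a feel for why this configuration is so expensive, here is a rough estimate of the amount of data each evaluation has to summarize, assuming evenly spaced 10-second input data (illustrative arithmetic only):

```python
# Rough data-volume estimate for a 7-day summary over 10-second data,
# re-evaluated every second (illustrative arithmetic only).
summary_window_s = 7 * 24 * 3600  # 7 days
input_interval_s = 10             # one input value every 10 seconds
trigger_interval_s = 1            # analysis scheduled to trigger every second

values_per_evaluation = summary_window_s / input_interval_s
values_per_minute = values_per_evaluation * (60 / trigger_interval_s)

print(f"{values_per_evaluation:,.0f} values summarized per evaluation")    # 60,480
print(f"{values_per_minute:,.0f} values touched per analysis per minute")  # ~3.6 million
```

Even when much of this data comes from the cache, summarizing tens of thousands of values every second, per analysis, adds up quickly.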

 

 

 

  • Just for the sake of completeness, I changed the schedule for these analyses from triggering every second to triggering every 10 seconds (still aggressive), based on the observation that the group was taking about 2.5 seconds on average to evaluate. As expected, this promptly resolved the high latency issue, as shown by the drop in maximum latency following the configuration change.

 

 

Hopefully, this information will be helpful in tracking down expensive or misconfigured analyses in your system, so you can evaluate your calculation logic and see if anything can be optimized. As a reminder, many of the features described here are implemented in the PI Server 2017 release. I should mention that this is still a work in progress: we don’t expect users to have to drill down into the service statistics to find information in this manner. Our goal is to ultimately surface performance-related metrics in a more user-friendly and intuitive manner, allowing users to filter and sort analyses by these metrics. We would also like to expose some of this information programmatically (via AFSDK or PI Web API), so you can leverage it in your own monitoring applications. Stay tuned for these enhancements in future releases.

 

With inputs from: Stephen Kwan, David Pugal and Giselle Lee
