Asset Analytics is a widely used feature of PI Asset Framework (PI AF) that allows users to configure calculations on PI data and quickly apply them to many assets using AF templates. Once users have implemented the business logic and scheduled the calculations, PI Analysis Service picks up the newly created analyses and executes them on the specified schedule. For production systems, administrators should be able to easily monitor the health and execution status of analyses. When administrators (or users) determine that analyses are not executing as expected - typically because they notice that analysis outputs are not updating - the system should provide ways to quickly troubleshoot and determine the root cause of the problem.
Currently (PI Analysis Service 2017 R2 or earlier), administrators rely on the following information for monitoring and troubleshooting PI Analysis Service issues:
- Examining analysis status (errors or warnings) in PI System Explorer (PSE)
- Performance Counters
- Evaluation Statistics in Management Plugin in PSE [see Troubleshooting PI Analysis Service Performance Issues: High Latency]
- Tracing [see Accessing PI Analysis Service Evaluation Events Programmatically]
While the information currently available in PI System Explorer (PSE) is useful for troubleshooting, the process is tedious and only applicable once system performance has already degraded. Users have often asked for built-in support for monitoring service health, or for programmatic ways of accessing analysis runtime information that can be used to develop custom applications, e.g. dashboards or reporting tools. With the upcoming PI Server 2018 SP2 release (scheduled for Q1 2019), we are adding support for querying runtime information for analyses using the AF SDK. In this blog, we'll discuss this new feature and present some sample scenarios where it can be useful.
For PI AF 2018 SP2, we have implemented two new methods to query PI Analysis Service for runtime information:
```csharp
public IEnumerable<IList<AFAnalysisService.RuntimeFieldValue>> QueryRuntimeInformation(
    string queryString, string fields)

public IEnumerable<TObject> QueryRuntimeInformation<TObject>(
    string queryString, string fields,
    Func<IList<AFAnalysisService.RuntimeFieldValue>, TObject> factory)
```
These methods return a list of AFAnalysisService.RuntimeFieldValue values, where the type RuntimeFieldValue encapsulates values returned for specified fields and implements utility functions to easily convert these values to desired types (e.g. Guid, double, int, AFTime etc.). The argument queryString can be used to filter analyses by the supported fields from AFAnalysisService.RuntimeInformationFields. The argument fields is used to specify the requested fields as a space-separated list. At least one field from the following list of supported fields must be specified.
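As an illustration, the generic overload can project each result into a strongly typed object through the factory delegate. Below is a minimal sketch; the AnalysisStats class and the server name "turing" are hypothetical, and each row passed to the factory is an IList<AFAnalysisService.RuntimeFieldValue> whose indices follow the order of names in the fields argument:

```csharp
using System;
using System.Collections.Generic;
using OSIsoft.AF;
using OSIsoft.AF.Time;

// Hypothetical container for the requested fields
public class AnalysisStats
{
    public string Path;
    public double LastLag;
    public AFTime LastTriggerTime;
}

// Connect to the (placeholder) AF server and get the analysis service
var analysisService = new PISystems()["turing"].AnalysisService;

// Project each result row into an AnalysisStats instance; row[i]
// corresponds to the i-th name in the fields argument
IEnumerable<AnalysisStats> stats = analysisService.QueryRuntimeInformation(
    queryString: "status:='running'",
    fields: "path lastLag lastTriggerTime",
    factory: row => new AnalysisStats
    {
        Path = row[0],              // conversion to string
        LastLag = (double)row[1],
        LastTriggerTime = (AFTime)row[2]
    });
```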
| Field name | Value type | Description |
| --- | --- | --- |
| elementName | String | Analysis target element |
| template | String | Analysis template; empty if none |
| path | String | Full path of the analysis |
| status | String | Analysis status from ["Running", "Stopped", "Starting", "Stopping", "Error", "Warning", "Suspended"] |
| statusDetail | String | Error, warning, and suspended status details when available; empty otherwise |
| lastEvaluationStatus | String | Last evaluation status from ["Success", "Error", "Skipped"] |
| lastEvaluationStatusDetail | String | Last evaluation error details when available; empty otherwise |
| lastLag | Double | Latest reported evaluation lag, where evaluation lag is defined as the amount of time (in milliseconds) from receiving a trigger event to executing and publishing the outputs for that trigger event |
| averageLag | Double | Evaluation lag averaged over all evaluations since service start (see lastLag) |
| lastElapsed | Double | The amount of time (in milliseconds) spent evaluating the latest trigger event for the analysis |
| averageElapsed | Double | Elapsed time averaged over all evaluations since service start (see lastElapsed) |
| lastTriggerTime | DateTime | Time stamp of the last triggering event |
| averageTrigger | Double | Average time span (in milliseconds) between two triggering events |
| successCount | Int32 | Number of successful evaluations since service start |
| errorCount | Int32 | Number of evaluations that resulted in errors since service start |
| skipCount | Int32 | Number of evaluations that were skipped since service start |
Additionally, the following directives can be used in queryString to sort results and limit the number of items returned:
| Field name | Value type | Description |
| --- | --- | --- |
| sortBy | String | Field name to sort by |
| sortOrder | String | "Asc" or "Desc" for ascending and descending sorting, respectively |
| maxCount | Int32 | Maximum number of results to return |
The query language syntax is similar to that of AFSearch with the following notable differences:
- All the search conditions are implicitly AND. The AND keyword itself is not supported in the syntax.
- An empty string always matches an empty string, even for the Name filter.
- IN operator is supported only for strings.
- Contextually invalid comparisons such as "Name := 10" always default to FALSE, rather than throwing an exception.
- Time-stamps must be specified as strings.
- Ordering and limiting the number of results are part of the queryString syntax (via the sortBy, sortOrder, and maxCount directives).
Here are some example queries:
- name := 'EfficiencyCalc*'
- path := '*Substation1*Meter1*CurrentDraw*'
- lastTriggerTime :< '*-1m'
- status :in ('Running', 'Warning')
- lastEvaluationStatus := 'Error' lastLag :> 10000
Here is some sample code that shows how to use these methods:
```csharp
// Get a reference to the service object
var system = new PISystems()["turing"];
var analysisService = system.AnalysisService;

// Retrieve query results as ordered fields
var results = analysisService.QueryRuntimeInformation(
    queryString: "path:'*Database1*Steam*' status:in ('Running', 'Error') " +
                 "sortBy:'lastLag' sortOrder:'Desc'",
    fields: "id name status lastLag lastTriggerTime");

// First result (one RuntimeFieldValue per requested field, in order)
var first = results.First();

// Convert returned values to desired types using RuntimeFieldValue
Guid guid = (Guid)first[0];
Guid guid_alternative = first[0].ToObject<Guid>();
string name = first[1];
double lastLag = (double)first[3];
AFTime lastTrigger = (AFTime)first[4];
```
Let's look at some common scenarios where this new capability can be useful for troubleshooting. For each scenario, we'll list the query parameters used and the corresponding response (displayed as a table for readability).
1. Find all analyses that are not running due to configuration errors
When there are configuration errors, PI Analysis Service stops running the affected analyses, sets their status to Error, and displays detailed error information in PSE. You can now query all analyses that are in an error state, along with their detailed error information.
fields: path statusDetail
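In code, this scenario might look like the following sketch (the query string status:='error' is inferred from the next scenario, and "turing" is a placeholder server name):

```csharp
using System;
using OSIsoft.AF;

var analysisService = new PISystems()["turing"].AnalysisService;

// All analyses stopped due to configuration errors, with error details
var errored = analysisService.QueryRuntimeInformation(
    queryString: "status:='error'",
    fields: "path statusDetail");

foreach (var row in errored)
    Console.WriteLine($"{row[0]}: {row[1]}");
```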
2. Find all analyses that are not running due to specific configuration error
In some cases, you may be interested in querying all analyses with a specific configuration error - for example, all analyses with the error "No Output defined":
queryString: status:='error' statusDetail:='*No Output*'
fields: path statusDetail
3. Find all analyses that encountered an evaluation error on last evaluation
When evaluating an analysis, PI Analysis Service may encounter an evaluation error due to mistakes in the business logic, bad inputs, or insufficient error handling - for example, adding a bad value to a good numeric value, division by zero, or insufficient data for a summary. (These are cases that typically result in a Calc Failed output.) In such cases, PI Analysis Service continues to run the analyses.
You can query all analyses that have encountered an evaluation error on their last evaluation, as follows:
fields: elementName name lastEvaluationStatusDetail errorCount successCount lastTriggerTime
Note that we are also requesting the successCount and errorCount fields to determine whether the evaluation error happens intermittently or whether there is a flaw in the business logic. In the example below, most analyses have never evaluated successfully, so most likely these analyses are badly configured or are not handling error conditions appropriately.
There is a difference between configuration errors (discussed in scenario #1 above) and the evaluation or runtime errors discussed in this scenario. Configuration errors are errors that make the analysis configuration invalid - e.g. syntax errors, input attributes that cannot be resolved, or no output variable mapped to an attribute. They are detected when PI Analysis Service parses the analysis configuration during initialization, and they result in the analysis being set to error. Evaluation errors, in contrast, are encountered when an analysis is evaluated and (typically) do not result in the analysis being stopped. They are usually related to bad input data or flawed business logic and are generally transient in nature; in some cases, they may even be expected and appropriately handled by the business logic. The analysis status shown in PSE (running: green, warning: orange, error: red, etc.) communicates configuration errors and does not update for transient evaluation errors, so it should not be used as a dashboard for monitoring such errors.
4. Find all analyses that have not evaluated after a specified time
Let's say you notice that outputs for some of your analyses have stopped updating, while all other analyses seem to be working as expected. There may be several reasons for this - e.g. the analyses are event-triggered and PI Analysis Service is no longer receiving events for their inputs, analyses from a specific template are falling behind due to performance issues, or the analyses are evaluating fine but writing outputs to the PI Data Archive is being delayed or buffered due to network issues. In such cases, it may be useful to query all analyses that have not evaluated since a specified time.
For example, let's say that for the event-triggered analyses shown below we expect all of them to evaluate at least once every 5 minutes, since we know the analyses are configured to trigger on AF attributes that receive at least one event every 5 minutes. Looking at output data, we notice that some of the output attributes driven by these analyses are not updating as expected (i.e. some have been stale for more than 5 minutes). We then query data for a random set of analyses from this template and notice that some of them have indeed not evaluated within the past 5 minutes (as evident from the lastTriggerTime of the last analysis, assuming the query was executed at 2018-12-26T23:55:00Z).
For further troubleshooting, we want to query all analyses from the same template that have not evaluated in the last 5 minutes. We can do that as follows:
queryString: name:='*rate calculation*' status:in ('running', 'warning') lastTriggerTime:<'*-5m' sortBy:'lastTriggerTime' sortOrder:'Desc'
fields: path lastTriggerTime
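Issued through the AF SDK, this query might look like the following sketch ("turing" and the template name are placeholders from the example above):

```csharp
using System;
using OSIsoft.AF;
using OSIsoft.AF.Time;

var analysisService = new PISystems()["turing"].AnalysisService;

// Running (or warning) analyses matching the template name that have
// not triggered in the last 5 minutes, most recently triggered first
var stale = analysisService.QueryRuntimeInformation(
    queryString: "name:='*rate calculation*' status:in ('running', 'warning') " +
                 "lastTriggerTime:<'*-5m' sortBy:'lastTriggerTime' sortOrder:'Desc'",
    fields: "path lastTriggerTime");

foreach (var row in stale)
    Console.WriteLine($"{row[0]} last triggered at {(AFTime)row[1]}");
```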
5. Find all analyses that have skipped evaluations
When load shedding is enabled, PI Analysis Service may skip evaluating some analyses, either because the system is overloaded or because a large chunk of events was retrieved for event-triggered analyses. In such cases, you may want to query all analyses that have skipped evaluations, so that you can backfill (or recalculate) them manually.
queryString: status:='running' skipCount:>0
fields: path skipCount successCount
6. Find most expensive analyses
As discussed in Troubleshooting PI Analysis Service Performance Issues: High Latency, in many cases, a few badly configured or expensive analyses can overload PI Analysis Service, causing skipped evaluations or high lag. You can now query analyses sorted by their average elapsed time, and look into optimizing analyses that are unexpectedly expensive.
Elapsed time is defined as the amount of time (in milliseconds) it takes to evaluate an analysis for a specific trigger event. This includes any time spent accessing input data - i.e. reading it from the in-memory data cache or retrieving it from the data source (e.g. the PI Data Archive for PI point attributes) - as well as performing the calculation. For example, the elapsed time for an analysis computing a monthly summary of an input AF attribute would most likely be greater than that of an analysis performing simple data quality checks on the current value of the same attribute. Average elapsed time for an analysis is calculated by averaging the elapsed time reported for all of its evaluations since PI Analysis Service was started.
For example, we can query the top 15 most expensive analyses (on average) as follows:
queryString: status:='running' sortBy:'averageElapsed' sortOrder:'Desc' maxCount:=15
fields: path averageElapsed averageTrigger successCount
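As a sketch, the same query issued through the AF SDK ("turing" is a placeholder server name):

```csharp
using System;
using OSIsoft.AF;

var analysisService = new PISystems()["turing"].AnalysisService;

// Top 15 running analyses by average elapsed evaluation time
var expensive = analysisService.QueryRuntimeInformation(
    queryString: "status:='running' sortBy:'averageElapsed' sortOrder:'Desc' maxCount:=15",
    fields: "path averageElapsed averageTrigger successCount");

foreach (var row in expensive)
    Console.WriteLine($"{row[0]}: {(double)row[1]:F1} ms average over {(int)row[3]} evaluations");
```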
It may be worthwhile to examine configuration for these most expensive analyses and see if we need to optimize configuration. For example, analyses that are taking several seconds to evaluate may be using some badly configured Table Lookup Data Reference as input, or may be doing a summary calculation over a long time range.
Note that a long elapsed time is not necessarily a problem for analyses that trigger infrequently, especially when such expensive calculations are justified by business requirements. Therefore, this metric should be evaluated relative to other values such as averageTrigger (how often the analysis is triggered), averageLag (whether the analysis is falling behind), and skipCount (whether the analysis is skipping evaluations). In most cases, we want to optimize expensive analyses that are also exhibiting performance-related issues. For example, we can query all analyses whose average lag and average elapsed time are higher than some threshold values (say 5500 ms and 15 ms, respectively):
queryString: status:='running' averageLag:>5500 averageElapsed:>15 sortBy:'averageLag' sortOrder:'Desc'
fields: elementName name averageLag averageElapsed averageTrigger successCount
In this example, we can see that the RemainingLifeCalculation analyses take ~860 ms on average to evaluate and are triggered every 200 ms. These analyses are clearly overloading the system and will result in ever-increasing lag when load shedding is disabled, or in skipped evaluations otherwise.
Average versus last elapsed time: Note that average elapsed time is calculated over all evaluations of an analysis, while last elapsed time refers to the amount of time it took to evaluate the latest trigger event. For analyses that have been running for a long time, the average elapsed time can be heavily skewed by past evaluations. If you want to monitor a recent degradation (or recovery) in performance, looking at the last elapsed time (and likewise last lag) may be more useful.
Performance Tip: For systems with a large number of disabled analyses, it is better to filter out disabled analyses (by including the status:='running' filter) when the query is only intended for runtime information such as lastTriggerTime, lastLag, etc. For disabled analyses, PI Analysis Service returns default values for such fields.
We hope that exposing analysis runtime information programmatically will help you make sure PI Analysis Service is running as expected and provide ways to quickly troubleshoot performance-related issues or unexpected behavior. Note that the features described in this blog post require PI Analysis Service (and PI AF Client) 2018 SP2 or later. As of this writing, PI Server 2018 SP2 is scheduled to be released in Q1 2019. If you are interested in trying it out before the official release, please feel free to reach out to Stephen Kwan. We'd love to hear from you, especially if you are interested in building a custom dashboard or reporting application using this API.
Software Development Team Lead, Asset Analytics