
Introduction

Asset Analytics is a widely used feature of PI Asset Framework (PI AF) that allows users to configure calculations using PI data and quickly apply them to many assets using AF Templates. Once users have implemented the business logic and scheduled the calculations, PI Analysis Service picks up these newly created analyses and executes them based on the specified schedule. For production systems, administrators should be able to easily monitor the health and execution status of analyses. When administrators (or users) determine that analyses are not executing as expected - typically because they notice that analysis outputs are not updating - the system should provide ways to quickly troubleshoot and determine the root cause of the problem.

 

Currently (PI Analysis Service 2017 R2 or before) administrators rely on the following information for monitoring and troubleshooting PI Analysis Service issues:

 

While the information currently available in PI System Explorer (PSE) is useful for troubleshooting, the process is tedious and only applicable when the system performance has already degraded. Users have often asked for built-in support for monitoring service health, or programmatic ways of accessing analysis runtime information that can be used to develop custom applications, e.g. dashboards or reporting tools. With the upcoming PI Server 2018 SP2 Release (scheduled to be released in Q1 2019), we are adding support for querying runtime information for analyses using AFSDK. In this blog, we'll discuss this new feature and present some sample scenarios where this new capability can be useful.

 

AFAnalysisService.QueryRuntimeInformation Methods

For PI AF 2018 SP2, we have implemented two new methods to query PI Analysis Service for runtime information:

 

public IEnumerable<IList<AFAnalysisService.RuntimeFieldValue>> QueryRuntimeInformation(string queryString, string fields)
public IEnumerable<TObject> QueryRuntimeInformation<TObject>(string queryString, string fields, Func<IList<AFAnalysisService.RuntimeFieldValue>, TObject> factory)

 

These methods return a list of AFAnalysisService.RuntimeFieldValue values, where the type RuntimeFieldValue encapsulates values returned for specified fields and implements utility functions to easily convert these values to desired types (e.g. Guid, double, int, AFTime etc.). The argument queryString can be used to filter analyses by the supported fields from AFAnalysisService.RuntimeInformationFields. The argument fields is used to specify the requested fields as a space-separated list. At least one field from the following list of supported fields must be specified.

 

Field name | Value type | Description
id | Guid | Analysis GUID
name | String | Analysis name
category | String | Analysis category
description | String | Analysis description
elementName | String | Analysis target element
template | String | Analysis template, empty if none
path | String | Full path of the analysis
status | String | Analysis status from ["Running", "Stopped", "Starting", "Stopping", "Error", "Warning", "Suspended"]
statusDetail | String | Error, warning and suspended status details when available, empty otherwise
lastEvaluationStatus | String | Last evaluation status from ["Success", "Error", "Skipped"]
lastEvaluationStatusDetail | String | Last evaluation error details when available, empty otherwise
lastLag | Double | Latest reported evaluation lag, where evaluation lag is defined as the amount of time (in milliseconds) from receiving a trigger event to executing and publishing the outputs for that trigger event
averageLag | Double | Evaluation lag averaged over all evaluations since service start (see lastLag)
lastElapsed | Double | The amount of time (in milliseconds) spent evaluating the latest trigger event for the analysis
averageElapsed | Double | Elapsed time averaged over all evaluations since service start (see lastElapsed)
lastTriggerTime | DateTime | Time stamp of the last triggering event
averageTrigger | Double | Average time span (in milliseconds) between two triggering events
successCount | Int32 | Number of successful evaluations since service start
errorCount | Int32 | Number of evaluations that resulted in errors since service start
skipCount | Int32 | Number of evaluations that were skipped since service start

 

Additionally, the following directives can be used in queryString to sort results and limit the number of items returned:

 

Field name | Value type | Description
sortBy | String | Field name to sort by
sortOrder | String | "Asc" or "Desc" for ascending and descending sorting, respectively
maxCount | Int32 | Maximum number of results to return

 

The query language syntax is similar to that of AFSearch with the following notable differences:

  • All search conditions are implicitly ANDed; the AND keyword itself is not supported in the syntax.
  • An empty string always matches an empty string, even for the Name filter.
  • IN operator is supported only for strings.
  • Contextually invalid comparisons such as "Name := 10" always default to FALSE, rather than throwing an exception.
  • Time-stamps must be specified as strings.
  • Ordering and limiting the number of responses are all part of the queryString syntax (applies to sortBy, sortOrder, and maxCount).

 

Here are some example queries:

  • name := 'EfficiencyCalc*'
  • path := '*Substation1*Meter1*CurrentDraw*'
  • lastTriggerTime :< '*-1m'
  • status :in ('Running', 'Warning')
  • lastEvaluationStatus := 'Error' lastLag :> 10000

 

Here is some sample code that shows how to use these methods:

 

// Get a reference to service object
var system = new PISystems()["turing"];
var analysisService = system.AnalysisService;

// Retrieve query results as ordered fields
var results = analysisService.QueryRuntimeInformation(
    queryString: "path:'*Database1*Steam*' status:in ('Running', 'Error') sortBy:'lastLag' sortOrder:'Desc'",
    fields: "id name status lastLag lastTriggerTime");

// First result
var first = results.First();

// Convert return values to desired types using RuntimeFieldValue
Guid guid = (Guid)first[0];
Guid guid_alternative = first[0].ToObject<Guid>();
string name = first[1];
double lastLag = (double)first[3];

AFTime lastTrigger = (AFTime)first[4];
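
The generic overload can be used to project each result directly into a type of your choosing via the factory delegate. Here is a minimal sketch reusing the same query and fields as above; the AnalysisRuntimeInfo class is just an illustrative container, not part of the AF SDK:

// Illustrative container type (not part of the AF SDK)
public class AnalysisRuntimeInfo
{
    public Guid Id { get; set; }
    public string Name { get; set; }
    public string Status { get; set; }
    public double LastLag { get; set; }
    public AFTime LastTriggerTime { get; set; }
}

// Project each result into AnalysisRuntimeInfo using the factory delegate
var typedResults = analysisService.QueryRuntimeInformation<AnalysisRuntimeInfo>(
    queryString: "path:'*Database1*Steam*' status:in ('Running', 'Error') sortBy:'lastLag' sortOrder:'Desc'",
    fields: "id name status lastLag lastTriggerTime",
    factory: values => new AnalysisRuntimeInfo
    {
        Id = values[0].ToObject<Guid>(),
        Name = values[1].ToString(),
        Status = values[2].ToString(),
        LastLag = values[3].ToObject<double>(),
        LastTriggerTime = values[4].ToObject<AFTime>()
    });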

 

Sample Scenarios

Let's look at some common scenarios where this new capability can be useful for troubleshooting. We'll be listing the query parameters used and the corresponding response (displayed in the form of a table for better readability).

 

1. Find all analyses that are not running due to configuration errors

When there are configuration errors, PI Analysis Service stops running the analysis, sets its status to Error, and displays detailed error information in PSE. You can now query all analyses that are in an error state, along with the detailed error information.

queryString: status:='error'

fields: path statusDetail
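
In code, a minimal sketch (reusing the analysisService object from the earlier sample) could look like this:

// Query all analyses in error and print their paths with error details
var errored = analysisService.QueryRuntimeInformation(
    queryString: "status:='error'",
    fields: "path statusDetail");

foreach (var row in errored)
{
    Console.WriteLine($"{row[0]}: {row[1]}");
}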

 

2. Find all analyses that are not running due to specific configuration error

In some cases, you may be interested in querying all analyses with a specific configuration error. For example, all analyses with a "no output defined" error:

queryString: status:='error' statusDetail:='*No Output*'
fields: path statusDetail

 

3. Find all analyses that encountered an evaluation error on last evaluation

When evaluating an analysis, PI Analysis Service may encounter an evaluation error due to mistakes in the business logic, bad inputs, or insufficient error handling - for example, adding a bad value to a good numeric value, division by zero, or insufficient data for a summary (cases that typically result in a Calc Failed output). In such cases, PI Analysis Service continues to run the analyses.

 

You can query all analyses that have encountered an evaluation error on their last evaluation, as follows:

queryString: lastEvaluationStatus:'error'
fields: elementName name lastEvaluationStatusDetail errorCount successCount lastTriggerTime

Note that we are also requesting the successCount and errorCount fields to determine whether the evaluation error happens intermittently or whether there is a flaw in the business logic. In the example below, most analyses have never evaluated successfully, so these analyses are most likely badly configured or are not handling error conditions appropriately.
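
A small sketch that pulls these fields and flags analyses whose evaluations fail more often than they succeed might look like this:

// Query analyses whose last evaluation failed, along with their error/success counts
var failing = analysisService.QueryRuntimeInformation(
    queryString: "lastEvaluationStatus:'error'",
    fields: "elementName name lastEvaluationStatusDetail errorCount successCount");

foreach (var row in failing)
{
    int errors = row[3].ToObject<int>();
    int successes = row[4].ToObject<int>();
    if (errors > successes)
    {
        // Likely a configuration or error-handling flaw rather than an intermittent data issue
        Console.WriteLine($"{row[0]}\\{row[1]}: {row[2]} ({errors} errors, {successes} successes)");
    }
}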

 

 

There is a difference between configuration errors (discussed in scenario #1 above) and the evaluation or runtime errors discussed in this scenario. Configuration errors are errors that result in an invalid analysis configuration - e.g. syntax errors, input attributes that cannot be resolved, none of the output variables being mapped to an attribute, etc. These errors are detected when PI Analysis Service tries to parse the analysis configuration during initialization, and they result in the analysis being set to Error. Evaluation errors are encountered when an analysis is evaluated and (typically) do not result in the analysis being stopped. These errors could be related to bad input data or flawed business logic and are generally transient in nature. In some cases, these errors may even be expected and appropriately handled by the business logic. The analysis status shown in PSE (running: green, warning: orange, error: red, etc.) is used to communicate configuration errors and does not update for transient evaluation errors, so it should not be used as a dashboard for monitoring such errors.

 

4. Find all analyses that have not evaluated after a specified time

Let's say that you notice that outputs for some of your analyses have stopped updating, while all other analyses seem to be working as expected. There may be several reasons for this - e.g. the analyses are event-triggered and PI Analysis Service is no longer getting events for the inputs, or analyses from a specific template are falling behind due to performance issues, or perhaps the analyses are evaluating fine but writing outputs to the PI Data Archive is being delayed or buffered due to network issues. In such cases, it may be useful to query all analyses whose last trigger time is older than a specified time.

 

For example, let's say that for the event-triggered analyses shown below we expect all of them to evaluate at least once every 5 minutes, as we know that the analyses are configured to trigger on AF attributes that receive at least one event every 5 minutes. When looking at the output data, we notice that some of the output attributes driven by these analyses are not updating as expected (i.e. some of them have been stale for more than 5 minutes). We then randomly query data for a set of analyses from this template and notice that some of them have indeed not evaluated within the past 5 minutes (as evident from the lastTriggerTime for the last analysis; assuming the query was executed at 2018-12-26T23:55:00Z).

 

 

For further troubleshooting, we want to query all such analyses from the same template that have not evaluated in the last 5 minutes. We can do that as follows:

queryString: name:='*rate calculation*' status:in ('running', 'warning') lastTriggerTime:<'*-5m' sortBy:='lastTriggerTime' sortOrder:'Desc'
fields: path lastTriggerTime
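
The same query could be issued from code, converting the returned lastTriggerTime to an AFTime - a minimal sketch:

// Find analyses from this template that have not triggered in the last 5 minutes
var stale = analysisService.QueryRuntimeInformation(
    queryString: "name:='*rate calculation*' status:in ('running', 'warning') " +
                 "lastTriggerTime:<'*-5m' sortBy:='lastTriggerTime' sortOrder:'Desc'",
    fields: "path lastTriggerTime");

foreach (var row in stale)
{
    AFTime lastTrigger = (AFTime)row[1];
    Console.WriteLine($"{row[0]} last triggered at {lastTrigger}");
}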

 

5. Find all analyses that have skipped evaluations

When load shedding is enabled, PI Analysis Service may skip evaluating some analyses, either because the system is overloaded or because a large chunk of events is retrieved for event-triggered analyses. In such cases, you may want to query all analyses that have skipped evaluations, so that you can backfill (or recalculate) them manually.

queryString: status:='running' skipCount:>0
fields: path skipCount successCount
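
A quick sketch that lists the worst offenders first (sorting by skipCount via the sortBy directive) could look like this:

// List analyses with skipped evaluations, worst offenders first
var skipped = analysisService.QueryRuntimeInformation(
    queryString: "status:='running' skipCount:>0 sortBy:'skipCount' sortOrder:'Desc'",
    fields: "path skipCount successCount");

foreach (var row in skipped)
{
    Console.WriteLine($"{row[0]}: {row[1]} skipped, {row[2]} successful");
}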

 

6. Find most expensive analyses

As discussed in Troubleshooting PI Analysis Service Performance Issues: High Latency, in many cases, a few badly configured or expensive analyses can overload PI Analysis Service, causing skipped evaluations or high lag. You can now query analyses sorted by their average elapsed time, and look into optimizing analyses that are unexpectedly expensive.

Elapsed time is defined as the amount of time (in milliseconds) it takes to evaluate an analysis for a specific trigger event. This includes any time spent accessing input data - i.e. either reading it from the in-memory data cache or retrieving it from the data source (e.g. PI Data Archive for PI Point attributes) - and doing the calculation. For example, the elapsed time for an analysis doing a monthly summary of an input AF attribute would most likely be greater than that of an analysis performing simple data quality checks on the current value of the same AF attribute. The average elapsed time for an analysis is calculated by averaging the elapsed time reported for all evaluations of that analysis since PI Analysis Service was started.

For example, we can query the top 15 most expensive analyses (on average) as follows:

queryString: status:='running' sortBy:'averageElapsed' sortOrder:'Desc' maxCount:=15
fields: path averageElapsed averageTrigger successCount

 

It may be worthwhile to examine the configuration of these most expensive analyses and see if it can be optimized. For example, analyses that take several seconds to evaluate may be using a badly configured Table Lookup Data Reference as input, or may be doing a summary calculation over a long time range.

 

Note that a long elapsed time may not necessarily be a problem for analyses that trigger infrequently, especially when such expensive calculations are justified by the business requirements. Therefore, this metric should be evaluated relative to other values such as averageTrigger (how often the analysis is being triggered), averageLag (whether the analysis is falling behind) and skipCount (whether the analysis is skipping evaluations). In most cases, we would like to optimize expensive analyses that are also exhibiting performance-related issues. For example, we can query all analyses whose average lag and average elapsed duration are higher than some threshold values (say 5500 ms and 15 ms, respectively):

queryString: status:'running' averageLag:>5500 averageElapsed:>15 sortBy:'averageLag' sortOrder:'Desc'

fields: elementName name averageLag averageElapsed averageTrigger successCount
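
A minimal sketch running this threshold query from code (thresholds reused from above) might look like this:

// Analyses whose average lag and average elapsed time exceed the chosen thresholds
var expensive = analysisService.QueryRuntimeInformation(
    queryString: "status:'running' averageLag:>5500 averageElapsed:>15 " +
                 "sortBy:'averageLag' sortOrder:'Desc'",
    fields: "elementName name averageLag averageElapsed averageTrigger successCount");

foreach (var row in expensive)
{
    Console.WriteLine($"{row[0]}\\{row[1]}: lag={(double)row[2]} ms, elapsed={(double)row[3]} ms");
}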

 

In this example, we can see that the RemainingLifeCalculation analyses take ~860 ms on average to evaluate and are being triggered every 200 ms. These analyses are clearly overloading the system and will result in lag that keeps increasing when load shedding is disabled, or in skipped evaluations otherwise.

Average versus last elapsed time: Note that the average elapsed time is calculated over all evaluations of the analysis, while the last elapsed time refers to the amount of time it took to evaluate the latest trigger. For analyses that have been running for a long time, the average elapsed time can be heavily skewed by past evaluations. If you are looking to monitor a recent degradation (or recovery) in performance, looking at the last elapsed time (and similarly the last lag) may be more useful.

Performance Tip: For systems with a large number of disabled analyses, it is better to filter out disabled analyses (by including the status:='running' filter) when the query is only intended for runtime information such as lastTriggerTime, lastLag, etc. For disabled analyses, PI Analysis Service returns default values for such fields.

 

We hope that exposing analysis runtime information programmatically will help you make sure that PI Analysis Service is running as expected and give you ways to quickly troubleshoot performance-related issues or unexpected behavior. Note that the features described in this blog post require PI Analysis Service (and PI AF Client) 2018 SP2 or higher. As of this writing, PI Server 2018 SP2 is scheduled to be released in Q1 2019. If you are interested in trying it out before the official release, please feel free to reach out to Stephen Kwan. We'd love to hear from you, especially if you are interested in building a custom dashboard or reporting application using this API.

 

Thanks to Stephen Kwan and Giselle Lee for their inputs!

 

Regards,

Nitin Agarwal

Software Development Team Lead, Asset Analytics

Introduction

 

In a blog post earlier this year - Troubleshooting PI Analysis Service Performance Issues: High Latency - I discussed how to leverage detailed diagnostic information available for PI Analysis Service to troubleshoot performance-related issues. The evaluation statistics exposed in the Management Plugin can be used to identify analyses that are expensive and may be putting strain on the system.

 

This troubleshooting approach works in many cases, but has the following drawbacks:

  • The statistics reported are accumulated over the duration that PI Analysis Service has been running. When the service has been running for a long time, this information can be heavily skewed in favor of historical data, making it hard to troubleshoot a recent degradation in performance.
  • The process is manual and we all have better things to do than sitting in front of the computer screen, trying to make sense of this information.

 

To this end, users want out-of-the-box health monitoring, or the ability to access this information programmatically so that they can build their own monitoring or reporting application. In future releases, we plan to address both: out-of-the-box health monitoring and programmatic access to PI Analysis Service health information. In the meantime, if you are using the PI Server 2017 release or later, you may still be able to access some of this information “programmatically”. Let me show you how.

 

Tracing Analysis Evaluation Events

 

Starting with the PI Server 2016 R2 release, we use NLog for tracing messages from PI Analysis Service. If you are a .NET developer, chances are you are already familiar with NLog:

 

NLog is a free logging platform for .NET, NETSTANDARD, Xamarin, Silverlight and Windows Phone with rich log routing and management capabilities. NLog makes it easy to produce and manage high-quality logs for your application regardless of its size or complexity [source].

 

NLog is very flexible, and what makes it powerful is that there are many built-in Targets that can be used to pass log messages to other destinations. In addition to targets, NLog also lets you define flexible Logging Rules, where you define the list of targets that should be written to for each combination of source/logger name and log level. When a rule matches, log messages are directed to the target(s) in that rule. By default, the Analysis Service uses the File target, which supports features like automatic archiving, compression, etc. The file target is used for a set of predefined logging rules as shown here (you can find these in the configuration file $PIPC\Analytics\PIAnalysisProcessor.exe.config):

 

    <rules>
      <logger name="OSIsoft.AN*" levels="Fatal" writeTo="eventLog" />
      <logger name="*:Evaluation" minlevel="Info" writeTo="logfileAsync" final="true" />
      <logger name="*:Scheduling" minlevel="Info" writeTo="logfileAsync" final="true" />
      <logger name="*:Recalculation" minlevel="Info" writeTo="logfileAsync" final="true" />
      <logger name="*:DataRetrieval" minlevel="Info" writeTo="logfileAsync" final="true" />
      <logger name="*:Configuration" minlevel="Info" writeTo="logfileAsync" final="true" />
      <logger name="*:Performance" minlevel="Info" writeTo="logfileAsync" final="true" />
      <logger name="*:PerformanceEvaluation" minlevel="Info" writeTo="logfileAsync" final="true" />
      <logger name="OSIsoft.AN*" minlevel="Info" writeTo="logfileAsync" />
    </rules>

 

The Performance and PerformanceEvaluation rules were only added in the PI Server 2017 release, so the steps described here require that version or higher.

 

Of these, the relevant rule here is the one with logger name PerformanceEvaluation. When the log level is turned up all the way to Trace, the Analysis Service uses this logger to write a message for each analysis evaluation. The message written to the log file (typically $ProgramData\OSIsoft\PIAnalysisNotifications\Logs\pianalysisprocessor-log.txt) looks like this:

 

2017-12-18 10:41:07.3432|TRACE|ANPerformanceTrace:PerformanceEvaluation|Type: AnalysisEvaluated, Data: {
  "AnalysisName": "\\\\DemoServer\\Database10\\Element1\\Element1_2\\Analyses[Analysis2]",
  "AnalysisId": "8228d87c-dac1-11e7-87d0-9cb6d0dc024e",
  "GroupId": "8b81d68e-da22-4a73-8174-f715aa7c83c8",
  "TimeclassId": "603223b0-ad81-4a75-b84b-e5cfb5254edb",
  "EvaluationData": {
    "ExeTime": "2017-12-18T18:41:02Z",
    "ElapsedMilliSeconds": 0.00948147283740368,
    "LagMilliSeconds": 5343.2489000000005,
    "AnalysisCount": 1
  }
}

 

This message contains the following important information:

  • AnalysisName: Path of the analysis.
  • AnalysisId: AF Analysis Id.
  • GroupId: AF Analysis Template Id (when the analysis derives from a template)
  • ExeTime: Trigger time for the evaluation
  • ElapsedMilliSeconds: Amount of time spent (in milliseconds) in evaluating this trigger event
  • LagMilliSeconds: Lag (in milliseconds) in evaluating this trigger event, includes the evaluation wait time (default=5 seconds)

 

Enabling this trace for a production system over a long period of time is not a good idea, as this can really cause your log file to grow fast!

WebService Target for Listening to Evaluation Events

 

While you can trace analysis evaluation events just by changing the trace level in the configuration file, this is clearly not the best way to access this information, as it will result in a large log file. This is where NLog's flexibility comes in handy. NLog comes with a built-in WebService target that allows us to write each trace message to a web service endpoint. All we need to do is add a WebService target to the NLog configuration and direct the PerformanceEvaluation logger to write to it. Typically, it's not a good idea to add custom NLog configuration to the PIAnalysisProcessor.exe.config file, as this file is overwritten during upgrades. The recommended way to inject custom NLog configuration is to edit the user file Analysis.UserNLog.config, which is located in the same folder. The only thing you need to do is replace the contents of the file on your machine with the following:

 

<?xml version="1.0" encoding="utf-8"?>
<!-- Copy this file to the folder PIPC\Analytics folder -->
<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" autoReload="true" throwExceptions="false">
  <targets>
    <target xsi:type="AsyncWrapper" name="wsAsync" queueLimit="200000" fullBatchSizeWriteLimit="20" timeToSleepBetweenBatches="50" overflowAction="Discard">
      <target type='WebService'
          name='ws'
          url='http://localhost:9000/api/logmessage'
          protocol='HttpPost'
          encoding='UTF-8'   >
        <parameter name='Level' type='System.String' layout='${level}'/>
        <parameter name='Message' type='System.String' layout='${message}'/>
      </target>
    </target>
  </targets>
  <rules>
    <logger name="*:PerformanceEvaluation" levels="Trace" writeTo="wsAsync" final="true" />
  </rules>
</nlog>

 

Note that the WebService target is configured to write to the endpoint address http://localhost:9000/api/logmessage. You can change it based on the address you choose to listen on. Now all you need to do is host a web service that listens to POST messages at this address. I have some sample code posted here. The sample uses OWIN to self-host an ASP.NET Web API 2 application.
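
For reference, here is a minimal sketch of such a receiver, assuming the Microsoft.AspNet.WebApi.OwinSelfHost NuGet package and the form-encoded Level/Message parameters defined in the configuration above (the class and route names here are illustrative, not taken from the posted sample):

using System;
using System.Web.Http;
using Microsoft.Owin.Hosting;
using Owin;

public class LogMessageModel
{
    public string Level { get; set; }
    public string Message { get; set; }
}

public class LogMessageController : ApiController
{
    // Receives the form-encoded Level/Message parameters posted by the NLog WebService target
    [HttpPost]
    public IHttpActionResult Post([FromBody] LogMessageModel msg)
    {
        // Keep this handler cheap and non-blocking; hand the message off for processing elsewhere
        Console.WriteLine($"{msg.Level}: {msg.Message}");
        return Ok();
    }
}

public class Startup
{
    public void Configuration(IAppBuilder app)
    {
        var config = new HttpConfiguration();
        config.Routes.MapHttpRoute(name: "Default", routeTemplate: "api/{controller}");
        app.UseWebApi(config);
    }
}

public static class Program
{
    public static void Main()
    {
        // Listen on the same base address configured in the WebService target above
        using (WebApp.Start<Startup>("http://localhost:9000/"))
        {
            Console.WriteLine("Listening on http://localhost:9000/api/logmessage ...");
            Console.ReadLine();
        }
    }
}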

 

Enabling this tracing will have some overhead on the Analysis Service side, but it should be minimal, and it is mitigated by the fact that the WebService target used here is wrapped in an AsyncWrapper with a queue limit of 200K and an overflow action of 'Discard'. If you change these settings - for example, increase the queue size or use overflowAction=Block or Grow - and do expensive things while processing the POST request in your web service implementation, bad things will happen. So be a bit careful when changing the settings and make sure that you don't perform expensive blocking operations when handling the POST request.

If you pulled the sample code and configured everything correctly, this is what you should see:

 

 

If all of this works, you can then use the analysis evaluation events to build a simple monitoring application that can act when, say, an evaluation reports high lag or takes an unreasonably long time to finish. In future releases, we plan to add more information to this event. For example:

  • Status - Success, Failure, Warning, Skipped
  • Error information - describing why a particular evaluation resulted in Calc Failed.

 

With inputs from Stephen Kwan, David Pugal and Giselle Lee. Hope this information is useful.

Introduction

 

Asset Analytics is a feature of PI Asset Framework (PI AF) that allows you to perform calculations on your PI data using a set of built-in functions – Performance Equation, Statistical Quality Control (SQC) or Aggregation (Rollup). Some of the features that make Asset Analytics extremely useful are:

  • Ability to create calculations using a configuration-driven environment (PI System Explorer) – you don’t need to be a programming wizard to implement your business logic
  • AF templates that allow you to write calculation logic once and instantly apply it to many assets.

While Asset Analytics makes it really easy to create calculations and apply them to thousands (or millions) of assets with a few clicks, in some cases you may find yourself in a situation where it is hard to manage and keep the system running smoothly. One recommendation is to pay attention to best practices and tips & tricks when building your calculations (as discussed here).

 

Assuming you have already done that, you may occasionally still need to troubleshoot issues like:

  • High latency: excessive delays in calculations
  • Skipping evaluations
  • Excessive CPU or memory consumption, etc.

In such situations, it is useful to know what tools are at your disposal for troubleshooting performance-related issues.

 

In this blog post, we will walk through a scenario where the results of some calculations are being generated with a lot of delay (high latency) and we would like to understand the root cause. In this example, we will look at some new features added to PI Analysis Service for the upcoming PI Server 2017 release. For this release, we are adding capabilities to PI Analysis Service to keep track of the amount of time spent evaluating each analysis and performing other expensive tasks – such as retrieving input data using AF data pipes or publishing results (values written to AF attributes or new Event Frames being checked in). We can use these detailed evaluation statistics to track down the worst performing analyses, and hopefully optimize them and get the system back to running smoothly.

 

Setup

 

I imported the Asset Based PI Example Kit for Mobile Asset Performance Monitoring, and added some analyses that would cause the system to exhibit high lag for the purpose of this demonstration. We’ll talk about them later, as the whole point of this blog post is to find these bad actors.

 

Tip: By the way, if you have not already done so, you should check out some great example kits available here. I recently used some of these for testing and demos, and found them to be very useful.

 

When troubleshooting performance-related issues with Asset Analytics, you would find the following useful:

  • Performance Counters – PI Analysis Service creates several Windows performance counters that can be very useful in monitoring service health. I would highly recommend that you historize these counters (and save the data in the PI System). If you ever run into a performance issue, you can look at the historical data for these counters and identify the time periods where performance degradation may have occurred. In addition to the counters described here, I would also recommend adding:
    • Process\%Processor Time
    • Process\Private Bytes
    • Process\Virtual Bytes

These counters should be added for the PIAnalysisProcessor, PIAnalysisManager and PIRecalculationProcessor instances. For this discussion, I created the counters as shown here (see the sketch after this list for one way to read these process counters programmatically).

 

 

  • Analysis Service Statistics – PI Analysis Service exposes detailed service statistics in the Management plugin in PI System Explorer. Select Management from the navigation pane (on the left) and, once in the Management UI, right-click anywhere under Pending Operations and select View Analysis Service Statistics. Make sure that PI Analysis Service is running and that PI System Explorer is able to connect to the service.
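
Coming back to the process counters mentioned above, here is a minimal sketch (using System.Diagnostics) for spot-checking them directly; the instance names assume the default process names listed earlier, and for long-term monitoring you would still want to historize the counter data in the PI System as recommended:

// Sketch: sample the process-level counters for the analysis service processes
using System;
using System.Diagnostics;
using System.Threading;

class CounterSampler
{
    static void Main()
    {
        string[] instances = { "PIAnalysisProcessor", "PIAnalysisManager", "PIRecalculationProcessor" };
        foreach (var instance in instances)
        {
            var cpu = new PerformanceCounter("Process", "% Processor Time", instance);
            var privateBytes = new PerformanceCounter("Process", "Private Bytes", instance);
            var virtualBytes = new PerformanceCounter("Process", "Virtual Bytes", instance);

            cpu.NextValue(); // the first read of a rate counter always returns 0
            Thread.Sleep(1000);

            Console.WriteLine($"{instance}: CPU={cpu.NextValue():F1}%, " +
                $"Private={privateBytes.NextValue() / (1024 * 1024):F0} MB, " +
                $"Virtual={virtualBytes.NextValue() / (1024 * 1024):F0} MB");
        }
    }
}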

 

Troubleshooting

 

After running PI Analysis Service for a few hours, notice from the trend below that the maximum latency keeps on increasing and the current value is around 4903 seconds (the maximum latency performance counter reports the value in milliseconds).

 

 

So, what is this maximum latency anyways?

Maximum latency is the maximum value among the latest latency values reported for all analyses. Latency is calculated as the amount of time (in milliseconds) from receiving a trigger event to executing and publishing the analysis results for that trigger event. A large and growing maximum latency is a common symptom of performance issues. This value also includes the time that PI Analysis Service waits before evaluating calculations in order to make sure that data for all inputs is available. This wait time can be configured by editing the parameter CalculationWaitTimeInSeconds; the default value is 5 seconds. For more information, see the PI System Explorer topic Analysis service configuration (or the latest version).

 

Note that this is the maximum among the latest values reported for all analyses. This means it is possible that most analyses in our system are working fine (i.e. producing timely results), and maybe only a few analyses are causing the system to be unstable.

 

In order to understand the analysis service statistics, one needs to know a few things about how the calculation engine evaluates analyses at scale. Note that this is an oversimplification of some of the implementation details, as the emphasis here is to just outline the key concepts required for the discussion at hand:

  • The calculation engine groups analyses together based on their schedule, analysis template and dependency information (i.e. whether the output of an analysis is used as an input for one or more other analyses).
  • Analyses within a group are evaluated together – for the same trigger event – before moving on to evaluate the next trigger event (generated based on the configured schedule). This makes sure that analyses are evaluated in the correct time order.
  • The calculation engine is highly optimized to evaluate many groups in parallel or, wherever possible, to even split a group across many evaluation threads.
  • The calculation engine automatically handles dependencies, which means that a calculation group further down the dependency chain must wait for the upstream groups to finish evaluating. This can introduce delays in analyses that may not be expensive to evaluate – but have to wait for parent analyses to finish evaluating.
  • The calculation engine periodically retrieves data for analysis inputs using bulk calls and caches them, to make sure that analyses do not have to make expensive remote calls (RPCs) to fetch data during each evaluation. In cases when the data required for evaluations is not available in the data cache (e.g. when performing a summary calculation over a long time range), the service needs to retrieve it from the source (e.g. a remote PI Data Archive), which can be expensive.
  • In most cases, the majority of the time is spent fetching input data and publishing results (data I/O), and this is often the factor limiting scalability. So you could say that most of the time the calculations are not CPU bound but I/O bound.

 

 

With that understanding, let’s get back to the high latency problem. From looking at the performance counter data, we know that the maximum latency keeps on increasing, and we would like to know what’s causing it. Let’s see what information is available in the Analysis Service Statistics:

  • From the Management plugin, we bring up the Analysis Service Statistics window.
  • Then we navigate to the EvaluationStatistics node. In the bottom pane, you can see system-level statistics, which are very useful in getting a good idea about the overall scale of this system – in terms of the data I/O rate and calculation throughput. In addition to these, we have the following two important metrics:
    • CurrentLag: If this duration is high and keeps on increasing with time, this usually indicates that PI Analysis Service is not able to keep up and is falling behind. If that’s the case, you would want to dig deeper to find out which analyses are causing this. In this case, we clearly have a high lag (1 hour 25 minutes).
    • EvaluationSkippedCount: If this number is non-zero, it implies that PI Analysis Service has skipped some evaluations. We can look at statistics for individual groups to see which groups have skipped evaluations. Note: PI Analysis Service skips evaluations when necessary if the parameter IsLoadSheddingEnabled is set to true (enabled by default).

 

  • Following this, we expand the EvaluationStatistics node and navigate to the EvaluationStatisticsForGroups node, which lists statistics for each calculation group that has triggered at least once. The groups are listed in descending order of lag. Recall from the description above that the calculation engine evaluates analyses in groups, so it should not be a surprise that the evaluation statistics are also reported per group. As you navigate the statistics for each group, you will note that:
    • Most group nodes are listed in the format: Template=[analysis template path] Schedule=[schedule for the analysis template].
    • Some groups are listed in the format 'Template=Multiple Schedule=[schedule]'. These groups represent analyses that have dependencies – i.e. the output of one analysis is used as an input for some other analysis. You can think of these special groups as groups of groups, consisting of groups of analyses that may come from different analysis templates but all share the same schedule. All analyses belonging to such a group are evaluated together (for a specific trigger event), such that analyses from templates that are higher in the dependency graph are evaluated first, followed by analyses that depend on them, and so on. Once the entire group is evaluated for a specific trigger event, the calculation engine moves on to evaluating the next trigger event.

 


 

  • In this example, the group with title Template=Multiple Schedule=Periodic[Hertz=1] has a very high lag (1 hour 25 minutes), while all other groups have lag < 12 seconds (not visible in the picture, but trust me it’s true) - which is reasonable (keeping in mind that this value includes the default wait time of 5 seconds). From this observation, we know immediately that the analyses belonging to the group [Template=Multiple Schedule=Periodic[Hertz=1]] are the ones causing the high latency issue.
  • We can get more information about any group by expanding the corresponding node. In this case, we expand the Template=Multiple Schedule=Periodic[Hertz=1] node and select the EvaluationSummary[Template=Multiple] node, which shows useful information for this specific group (in the bottom pane). Note that these statistics have been computed since the service was started (2 hours 19 minutes in this case, as shown by the Duration property). We have the following information available:
    • Count: Number of times this group has evaluated
    • FirstTriggerTime: First trigger time evaluated
    • LastTriggerTime: Latest trigger time evaluated
    • AverageLagMilliSeconds: Average lag in milliseconds since service started
    • AverageElapsedMilliSeconds: Average amount of time spent (in milliseconds) in evaluating this group since service started
    • AverageTriggerMilliSeconds: Average time between two triggering events. We already know this value for analyses with a periodic schedule (e.g. 1000 ms for analyses scheduled to trigger every second), but it can be interesting to note for event-triggered analyses.
    • AverageAnalysisCount: Average number of analyses evaluated for the same trigger event. This is the same as the number of analyses from the template(s) for a periodic schedule, but it could differ for event-triggered analyses since not all analyses may trigger at the same time.

In this case, we can see that the group triggers every 1000 ms but takes ~2594 ms to evaluate on average. This is obviously a problem.

 

 

  • For the groups with dependencies (Template=Multiple), the ChildrenSummary node further lists information about each template, in descending order of average elapsed time. In this case, we have two analysis templates that each trigger every 1000 ms, but take about 1413 ms and 1178 ms, respectively.
    • Template: Air Inlet Pressure 90th Percentile

    • Template: Air Inlet Pressure Percentile

 

  • For each analysis template, we also show evaluation statistics for the 5 most expensive analyses (ordered by average elapsed time), which can be valuable in finding the worst performing analyses in that group.
  • Now that we know the two analysis templates that are responsible for the high lag, we can look at their configuration to see what they are doing. In this case, as you may have guessed, these happen to be the ones that I created at the beginning to cause high latency. We don’t really need to understand the calculation logic here; it's enough to know that they perform an expensive 7-day summary (on 10-second input data) and are scheduled to evaluate every second. This is obviously a bit too aggressive for a weekly summary and, in most cases, would be the result of a misconfiguration.

 

 

 

  • Just for the sake of completeness, I changed the schedule for these analyses (based on the numbers above) from triggering every second to triggering every 10 seconds (still aggressive), based on the observation that the group was taking about 2.5 seconds on average to evaluate. As expected, this promptly solved the high latency issue, as shown by the drop in maximum latency following the configuration change.

 

 

Hopefully, this information will be helpful in tracking down expensive or misconfigured analyses in your system, so you can evaluate your calculation logic and see if anything can be optimized. As a reminder, many of the features described here are implemented in the PI Server 2017 release. I should mention that this is still a work in progress. We don’t expect users to have to drill down into the service statistics to find information in this manner. Our goal is to ultimately surface performance-related metrics in a more user-friendly and intuitive way, allowing users to filter and sort analyses by these metrics. We would also like to expose some of this information programmatically (via AFSDK or PI Web API), so you can leverage it in your own monitoring applications. Stay tuned for these enhancements in future releases.

 

With inputs from: Stephen Kwan, David Pugal and Giselle Lee