Data Quality, now there's an easy topic to tackle...obviously I'm joking.
The PI Interface for OPC DA has some performance counters that will give you interface-instance-specific information on the number of points in error, or stale for 10/30/60/240 minutes. The information is quite basic, but it will give you an indication of the state of that particular data flow. Of course, based on that basic information you would need to delve deeper to see what the specific issues are.
You could use Abacus (Asset Based Analytics is its lesser-known name) for other basic quality-detection algorithms; it is fairly straightforward, as it uses the PE syntax. Have you already looked at Abacus? What this means is that you can look at data quality in terms of the equipment supplying the data rather than (or perhaps in addition to) the Interface instance providing the data. For example, say you have 20 pumps, all of which have their data provided by a single Interface instance. The "Points In Error" count suddenly increases on that interface, but without looking through the points you have no indication whether it is one specific pump or many pumps that have gone into error. Spin it around and roll up a count of the attributes in error per pump, and you can notify which pump(s) suddenly have a quality issue with their data. That gives you an initial pointer for investigating the quality issues without having to dig manually, which is usually an expensive process, by which time your users have already noticed the issue too.
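To make the rollup idea concrete, here is a minimal sketch of counting attributes in error per pump. The pump names and bad/good statuses are invented for illustration; in a real deployment the statuses would come from Abacus/AF attributes rather than a hard-coded list.

```python
# Hypothetical sketch: roll up "attribute in error" counts per pump so an
# interface-wide "Points In Error" spike can be traced to specific equipment.
from collections import Counter

def pumps_in_error(attribute_statuses):
    """attribute_statuses: list of (pump_name, is_bad) tuples."""
    counts = Counter()
    for pump, is_bad in attribute_statuses:
        if is_bad:
            counts[pump] += 1
    return counts

# Invented sample data: Pump03 has two attributes in error, Pump01 has one.
statuses = [
    ("Pump01", False), ("Pump01", True),
    ("Pump02", False), ("Pump02", False),
    ("Pump03", True), ("Pump03", True),
]
print(pumps_in_error(statuses))  # Counter({'Pump03': 2, 'Pump01': 1})
```

A notification rule could then fire on any pump whose count rises above zero, pointing straight at the equipment to investigate.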
To keep this week's trend going, a simple question deserves a complicated answer! ;-)
Statistics on the data series can indeed be generated, as Rhys stated. I typically use AF Element Templates to convert the 'raw' PI tag value into something that is usable in your AF model; this gives you a separation point between the raw data and the data available for use in a given application. You may need to do UOM conversions, normalisations against reference levels, etc., but this is also the place where I would calculate my raw data quality statistics.
You can use the raw data statistics to determine the 'health' of a data series based on your expectations for that specific point. Combining all the statistics on an individual point can give you a quality indicator for that point, and maybe even a confidence level percentage.
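One possible shape for such a confidence percentage is sketched below. The statistics chosen (fraction of time stale, fraction of bad-status values, fraction out of expected limits) and the weights are my own assumptions, not a defined PI System formula.

```python
# Illustrative only: combine a few raw-data statistics into one confidence
# percentage for a single point. Weights are arbitrary assumptions.
def point_confidence(stale_frac, bad_frac, out_of_limits_frac):
    penalty = 0.5 * stale_frac + 0.3 * bad_frac + 0.2 * out_of_limits_frac
    return round(100.0 * (1.0 - min(penalty, 1.0)), 1)

print(point_confidence(0.0, 0.0, 0.0))  # 100.0 -> healthy point
print(point_confidence(0.2, 0.1, 0.0))  # 87.0  -> mostly healthy
```

The exact weighting would be tuned per point type based on your expectations for that signal.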
I've played with the concept of rolling up the confidence levels of the individual points used in calculations and reports, to determine a confidence level for a single calculation or report. In the real world, getting your weekly report a week late due to the time required to resolve data issues is not an option; you have to make a decision now. So indicating that, for example, 86% of the data in the report is correct can help in making the right decision at the time you need to make a call based on the data you have.
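The rollup could be as simple as a weighted average, as in this sketch. The weights and confidence values are made up; in practice each point's weight would reflect how much it contributes to the calculation or report.

```python
# Sketch: weight each point's confidence by its contribution to the report,
# producing a single figure such as "85% of this report's data is believed
# correct". All numbers below are illustrative.
def report_confidence(points):
    """points: list of (confidence_pct, weight) pairs."""
    total_weight = sum(w for _, w in points)
    if total_weight == 0:
        return 0.0
    return round(sum(c * w for c, w in points) / total_weight, 1)

inputs = [(95.0, 2.0), (80.0, 1.0), (70.0, 1.0)]
print(report_confidence(inputs))  # 85.0
```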
For some processes, validated, or 100% correct and complete, data is required. For these cases, a data quality process is typically set up to copy and enhance the raw data into a validated data stream that is applicable for the given usage.
Yes, data quality is a very nice subject. Quite easy to tackle! As long as everybody agrees with me.. ;-)
I won’t let the side down with a short answer.
While there are a lot of metrics regarding the health of an interface, these don't help determine the health of a single data stream. But there are ways you can do this.
The answer to the question about built-in functions is that there aren't any specific to data quality, but the primitives are there to build a solution.
The first question is whether this is real-time or post analysis. For real-time, take a look at the PI for Stream Insight presentations. I don't believe that Stream Insight is a long-term option, it appears to have died on the vine, but some of the techniques are usable against, say, the AF SDK. You can also do a combination of real-time and post analysis.
Detecting whether a value has saturated or frozen is actually quite difficult, at least separating a frozen value from a saturated one (to be honest, in our solution we don't try to differentiate). One suggestion I do have is to use PI compression to detect this; IMNHO this is a lightweight method. Basically, all you need to do is look at the number of archive records for the point over a window. If you have only 3 values archived over an 8-hour period (compmax plays a role here), then it is pretty certain the value is not updating (we normally use a smaller window, say 10 minutes). If you are using a Float16 point, values that go over or under range are marked with a system state; however, I would recommend against Float16 and would use Float32 instead, even though over/under-range values are then not detected.
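The archive-count check above can be sketched as follows. Timestamps here are plain floats (seconds) for illustration; in a real system they would come from the PI archive via, say, the AF SDK or PI Web API, and the event threshold would be tuned against your compression settings.

```python
# Lightweight frozen/stale check: count archive events inside a window and
# flag the point if there are too few. Window size and threshold are
# assumptions to be tuned against the point's compression settings.
def looks_frozen(event_timestamps, window_start, window_end, min_events=2):
    in_window = [t for t in event_timestamps if window_start <= t <= window_end]
    return len(in_window) < min_events

# A healthy point updating every ~60 s across a 10-minute window:
healthy = [i * 60.0 for i in range(11)]
print(looks_frozen(healthy, 0.0, 600.0))  # False

# A frozen point: a single archive event in the same window.
frozen = [0.0]
print(looks_frozen(frozen, 0.0, 600.0))  # True
```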
A word on the BadVal function: it will return true if the value is any of the system states.
The PI OPC interface does provide a feature to log the OPC quality as a separate data stream. Depending on your OPC server implementation, this can be useful for data quality. If memory serves me correctly, the first 8 bits are the standard quality information and the second 8 bits are the vendor-specific quality information (it could be the other way around). In my experience, vendors don't do a great job here, but check with your OPC vendor. I would also suggest reading the PI OPC interface manual to see how OPC qualities are handled in terms of the system digital set.
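For reference, the OPC DA specification puts the standard quality in the low byte (quality in bits 7-6, substatus in bits 5-2, limit in bits 1-0) and vendor-specific information in the high byte, which matches the "8 bits plus 8 bits" split described above. A small decoding sketch:

```python
# Decode the standard fields of an OPC DA quality word (per the OPC DA
# spec: low byte = quality/substatus/limit, high byte = vendor-specific).
QUALITY_NAMES = {0: "Bad", 1: "Uncertain", 2: "N/A", 3: "Good"}

def decode_opc_quality(word):
    return {
        "quality": QUALITY_NAMES[(word >> 6) & 0x3],
        "substatus": (word >> 2) & 0xF,
        "limit": word & 0x3,
        "vendor": (word >> 8) & 0xFF,
    }

print(decode_opc_quality(0x00C0))  # quality 'Good', substatus 0, limit 0
print(decode_opc_quality(0x0040))  # quality 'Uncertain'
```

How these qualities map onto PI digital states is interface-specific, so the PI OPC interface manual remains the authority there.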
Detecting outliers is a statistical operation. You could use an F-test or t-test for this; there are other techniques, but these are pretty simple to implement.
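As one simple example in that spirit, here is a z-score check that flags values far from the sample mean in units of the sample standard deviation (Grubbs' test is the t-distribution-based refinement of the same idea). The threshold and data are illustrative only.

```python
# Simple z-score outlier detection: flag values more than `threshold`
# sample standard deviations from the sample mean. Threshold is arbitrary.
import statistics

def outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

data = [10.1, 9.9, 10.0, 10.2, 9.8, 55.0]
print(outliers(data))  # [55.0]
```

Note that with small samples a large outlier inflates the standard deviation and can mask itself, which is exactly why the more formal tests mentioned above exist.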
I created a summary for this topic: Overview about Data Quality Handling options in PI System
Please feel free to add your comments, questions, implementations, challenges, etc. in the discussion.