I have pondered this topic for a while now and thought it might make a good discussion topic. A lot of architecture discussions I have been involved in go like this: “so we PI2PI the local data to the enterprise level, create BI views and then bolt on whatever big data analytics”. These must also be the three phrases most likely to get you through your next system architect interview (in any permutation, too).

But what does that really mean?

When you think about it, it goes back to the difference between the time series that sensors produce and the ones that data applications expect. To make a long story short, sensors create irregularly spaced time series, with unequal intervals between points in time. In PI this is mostly caused by the compression algorithm. Data applications, on the other hand, expect table-like structures where data are evenly spaced on a regular interval (I call them SQL data ....).

That’s where interpolation comes in: you can transform any irregularly spaced time series into a regularly spaced one by interpolating at fixed intervals.
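As a minimal sketch of that transformation (the timestamps and values below are invented for illustration), NumPy's linear interpolation can resample an unevenly spaced series onto a regular grid:

```python
import numpy as np

# Hypothetical sensor readings with uneven gaps between timestamps,
# e.g. the result of compression/exception reporting.
t_irregular = np.array([0.0, 4.7, 9.1, 16.3, 20.0, 27.5, 30.0])  # seconds
values = np.array([10.2, 10.5, 10.4, 11.0, 10.8, 11.2, 11.1])

# Regular 5-second grid covering the same span.
t_regular = np.arange(0.0, 31.0, 5.0)  # 0, 5, 10, ..., 30

# Linear interpolation: irregular series in, evenly spaced series out.
resampled = np.interp(t_regular, t_irregular, values)
print(resampled.round(2))
```

Each output point is a linear blend of the two raw samples that bracket it, which is essentially what PI's interpolated data calls return.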

So where is the problem?

Most data analytics are low-frequency applications; the time scale might be minutes or even hours. Sensor data, on the other hand, are high frequency - let’s say every second or every couple of seconds. So when you apply data analytics you are performing two operations:

- downsampling
- and interpolation.

Just as an example, in batch analysis you might sample 100 data points from a total batch duration of a couple of days. That is a data point every ~30 min! Yet you are acquiring samples at maybe a 5-second rate and then using only one value every 30 min.
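A quick back-of-the-envelope check of those numbers (assuming a two-day batch, as in the example above):

```python
# Assumed: a 2-day batch sampled every 5 s, reduced to 100 analysis points.
batch_minutes = 2 * 24 * 60              # 2880 min total
points = 100
spacing_min = batch_minutes // points    # 28 -> roughly one point every 30 min
total_samples = batch_minutes * 60 // 5  # 34560 raw 5-second samples
samples_per_point = total_samples // points
print(spacing_min, samples_per_point)    # 28 345
```

In other words, roughly 345 raw samples sit behind every retained analysis point, and naive interpolation throws all but one of them away.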

So why even sample at a 5-second rate if you discard most of the samples?

A better use of the data is to take advantage of the larger sample count by averaging the signal. With that approach, the standard deviation (i.e. the noise) improves by a factor of 1/sqrt(n). In the above example - 5 sec vs. 30 min, so n = 360 - that is a factor of almost 20! In PI this can be accomplished with an exponential moving average in AF Analysis.
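A small simulation illustrates the 1/sqrt(n) point; the constant true value and noise level here are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed: 30 min of 5-second samples of a constant true value plus noise.
n = 360
true_value = 100.0
noise_sd = 2.0
signal = true_value + rng.normal(0.0, noise_sd, size=n)

# Picking one interpolated point keeps the full noise_sd; averaging all
# n samples shrinks the standard deviation by a factor of sqrt(n).
improvement = np.sqrt(n)
print(round(improvement, 1))    # 19.0 -> the "factor of almost 20"
print(round(signal.mean(), 2))  # close to 100.0, well within noise_sd/sqrt(n)
```

The mean of the 360 samples lands far closer to the true value than any single 5-second sample would.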

IMO, interpolating at low frequency is really not a good idea, because it discards a large fraction of data that could be used to improve data quality. It is better to create moving averages and interpolate on those, improving the signal-to-noise ratio (SNR) and with it the model quality. Even short moving averages improve the SNR and benefit the model. One thing to keep in mind is that the model standard deviation depends on the data standard deviation - in multivariate statistics the noise of the prediction is at best twice that of the raw data. So every improvement on the raw-data side helps predictability on the model end.
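A sketch of that recipe, using a simple moving average as a stand-in for the exponential moving average in AF Analysis; the sine signal and noise level are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# One hour of noisy 5-second samples of a slow sine (invented signal).
t = np.arange(0.0, 3600.0, 5.0)
clean = np.sin(t / 600.0)
signal = clean + rng.normal(0.0, 0.5, size=t.size)

# 5-minute simple moving average (60 samples) to raise the SNR first.
window = 60
smoothed = np.convolve(signal, np.ones(window) / window, mode="same")

# Interpolate the *smoothed* series, not the raw one, onto a 30-min grid.
t_low = np.arange(0.0, 3601.0, 1800.0)
low_freq = np.interp(t_low, t, smoothed)

# Away from the edges, residual noise drops roughly by 1/sqrt(window).
inner = slice(window, -window)
raw_resid = (signal - clean)[inner].std()
smooth_resid = (smoothed - clean)[inner].std()
print(raw_resid > 3 * smooth_resid)  # True: the averaged series is much quieter
```

Each 30-minute point now carries information from hundreds of raw samples instead of the two that bracket the interpolation timestamp.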

Holger,

Great article, and I couldn't agree more. I have a customer who is falling into this very trap, whom I am trying to convince to take this approach of averaging the data to improve the quality of the data set presented to the analytics layer. They'll be reading this article soon to see that I'm not the only one who thinks this way!

Cheers,

John