Dealing with Noise and Outliers

Blog Post created by ernstamort on Oct 5, 2018

The first time I read the work of Dacorogna and Mueller on operators for unevenly spaced time series in finance, I thought it should also have broad applications in manufacturing data analysis. The underlying idea is to build a family of common operators, such as moving average, standard deviation and deviation, on the basis of the exponential moving average. The benefit is that you don't have to store a window of historical data in order to calculate the next estimate. With this approach you only need to store the current value and, in some cases, the previous value.

The calculations are then, of course, much faster than with the traditional approach, and you avoid the network latency of going back to the server to request historical data. An additional benefit is that the calculation results are more continuous across regime changes. There are also some drawbacks: for example, the algorithm needs a build-up time, so the calculations have to be preconditioned.
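To make the "no stored window" point concrete, here is a minimal Python sketch of an exponential moving average for unevenly spaced samples. It assumes the simplest update rule, in which the old estimate is weighted by exp(-dt/tau); the class name `IrregularEma` and the parameter `tau` are illustrative and are not taken from the Advanced Analytics Engine.

```python
import math


class IrregularEma:
    """Exponential moving average for unevenly spaced samples.

    Only the last timestamp and the last estimate are stored,
    so no window of historical data is needed.
    """

    def __init__(self, tau):
        if tau <= 0:
            raise ValueError("tau must be positive")
        self.tau = tau        # time constant, same units as the timestamps
        self.last_t = None    # timestamp of the previous sample
        self.value = None     # current estimate

    def update(self, t, x):
        """Absorb the sample (t, x) and return the new estimate."""
        if self.value is None:
            # The first sample initialises the state; the build-up
            # period starts here.
            self.last_t, self.value = t, x
            return self.value
        dt = t - self.last_t
        if dt < 0:
            raise ValueError("timestamps must not decrease")
        # The weight of the old estimate decays with the elapsed time,
        # so long gaps between samples are handled naturally.
        w = math.exp(-dt / self.tau)
        self.value = w * self.value + (1.0 - w) * x
        self.last_t = t
        return self.value
```

Because the first sample seeds the estimate, the early outputs are dominated by that one value. This is the build-up time mentioned above, and it lasts a few multiples of `tau`.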

We have included a whole range of continuous operators in the Advanced Analytics Engine, but they can of course also be implemented in a standard Custom Data Reference:

EMA = exponential moving average (similar to the moving average described on Wikipedia)

CMa = Continuous moving average

CMsd = Continuous moving standard deviation

and others. The benefit of these operators is that they can easily be combined to create new operators. The Z-score, for example, is really useful for identifying outliers:
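Assuming the standard definition, with the continuous operators standing in for the mean and the standard deviation, the score of a new value x(t) can be written as follows (the symbol τ for the operator time range is notation introduced here):

$$
Z(t) = \frac{x(t) - \mathrm{CMa}[\tau; x](t)}{\mathrm{CMsd}[\tau; x](t)}
$$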

A look at the normal distribution curve shows how it works:

If abs(Z-score) > 3, you can suspect that the value is an outlier. The amazing part of these operators is that the score can be computed in real time, and since the calculation is so fast, you can perform outlier detection for every new point at high sampling rates.
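As a quick check on the threshold of 3, the following Python snippet computes how much of a normal distribution lies beyond a given number of standard deviations. It assumes the noise really is Gaussian; the function name `two_sided_tail` is illustrative.

```python
import math


def two_sided_tail(k):
    """Return P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2.0))


# Fraction of normally distributed samples expected beyond 3 sigma
print(f"{two_sided_tail(3):.2%}")  # prints 0.27%
```

So for purely Gaussian noise, only about 1 sample in 370 would exceed the threshold by chance.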

To test this we can simulate a sinusoid and add some Gaussian noise. The first example shows how the CMa operator filters a very noisy signal:

Since we can compute the average and standard deviation in real time, we can now also monitor the Z-score on an event basis. The following shows the real-time Z-score operator:

I used the absolute value; suspect samples can then be flagged automatically if the Z-score is greater than 3.
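The flagging logic can be sketched in Python as below. This is not the Advanced Analytics Engine implementation. It is a hedged approximation that uses plain exponential averages for the mean and variance, scores each sample against the state before that sample is absorbed, and suppresses flags during a warm-up period. The names `StreamingZScore`, `tau`, `threshold` and `warmup` are all assumptions made for the example.

```python
import math
import random


class StreamingZScore:
    """Event-based Z-score from exponentially weighted mean and variance."""

    def __init__(self, tau, threshold=3.0, warmup=0):
        self.tau = tau              # time constant, same units as timestamps
        self.threshold = threshold  # flag when abs(z) exceeds this value
        self.warmup = warmup        # number of samples before flags are raised
        self.count = 0
        self.last_t = None
        self.mean = None
        self.var = 0.0

    def update(self, t, x):
        """Return (z, flagged) for the new sample (t, x)."""
        self.count += 1
        if self.mean is None:
            # First sample only initialises the state
            self.last_t, self.mean = t, x
            return 0.0, False
        w = math.exp(-(t - self.last_t) / self.tau)
        self.last_t = t
        # Score against the state *before* absorbing the sample, so an
        # outlier does not dilute its own score.
        dev = x - self.mean
        std = math.sqrt(self.var)
        z = dev / std if std > 0 else 0.0
        # Update the running mean and variance
        self.mean = w * self.mean + (1.0 - w) * x
        self.var = w * self.var + (1.0 - w) * dev * dev
        flagged = self.count > self.warmup and abs(z) > self.threshold
        return z, flagged


if __name__ == "__main__":
    # Simulated sinusoid with Gaussian noise and one injected spike
    random.seed(1)
    detector = StreamingZScore(tau=20.0, warmup=100)
    for i in range(500):
        x = 10.0 * math.sin(2.0 * math.pi * i / 200.0) + random.gauss(0.0, 1.0)
        if i == 300:
            x += 40.0
        z, flagged = detector.update(float(i), x)
        if flagged:
            print(i, round(z, 2))
```

The warm-up count is one simple way to handle the build-up time: during the first samples the variance estimate is still close to zero, so the score is unreliable.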

Some properties of any moving operator are that:

a) you are basically creating a modified time-series - so these are not the original data points
b) there is also some lag of the filtered signal

The first point is especially important in regulated industries, because they insist on original values.

One of the things I have always wondered about is whether you can create a time series from the original data points that denoises the signal. The OSIsoft compression algorithm by design oversamples extreme values. So can you resample the original samples in a way that removes extreme values in real time?

One way to do this is to calculate the median of a moving window and preserve the time stamp of the result:

The algorithm resamples the original data series, but now keeps the original data points that have a smaller distance to the signal. This can also be seen in the summary statistics and the histogram:

Stat      Raw      Position Median
Min       -29.97   -14.06
Max       30.73    14.70
Average   0.62     0.60
StdDev    10.22    3.93
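The median resampling described above can be sketched in Python as follows. This is one possible reading of the idea, not the exact algorithm behind the numbers in the table. It assumes an odd window length, picks the sample whose value is the median of the window, and emits that sample with its own timestamp. To keep the output in time order, a pick is emitted only if it is newer than the last emitted point, which is a design choice of this sketch. The function name `median_resample` is illustrative.

```python
from collections import deque


def median_resample(samples, window=5):
    """Yield original (timestamp, value) pairs selected by a moving median.

    samples: iterable of (timestamp, value) pairs in time order.
    window:  odd number of samples per window.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    buf = deque(maxlen=window)   # holds only the most recent samples
    last = None                  # last emitted (timestamp, value)
    for sample in samples:
        buf.append(sample)
        if len(buf) < window:
            continue             # still filling the first window
        # The middle element by value is an original data point,
        # so its original timestamp is preserved.
        picked = sorted(buf, key=lambda s: s[1])[window // 2]
        # Emit each point once and never step back in time
        if last is None or picked[0] > last[0]:
            last = picked
            yield picked


raw = [(0, 1.0), (1, 2.0), (2, 100.0), (3, 3.0), (4, 4.0),
       (5, 5.0), (6, -50.0), (7, 6.0), (8, 7.0)]
print(list(median_resample(raw, window=3)))
# [(1, 2.0), (3, 3.0), (4, 4.0), (5, 5.0), (7, 6.0)]
```

In the small example the two extreme values (100.0 and -50.0) never appear in the output, and every emitted pair is an unmodified point from the raw series.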

In my view, the AF layer is the handoff between the engineering and data science teams. Data provided downstream should be conditioned and cleaned to improve the success rate of the modeling efforts.

Site applications such as alarming should also be based on high-quality data streams. Real-time operators and the resampling methods are great tools that can be implemented with little effort.