Lessons from a failed Machine Learning project - watch out for Future Creep

Blog Post created by stuart.watson Champion on Apr 15, 2019

No one likes to talk about their failures.

Failures show that we were wrong, and no one likes to be wrong (I thought I was wrong once but found I was mistaken).


Let me tell you about one of our failures.

Part of the paper making process involves chemical reactions, not all of which are stable. And when these unstable reactions go off-track, they can go off-track quickly (explosion is such an evocative word). Safety is always job number one, and production always, always takes a back seat to safety, so there are many safety features to ensure that going off-track, while costly to production, is done safely.


In the problem we were trying to solve, part of the pulp bleaching process was going off-track, and recovering from the triggered safety measures led to 30 to 40 minutes of lost production per event. For a 24-hour operation, you can never get that lost production back, so lost production usually equates to lost money.

So everyone would prefer for the reactions to stay on-track, and to have plenty of warning of when we are heading to the ditch so that adjustments can be made.


So, how do we know when things are going wrong?


This problem looked like a good application for Machine Learning.

We had plenty of data - with years' worth of readings for almost everything you could think of in and around the reaction vessel.

The data was clean, with relatively few gauging errors and lost points, and the start of each event was sharply defined - down to the second.

So we gathered our historical data for all of the sensors that we thought might influence the reaction. We pulled several years of PI data, sampled every minute, and with the benefit of hindsight we worked out, for each sample, how long it would be until the next event. This was the value we were trying to predict.
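As a rough illustration of that labelling step, here is a minimal Python sketch. The function name, the integer minute timestamps, and the example values are all made up for illustration; this is not the code we used.

```python
from bisect import bisect_left

def minutes_to_next_event(sample_minutes, event_minutes):
    """For each sample time, return the minutes until the next event.

    Returns None for samples that have no later event in the data.
    Timestamps are plain integers (minutes) to keep the sketch simple.
    """
    events = sorted(event_minutes)
    labels = []
    for t in sample_minutes:
        i = bisect_left(events, t)  # index of the first event at or after t
        labels.append(events[i] - t if i < len(events) else None)
    return labels

# Hypothetical example: one sample per minute, events at minutes 4 and 8.
samples = list(range(10))
labels = minutes_to_next_event(samples, [4, 8])
```

Each label is only knowable in hindsight, which is fine for a training target. The trouble described later comes from hindsight leaking into the input values as well.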

We fed this data into a neural network trainer on the Microsoft Azure Machine Learning platform and, with the appropriate cleaning, splitting, training, and scoring, came up with a model that appeared to work really well. With the historical data, we were able to predict an event thirty minutes prior over 50% of the time with an acceptable false positive rate.

This would have been a significant improvement in production - enough to pay for my annual salary several times over. Woohoo! Win for team PI !!!


Using the Azure Machine Learning Web Services, we developed a way of getting the live data to the Machine Learning Platform, and the results back to PI (see Azure Machine Learning: PowerShell script to store the web service results), set the script to run every minute, and waited for enough events to prove it was working before declaring success.


And that is when reality came crashing down (well not really crashing - it was more of a slow creeping realization but that doesn't sound as dramatic).

Real, live model results were not matching the historical data, and we had no discernible edge. We were not predicting the events, and even worse, we had many, many false positives. It was a failure.


So what was going wrong - why was minute-by-minute historical data so different from minute-by-minute live data?

The reason was:

Future Creep.

The future was creeping into the past, into the historical data, by means of data compression and interpolation.


When the historical data was originally stored, reasonable non-zero values for the exception and compression deviations were used. For stable values (and for most of the time, the process is stable), most of the data is compressed away - which is typically good. However, when stable regions are followed by large changes, the effect of compression and the resulting interpolation of the historical data can lead to long-range errors, as seen in the orange line below.


We can see that the difference between the historical and the blue "live" data is small (less than the compression deviation). However, once compression and then interpolation are applied, the rapid increase starting at 21:00 can produce a deviation from the stable value that reaches back a significant amount of time. In other words, knowledge of the event is consistently being interpolated back into the historical data prior to the event.
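To make the mechanism concrete, here is a minimal Python sketch. It is not PI's actual compression algorithm: `compress` is a simplified swinging-door-style filter, exception testing is ignored, and the signal, the deviation of 0.5, and all the names are invented for illustration.

```python
def compress(times, values, dev):
    """Keep only the points a simplified swinging-door-style filter archives."""
    kept = [(times[0], values[0])]
    t0, v0 = kept[0]
    lo, hi = float("-inf"), float("inf")  # allowed slopes from the last kept point
    for i in range(1, len(times)):
        t, v = times[i], values[i]
        if not (lo <= (v - v0) / (t - t0) <= hi):
            # The new point leaves the corridor: archive the previous
            # point and restart the corridor from it.
            t0, v0 = times[i - 1], values[i - 1]
            kept.append((t0, v0))
            lo, hi = float("-inf"), float("inf")
        # Narrow the corridor so every skipped point stays within dev
        # of the straight line that is later used for interpolation.
        lo = max(lo, (v - dev - v0) / (t - t0))
        hi = min(hi, (v + dev - v0) / (t - t0))
    if kept[-1][0] != times[-1]:
        kept.append((times[-1], values[-1]))
    return kept

def interpolate(kept, t):
    """Linearly interpolate between the archived points on either side of t."""
    for (t1, v1), (t2, v2) in zip(kept, kept[1:]):
        if t1 <= t <= t2:
            return v1 + (v2 - v1) * (t - t1) / (t2 - t1)
    raise ValueError("t is outside the archived range")

# Made-up live signal: flat at 50.0 for 100 minutes, then an accelerating rise.
dev = 0.5
times = list(range(111))
live = [50.0 if t <= 100 else 50.0 + 0.02 * (t - 100) ** 2 for t in times]

kept = compress(times, live, dev)
historical = [interpolate(kept, t) for t in times]
```

In this toy run only the points at minutes 0, 105 and 110 are kept. The live value at minute 60 is 50.0, but the reconstructed historical value is about 50.29: it is already drifting toward a rise that does not start until minute 100, while never differing from the live data by more than the deviation.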

And it is exactly these small but consistent differences from the stable values that Machine Learning picks up and identifies as meaningful.


And the damage was done. There was no way to recover the data that was compressed away. Removing "future creep" by not performing interpolation and using only the compressed data meant completely missing significant shifts in the data, which were exactly what we were looking for, as illustrated by the gray line below.
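As a sketch of what "using only the compressed data" can look like, the Python below holds the most recent archived value instead of interpolating. The archived points and the helper name are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical archived points (minute, value) left after compression.
archived = [(0, 50.0), (105, 50.5), (110, 52.0)]

def last_archived_value(points, t):
    """Return the most recent archived value at or before minute t."""
    times = [p[0] for p in points]
    i = bisect_right(times, t) - 1
    if i < 0:
        raise ValueError("t is before the first archived point")
    return points[i][1]

# Values a model would see at a few chosen minutes.
held = [last_archived_value(archived, t) for t in (60, 104, 105, 109)]
```

Nothing from the future leaks into minute 60 any more, but the held value also stays flat at 50.0 right up to minute 104 and at 50.5 until minute 110, so any movement between archived points is invisible to the model.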


When we tried these changes, the model's prediction rate dropped significantly, and the live prediction results barely improved.


What is the conclusion?

We tried to get good results - the payoff was well worth the effort - but ultimately we failed.

We could not produce predictions that were consistent enough for any action to be taken.

The historical data available to us was simply not suitable for tackling the problem, because of good decisions made years before this technology was even available.

So for now, we have moved on.


But we learned... a lot.


We learned that not all data is created equal: the kind of data storage configuration needed for historical visualization, root cause analysis, and process improvement initiatives is not the same as that needed for problems tackled using machine learning and, unfortunately, the data collection settings for one can completely invalidate the data's use in the other.

We learned that data compression and interpolation make the data prior to the kind of sudden events we were trying to predict unreliable for model training. The length of this "unreliability window" can go back days, depending on the stability of the data and how the data collection and tags are configured.
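One rough way to gauge how far that window reaches is to ask, for any minute of interpolated history, how far ahead the next archived point lies, since that future point shapes the interpolated value. A small Python sketch, with made-up timestamps and a hypothetical helper name:

```python
from bisect import bisect_left

def lookahead_minutes(archived_times, t):
    """Minutes between t and the next archived point at or after t.

    An interpolated value at t depends on that future point, so a large
    result means the value at t carries information from far ahead.
    Returns None if there is no archived point at or after t.
    """
    i = bisect_left(archived_times, t)
    if i == len(archived_times):
        return None
    return archived_times[i] - t

# Hypothetical archive timestamps in minutes (must be sorted).
archived_times = [0, 105, 110]
leaks = [lookahead_minutes(archived_times, t) for t in (60, 105, 108)]
```

Here the interpolated value at minute 60 depends on a point archived 45 minutes later, whereas minute 105 is itself an archived point and has no look-ahead at all.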

We learned how to feed the Machine Learning monster from PI, and feed those results back into PI.

And we learned which types of problems we should try to tackle given our historical data: "future creep" is less evident for events that are gradual and not catastrophic.


And we have plenty of those problems to tackle.