Good news is you are probably part of the 1,000+ generation & utility companies around the globe that uses the ultimate real-time data infrastructure, the PI System! Chances are there are several years of operational data living in your PI archives ready to be used. That, in conjunction with other types of data such as relational and geospatial, can be combined and used for advanced analytics for that Big Data project everyone is talking about. From a technology standpoint, most of data falls under the four categories described in the following picture:
and since the OT world and Operations are traditionally heavy users of time-series data, that is where PI fits and excels. But that is nothing new so why do I need the PI System for a Big Data project when a Data Lake can store just about anything?
Good question and this is part of a longer conversation but let’s explore a few points.
- Data context: Data context is everything and shaping data lower in the process brings immense value while unstructured information makes it more difficult for non-data scientists to access the data. Think AF! One key aspect of Data Lakes is that they hold unstructured data, with schemas being built on-demand with every read. In a sense, this approach allows for flexibility, but it also has its problems.
- Data history: You are reading this guide so I am guessing your data is already in the PI System. Not only that, but you have gone through the exercise of setting up connections and configuring interfaces to your various data sources. Why redo all that work?
- Data preparation: As mentioned in Part I of this guide, the integrators available with the PI System can dramatically decrease the amount of data preparation time spent in the project. More time for working on what really matters.
- Data enhancement: Our goal is to build a complex and accurate model and a specialized tool will be used for the job but some calculations can (and should) be done in PI. For example, calculating limits, rates of change, accumulations and performance can be implemented with little effort in Asset Analytics and since this is important to many people, they can be made available to a larger audience with ease.
- Security: The PI System provides secure access to your operational data while limiting direct access to the ESP (Electronic Security Perimeter). This not only expands SCADA data beyond the controlroom but it also limits any negative impacts users and applications may have on critical systems.
In other words, the PI System is the bridge connecting your operational data and your business applications where one side has secure critical data and the other needs that information to provide users and application what they need to run their business intelligently.
As mentioned, to build advanced models such as a multivariate time-series forecasting analysis, other applications will be leveraged. For the purpose of this guide we will focus on Azure Machine Learning to develop our goal of a load forecasting model but in reality, any similar applications with equivalent capabilities could be used. For instance, when preparing the data in Part I, a different target could had been used and some, but not limited options are Hadoop, SAP Hana, TIBCO Spotfire and Tableau.
Let’s get started… our model’s dependent variable will be the load in MW at a given time and location. For that, we will be using historical loads and temperatures as the independent variables to be used as input in the multiple regression analysis.
We will need an account for AZURE Machine Learning Studio (if you don’t have one try it free at https://studio.azureml.net). After logging in, go ahead and create an experiment, it will be our test project. The project, when finished, will later be deployed as a webservice. As usual, let’s start from the end and take a look at what the experiment should look like when completed:
From square zero, this is what an empty experiment looks like:
We are ready to start building and for that we will use the ”Search experiment items” from the menu on the left (in red) to find the different modules that we will be used in the project. These modules perform individual tasks like importing data into the project, training and scoring the model, or even saving the final results to a defined location. Follow the steps and good luck!
1) Search for “Import data”. A module with the same name should appear as seem in the red box (1) below. Drag and drop the module (1) on the canvas (2) as shown:
Select the module box (2) and a configuration tab will appear on the right-hand side. Depending on the module, it will provide a wizard to guide you through the configuration required. In our case, we are connecting to an AZURE SQL Database and querying some data. The screenshot to the side illustrates the configuration to do just that. On the SQL query text area, we are bringing the columns of interest already mentioned using the following SQL statement:
SELECT [SubstationCode] ,[TimeStamp] ,[Ambient Temperature] ,[StationMW] ,[Year] ,[Month] ,[Day] ,[Hour] ,[Week of the Year] ,[Day of the Week] FROM [dbo].[AllSubstations] WHERE SubstationCode = 'ALD'
2) Now let’s Search for “Clean missing data” for some further data cleansing. Although not necessary, this step will exclude any rows of data without temperature or load data (if any) as these wouldn’t bring any value to the model and could potentially affect its training process. Repeat the process of dragging and dropping the module found onto the project canvas (this will be omitted on the following steps). On the configuration tab, pick ambient temperature and Station MW and select “remove entire row” as the cleaning mode as seen on the screenshot below.
3) Next, search for and add “Split data”. This module is responsible for separating the input data into two groups, one that will be used for training the model and the other that will be used to validate and score it after it is built. The only configuration necessary is the “Fraction of rows in the first output dataset” which I used 0.7. That means 70% of the data will be used for training and the remaining for to scoring.
4) Let’s repeat the same steps for “Train Model”. On its configuration panel, select “Station MW” as your variable to be predicted. This basically tells the model we are trying to predict the load based on all other variables.
5) Picking an algorithm. I won’t say here you should pick one over another or pretend to be an expert on it. For that, there are 1000+ papers on the subject published on IEEE that you can go read. The only thing I will say is that it should probably be a regression type of algorithm and in this guide I used a “Neural Network Regression”. If you don’t feel like reading and want to experiment a little, Microsoft offers a cheat sheet on how to choose between algorithms:
With that said, you can either search for “Neural Network Regression” or navigate to it by drilling down the path: Machine Learning -> Initialize Model -> Regression -> Neural Network Regression from the menu on the left.
Leave all the default configurations as it is.
6) Let’s do the same for “Score model” - You can use this module to generate predictions using a trained classification or regression model. It generates just the predicted numeric value, the station load in MW in our case, and after you have generated a set of scores using this module, you can connect the scored dataset to “Evaluate Model”, to generate a set of metrics used for evaluating the model’s accuracy (performance).
Nothing needs to be done under configuration.
7) Almost done, repeat the steps for “Evaluate Model” – As mentioned, you can use this module to measure the accuracy of a trained model. It computes a set of industry-standard evaluation metrics.
These metrics returned are generally designed to estimate the amount of error and how fit the model is by using the difference between observed and predicted values is small.
The following metrics are reported for evaluating regression models. When you compare models, they are ranked by the metric you select for evaluation.
- Mean absolute error (MAE) measures how close the predictions are to the actual outcomes; thus, a lower score is better.
- Root mean squared error (RMSE) creates a single value that summarizes the error in the model. By squaring the difference, the metric disregards the difference between over-prediction and under-prediction.
- Relative absolute error (RAE) is the relative absolute difference between expected and actual values; relative because the mean difference is divided by the arithmetic mean.
- Relative squared error (RSE) similarly normalizes the total squared error of the predicted values by dividing by the total squared error of the actual values.
- Mean Zero One Error (MZOE) indicates whether the prediction was correct or not. In other words: ZeroOneLoss(x,y) = 1 when x!=y; otherwise 0
- Coefficient of determination, often referred to as R2, represents the predictive power of the model as a value between 0 and 1. Zero means the model is random (explains nothing); 1 means there is a perfect fit. However, caution should be used in interpreting R2 values, as low values can be entirely normal and high values can be suspect.
Let’s run the experiment and inspect the results. First the Scored results
Right click on the “Score model” module and select Dataset->Visualize. Below is an example of what it should look like. Notice in red that the Station MW was the known variable and the Scored Labels is the predicted one.
Follow the same steps for the “Evaluate Model” module and you will get something like the screenshot below:
Let’s now inspect the results in the SQL table used as a target in you “Export Data”:
The final step is to bring the data back to the PI Data Archive and since the forecast data results are in a relational database, one way to accomplish that is to use the PI RDBMS. Need help with this step? Follow this video from our Youtube channel:
Finally, the results! In the picture below, the blue pen represents the results using a linear regression model while the purple shows the forecast using the neural network regression. The last one, in orange, is the actual measured load values for this particular substation.
That’s it folks, I hope you liked it. If you want to play and compare other algorithm, here is a good resource on how to do just that:
I hope you liked it and good luck on your next project!
<<< BACK (Building a load forecasting model in PI - Part I )