Data Preparation


When you have that great idea for optimizing an expensive process that will blow your boss's mind, or a post-mortem analysis to get to the bottom of a problem nobody can explain, there is nothing better than spending countless hours gathering and prepping data instead of doing the investigative work itself. Cough... cough... not really!



A Forbes article* highlights that, on average, data preparation accounts for about 80% of a data scientist's work, as the pie chart below shows. That is an absurd amount of time that translates directly into dollar signs, and it is by far the least pleasant part of the process.


*Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says


So let's try to take care of that and automate this tedious task as much as possible. This first part of building a load forecasting model focuses on preparing the data for the machine learning application, which will then use it to learn and build the forecasting model. We will be using tools that can highly automate this process. To keep things simple for the sake of this guide, only temperature and time-related variables will be used as inputs to predict the load. Although a simplistic approach, it can yield highly accurate results! Here is what we will use:


Input variables

  • Temperature - observed ambient temperature around the substation locations.
  • Time - year, month, day, hour, weekday, day of the month, etc., all extracted from each data value's timestamp.
  • Substation - substation code, used to tell the substations apart.

Output variable

  • MW - substation load we are trying to predict.
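To make the shape of a training record concrete, here is a rough sketch of a single row with these variables. The column names and values are illustrative only; the real ones come from your Asset View configuration.

```python
import pandas as pd

# Hypothetical single training row; real column names come from the
# Asset View configured in the PI Integrator.
row = pd.DataFrame([{
    "Substation": "SUB01",      # substation code (categorical input)
    "Temperature": 28.4,        # observed ambient temperature
    "Year": 2016, "Month": 7, "Day": 14,
    "Hour": 15, "Weekday": 3,   # features extracted from the timestamp
    "MW": 12.7,                 # target: the substation load to predict
}])

inputs = row.drop(columns=["MW"])   # what the model sees
target = row["MW"]                  # what the model learns to predict
```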


Initially, we will push historical data to a SQL database that lives in Azure using the PI Integrator, so that Azure Machine Learning can use it as the source for building the model. This allows the ML regression algorithms to train on several years of load and weather history so they can predict future loads from real-time data. To make this integration easier, all substations implement an AF element template called "OSIsoft Substation Template" and have been placed in a "flat" AF hierarchy, as shown below. The template will be used as a filter in the Integrator:


The AF tree is available through the Integrator designer UI. Notice that the same hierarchy appears in the images above and below. With that said, let's navigate the AF tree to find our substations and select the variables to be transmitted (again: temperature, MW, and substation code). It is important to apply the new element template as a filter so the Integrator finds all matching substations (23 in this case). Let's go ahead and build this Asset View as seen below.




But life isn't perfect, and as you probably already know, operational data is often not evenly spaced; it is not uncommon to have gaps here and there that will affect our predictive algorithms. For example, below is a screenshot of what our input data looks like. Each marker represents a data value in the archive, and it is pretty clear that they are not all spaced the same way and wouldn't fit into a table/cube shape very well.
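If you want to quantify that unevenness yourself, a few lines of pandas will do it. The timestamps and values below are made up to illustrate the problem; with real data you would pull the archive values first.

```python
import pandas as pd

# Hypothetical raw archive values for one substation: note the
# irregular spacing and the gap between 10:07 and 13:55.
raw = pd.DataFrame(
    {"MW": [11.2, 11.5, 10.9, 12.1]},
    index=pd.to_datetime([
        "2016-07-14 09:58", "2016-07-14 10:07",
        "2016-07-14 13:55", "2016-07-14 14:02",
    ]),
)

# Time between consecutive values -- anything other than a constant
# step means the data won't drop neatly into a table/cube shape.
deltas = raw.index.to_series().diff().dropna()
print(deltas.max())  # the nearly four-hour gap stands out
```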



This is where the PI Integrator (in our case the PI Integrator for Azure*) comes in handy. It will streamline the data preparation, saving us precious time!

PI Integrator for Microsoft Azure - Overview


Data preparation - features by the Integrator

  • Cleanse

         Handles out-of-range, missing, and null values. In our case, if there is no MW value we choose not to include the row.


  • Augment

          Breaks the timestamp down into more variables: year, month, day, hour, weekday, and week of the year, to be precise. This will make the prediction a whole lot more accurate.


  • Shape

          The data will populate a table, so it needs to fit; in other words, we need evenly spaced data. Since we will be forecasting hourly loads, this is where we configure that.



  • Transmit

          Publish the data! For the historical backfill it only needs to run once. After the model is built, we will do this again and have it run on a schedule.
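The Integrator performs all of this through its UI, but to make the pipeline concrete, here is a rough pandas equivalent of the cleanse, shape, and augment steps (hypothetical data, and not a claim about the Integrator's actual internals):

```python
import pandas as pd

# Hypothetical raw values; None stands in for a missing MW reading.
raw = pd.DataFrame(
    {"Temperature": [28.1, 28.6, 29.0], "MW": [12.3, None, 12.9]},
    index=pd.to_datetime([
        "2016-07-14 09:58", "2016-07-14 10:40", "2016-07-14 11:03",
    ]),
)

# Cleanse: drop rows that have no MW value.
clean = raw.dropna(subset=["MW"])

# Shape: align everything on an even hourly grid (we forecast hourly),
# dropping hours left empty after cleansing.
hourly = clean.resample("1h").mean().dropna()

# Augment: break the timestamp into extra features.
hourly["Year"] = hourly.index.year
hourly["Month"] = hourly.index.month
hourly["Hour"] = hourly.index.hour
hourly["Weekday"] = hourly.index.weekday
```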



Great, we are almost done with the "boring" part of our project. By now your data should have been published. Let's inspect the resulting table (the data lives in a SQL table, so use any client tool to connect and query):
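If you prefer code over a client tool, a quick query does the same job. In production you would connect to the Azure SQL database (for example with pyodbc); the snippet below uses an in-memory SQLite stand-in so it runs anywhere, and the table and column names are illustrative, not the exact ones the Integrator writes.

```python
import sqlite3

# Stand-in for the Azure SQL table published by the Integrator;
# table and column names here are assumptions for illustration.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE SubstationLoads (
    TimeStamp TEXT, Substation TEXT, Temperature REAL,
    Year INT, Month INT, Hour INT, Weekday INT, MW REAL)""")
db.executemany(
    "INSERT INTO SubstationLoads VALUES (?,?,?,?,?,?,?,?)",
    [("2016-07-14 09:00", "SUB01", 28.1, 2016, 7, 9, 3, 12.3),
     ("2016-07-14 10:00", "SUB01", 28.9, 2016, 7, 10, 3, 12.9)],
)

# The same SELECT works against the real Azure SQL table.
rows = db.execute(
    "SELECT Substation, Hour, MW FROM SubstationLoads ORDER BY TimeStamp"
).fetchall()
print(rows)
```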




Fantastic, it looks just like we wanted. Off we go to the machine learning part of our guide. See you on Part II!