# PI Data and R - A graphical statistical case study

Blog Post created by Ahmad Fattahi on Apr 11, 2012

Following my previous posts (here and here) on integrating the PI System with R, this post takes the story one step further by analyzing the data to find correlations. In doing so we make two major points. First, R is extremely powerful and efficient at analyzing large amounts of data and manipulating it statistically; as the code snippets show, everything here takes only about 10 lines of code! Second, we see the power of good graphics in interpreting data. The same data set can reveal very different things depending on how it is shown. Appropriate graphics make data talk.

We work with some real-world data here. In our San Leandro office, we collect the outside temperature as well as the instantaneous power consumption of the building in a PI Server. I exported the data (sampled once a day at around 4pm) over the past year to a CSV file (using PI DataLink or the piconfig utility). We therefore have a CSV file with 3 columns: Timestamp, Power, Temperature. This file is attached for your reference. Note that some data cleaning may be necessary before feeding the data into the graphs. The next step is to read the data into a "dataframe" object in R. You can think of a dataframe as a super table that R can do lots of things with.

```
#Read the data from the CSV file
PowerTemp.df <- read.csv(file='C:\\Users\\afattahi\\Documents\\R\\Examples\\SL - Power - Temp - 1year - Cleaned.csv', header=TRUE)
```

Note that "\" is the escape character in R strings, which is why each backslash in the Windows path above is doubled. The first step is to convert the power and temperature to numeric values (they are read as "factors" in R):

```
#Converting the Power and Temperature to numerical vectors
power.numeric <- as.double(as.vector(PowerTemp.df$Power))
temperature.numeric <- as.double(as.vector(PowerTemp.df$Temperature))
```
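Why the detour through `as.vector()` rather than converting the factor directly? The following is a minimal sketch with made-up values (not the San Leandro data), assuming a column was read as a factor because it contains a non-numeric entry; the `"No Data"` placeholder is hypothetical. It shows that `as.numeric()` on a factor returns internal level codes, while converting through character strings recovers the readings and marks the bad entries as `NA`, which is one simple way to do the data cleaning mentioned earlier.

```r
# Toy column with a non-numeric entry, as might come out of an export
# (values are made up for illustration)
raw <- factor(c("71.5", "68.2", "No Data", "80.1"))

# Pitfall: as.numeric() on a factor returns the integer level codes,
# not the numbers printed in the labels
codes <- as.numeric(raw)

# Converting via character strings recovers the actual readings;
# entries that are not numbers become NA (R warns about this)
values <- suppressWarnings(as.numeric(as.character(raw)))

# Keep only the entries that converted cleanly
clean <- values[!is.na(values)]

print(codes)
print(clean)
```

With a full dataframe, the same `NA` check could be used to drop whole rows so that power and temperature stay aligned.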

Now let's plot power vs. time:

```
#Plot the power vs time as is
plot(PowerTemp.df$Time, power.numeric, xlab="Date", ylab="Power")
```

As we can see, the dates are imported as plain strings, so there is no real date order and the x-axis looks really bad. To fix this, we convert them to "Date" objects in R and redo the plot for both power and temperature:

```
#Plot the power and temperature vs time as Date objects
plot(as.Date(PowerTemp.df$Time, format="%d-%b-%Y"), power.numeric, xlab="Date", ylab="Power")
plot(as.Date(PowerTemp.df$Time, format="%d-%b-%Y"), temperature.numeric, xlab="Date", ylab="Temperature")
```
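To make the string-versus-Date difference concrete, here is a small sketch with made-up timestamps (not taken from the attached file), assuming an English locale so that `%b` matches month abbreviations such as "Feb". It also shows the difference between `%Y` (four-digit year) and `%y` (two-digit year); the format string has to match whatever layout the exported CSV actually uses.

```r
# Made-up timestamps in day-month-year form, deliberately out of order
stamps <- c("05-Feb-2012", "20-Jan-2012", "11-Apr-2011")

# Sorting the raw strings compares characters, not calendar dates
sorted.strings <- sort(stamps)

# Parsing with a matching format string gives real Date objects,
# which sort (and plot) in chronological order
dates <- as.Date(stamps, format = "%d-%b-%Y")
sorted.dates <- sort(dates)

# A two-digit year needs %y instead of %Y
short.year <- as.Date("11-Apr-12", format = "%d-%b-%y")

print(sorted.strings)
print(sorted.dates)
print(short.year)
```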

As intuition suggests, there seems to be some correlation between the temperature and the building's power consumption: the higher the temperature, the higher the power consumption due to the AC. To see this better, let's plot power vs. temperature:

```
#Plot the correlation between temperature and power using plot
plot(temperature.numeric, power.numeric, xlab="Outside temperature", ylab="Power consumption")
```

The scatter plot makes it far more obvious that there is a strong correlation between temperature and power demand. The only problem with this graph is that it can fail to show the density of points in crowded areas, because dark points overlap each other; two overlapping dark points look the same as 10 overlapping points. To fix this issue let's use smoothScatter:

```
smoothScatter(temperature.numeric, power.numeric, xlab="Outside temperature", ylab="Power consumption")
```
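The visual impression of a correlation can also be backed by a number. The sketch below is not part of the original analysis: it uses synthetic data (the variable names, the slope of 2 and the noise level are all assumptions chosen for illustration) to show how base R's `cor()` and `lm()` could quantify such a relationship.

```r
set.seed(42)  # reproducible synthetic data, not the building's readings

# Hypothetical daily temperatures and a power level that rises with them
temperature <- runif(200, min = 50, max = 95)
power <- 100 + 2 * temperature + rnorm(200, sd = 10)

# Pearson correlation coefficient: values near +1 indicate a strong
# increasing linear relationship
r <- cor(temperature, power)

# Least-squares line, to estimate how much power rises per degree
fit <- lm(power ~ temperature)
slope <- coef(fit)[["temperature"]]

print(r)
print(slope)
```

On real data containing missing values, `cor()` accepts `use = "complete.obs"` to skip incomplete pairs.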

The smoothScatter plot looks beautiful! Not only is the correlation obvious, but the plot also reveals a bifurcation in behavior: there are two clearly separate branches. In other words, there must be some other parameter(s) that the relationship is conditioned on. Our intuition about the underlying problem tells us that weekends should be much lighter on power because the AC set points are adjusted to save energy. To test this, let's add the "day of the week" to our dataframe and plot power vs. day of the week:

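```
#Get the week day out and plot power grouped by week day
library(ggplot2)  # provides qplot() and position_jitter()
#Note: the format string must match the timestamps in the CSV (%y = two-digit year, %Y = four-digit year)
PowerTemp.df <- transform(PowerTemp.df, Weekday=weekdays(as.Date(Time,format="%d-%b-%y"), abbreviate=TRUE))
qplot(PowerTemp.df$Weekday, power.numeric, position=position_jitter(w=0, h=1), xlab="Day of the week", ylab="Power")
```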

The plot clearly shows that Saturdays and Sundays behave differently from the workdays. These two days correspond to the lower branch in the correlation graphs above.
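The weekday/weekend split can also be checked numerically. Here is a minimal sketch on synthetic data (the dates, power levels and the `is.weekend` name are assumptions for illustration, not values from the attached file). It derives a weekend flag from the date and compares the mean power of the two groups.

```r
set.seed(1)  # reproducible synthetic data, not the building's readings

# Four weeks of hypothetical daily readings, starting on a Monday
dates <- seq(as.Date("2012-01-02"), by = "day", length.out = 28)

# format(..., "%u") gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which avoids depending on locale-specific day names
is.weekend <- as.integer(format(dates, "%u")) >= 6

# Made-up power levels: lower on weekends, as hypothesized in the post
power <- ifelse(is.weekend, 150, 250) + rnorm(28, sd = 10)

# Mean power for each group
group.means <- tapply(power, ifelse(is.weekend, "weekend", "workday"), mean)
print(group.means)
```

The same logical flag could be used to color the two branches in the power vs. temperature scatter plot.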

As we saw above, R provides mighty tools to analyze large amounts of data and produce actionable graphics from them. Together with the PI System, it makes a powerhouse for turning data into action. There is a reason we spend so many resources collecting so many pieces of data.