# PI Data and R - Single Variable Statistics

Blog Post created by Ahmad Fattahi on May 18, 2012

Following up on my previous blog post here, I focus on single-variable statistics for a PI tag using R. The goal is to dig deeper into the data we have collected in the PI System and enable the Power of Data. As we will see, some very small code snippets can generate huge analytical and visual value out of data. All of this comes at no monetary cost as long as you have your PI System in place; R is a free tool!

We work with the same dataset as in the previous example: the power and temperature data gathered from the OSIsoft headquarters building in San Leandro, CA. We assume that the sampled data has already been exported to a CSV file using PI DataLink or another method such as PIConfig. The data is sampled once per day, going back a full year.

First we read the data from the CSV file into a variable called PowerTemp.df. This type of object is called a data frame in R. A data frame is much like a table with rows and columns. The values in each column are usually of the same type and represent one variable, such as temperature or time. Each row is an entry, or observation. Note that the whole structure is easily read from the CSV file. The names of the variables are set automatically from the headers in the CSV file:

```
#Read the data from the CSV file
PowerTemp.df <- read.csv(file='C:\\Users\\afattahi\\Documents\\R\\Examples\\SL - Power - Temp - 1year - Cleaned.csv', header=TRUE)

#Converting the Power and Temperature to numerical vectors
power.numeric <- as.double(as.vector(PowerTemp.df$Power))
temperature.numeric <- as.double(as.vector(PowerTemp.df$Temperature))
```
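Before going further it can help to inspect what was actually read. The following is a minimal sketch, assuming a data frame with the same column names as above (Time, Power, Temperature); `example.df` and its values are invented stand-ins for illustration, not the real export.

```r
# Synthetic stand-in for the exported CSV; the values are made up for illustration
example.df <- data.frame(
  Time = c("01-Jan-2012", "02-Jan-2012", "03-Jan-2012"),
  Power = c(310.5, NA, 452.1),
  Temperature = c(11.2, 12.0, 10.7),
  stringsAsFactors = FALSE
)

str(example.df)              # column names and the type of each column
head(example.df)             # the first few observations
summary(example.df$Power)    # basic statistics, including a count of NA values
colSums(is.na(example.df))   # number of missing values in each column
```

A quick check like this shows early whether a column was read with the expected type and whether it contains missing values.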

The next step is to look at the individual distribution of each variable, temperature and power consumption. A histogram gives us this information and more insight into how the variables have behaved over the period of interest.

```
#Plotting the simple histogram of the power with 20 bins
hist(power.numeric, breaks=20, col='blue')
```

It clearly shows the behavior we witnessed before: there are two types of behavior, or distributions, one related to weekends (base power) and the other to working days. Now let's fit a density function to the histogram above, using the function density(), for a smoother description of our data:

```
#Calculating the density function: we get an error due to NA in the data. We need to clean it out.
d <- density(power.numeric)
#Error in density.default(power.numeric) : 'x' contains missing values
```

Oops! We get an error. The problem is that our dataset contains some NA (Not Available) values, and by default the density() function does not accept them. This is a classic example of the need for data cleaning. There is a saying that 80% of a data scientist's or engineer's time is spent on cleaning data and 20% on algorithms and generating insight! So, let's take the NAs out. R can do this very efficiently. In the snippet below we clean the data, calculate the density, and plot the density function. The density object, d, contains the full statistical description of the graph:

```
#Cleaning the data
power.numeric.clean <- power.numeric[!is.na(power.numeric)]

#Create the density and plot it
d <- density(power.numeric.clean)
plot(d)
polygon(d, col="red", border="blue")
```
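As an aside, density() also has an na.rm argument, so the missing values can be dropped inside the call instead of beforehand. The sketch below uses a small invented vector `x` (not the real readings) to compare the two routes; both should give the same estimate.

```r
# Made-up readings with missing values, for illustration only
x <- c(300, 310, NA, 450, 460, 455, NA, 305)

sum(is.na(x))                    # how many values are missing
d1 <- density(x[!is.na(x)])      # filter first, as in the snippet above
d2 <- density(x, na.rm = TRUE)   # let density() drop the NA values itself

c(d1$n, d2$n)                    # sample size used by each estimate
c(d1$bw, d2$bw)                  # bandwidth chosen for each estimate
```

Filtering explicitly, as the post does, has the advantage that the cleaned vector can be reused in later steps.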

The density plot makes it evident that the behavior on the right (the right lobe: weekdays, 5 days a week) is dominant compared with the one on the left (the left peak: base power on weekends, 2 days a week). Beautiful!
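The weekday/weekend reading of the two lobes can be checked directly by labeling each day. This is only a sketch on an invented two-week series; `dates`, `power`, and `day.type` are hypothetical names, and with the real data the timestamps would first need the same NA filtering as the power values.

```r
# Synthetic daily series (values are invented): two weeks starting on a Monday
dates <- as.Date("2012-01-02") + 0:13

# %u gives the weekday number (1 = Monday ... 7 = Sunday), independent of locale
is.weekend <- format(dates, "%u") %in% c("6", "7")
power <- ifelse(is.weekend, 300, 450)

# Label every day and summarize the power level of each group
day.type <- factor(ifelse(is.weekend, "Weekend", "Weekday"))
table(day.type)                   # number of days in each group
tapply(power, day.type, median)   # typical power level for each group
```

If the two groups show clearly different typical levels on the real data, that supports reading the two lobes as working days and weekends.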

To put the icing on the cake, let's look at the distribution of the power in different seasons and compare them. We define the seasons as follows: Jan-Mar is winter, Apr-Jun is spring, Jul-Sep is summer, and Oct-Dec is fall. The first step is to extract the vector of months (1-12) from the timestamps in our dataset and clean it:

```
#Extract the vector of months and clean it
months.vector <- as.numeric(format(as.Date(PowerTemp.df$Time, format="%d-%b-%Y"), "%m"))
months.vector.clean <- months.vector[!is.na(power.numeric)]
```
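Note that the months are filtered with the NA mask of the power vector, not their own, so that the two cleaned vectors stay aligned row by row. A minimal sketch of that idea, using an invented data frame `example.df` with ISO dates (an assumption made here to keep the example independent of locale):

```r
# Invented example rows; the real Time column is text in day-month-year form
example.df <- data.frame(
  Time  = as.Date("2012-03-30") + 0:3,
  Power = c(450, NA, 300, 455)
)

keep <- !is.na(example.df$Power)   # one mask, reused for every vector
power.clean  <- example.df$Power[keep]
months.clean <- as.numeric(format(example.df$Time[keep], "%m"))

length(power.clean) == length(months.clean)   # the vectors stay aligned
months.clean                                  # month of each remaining row
```

Reusing a single mask avoids the situation where a power value ends up paired with the month of a different day.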

This is a very important step: we need to bin the vector of months according to seasons. In other words, we attach the corresponding season to each entry in the vector, based on our definition of seasons. We use the function cut() to do so. It generates a factor, which is another data structure in R. A factor is a vector of categorical values; each distinct category it can take is called a level. Factors are well suited to representing observations of categorical variables, in this case seasons.

```
#Create the factor of seasons
seasons <- cut(months.vector.clean, breaks=c(0,3,6,9,12), labels=c("Winter", "Spring", "Summer", "Fall"))
```
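To see what cut() does with these breaks, it can be applied to the twelve month numbers themselves. In this small sketch `seasons.demo` is a hypothetical name; each interval is open on the left and closed on the right, so month 3 still belongs to winter and month 4 to spring.

```r
# Bin the month numbers 1..12 using the same breaks and labels as above
months <- 1:12
seasons.demo <- cut(months, breaks = c(0, 3, 6, 9, 12),
                    labels = c("Winter", "Spring", "Summer", "Fall"))

levels(seasons.demo)       # the four categories, in the order given by labels
table(seasons.demo)        # three months fall into each season
as.integer(seasons.demo)   # the integer codes stored behind the labels
```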

Now we are ready to compare the distributions of the values of power per season. To do so we use the function sm.density.compare(). That's why we load the package sm first. The beauty of it is that once we know what we are doing everything is done with very few lines of code and becomes intuitive.

```
#Compare the distribution of power consumption by season
library(sm)
sm.density.compare(power.numeric.clean, seasons, xlab="Power consumption by season")
#Legend labels come from the factor levels so that they match the curves
legend("topright", legend=levels(seasons), fill=2+0:(length(levels(seasons))-1))
```

It shows that the dual behavior is again observed in each individual season, so this is an intrinsic behavior of the underlying process. The only curious point is that in spring the baseline power is dominant. This may be because of the moderate spring weather in California, when very little energy is used to cool or heat the building. To see the different behavior by season we can look at the box plots of the power consumption by season:

```
library(lattice)  #bwplot() comes from the lattice package
bwplot(power.numeric.clean~seasons, ylab="Power Consumption")
```
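The numbers behind a box plot can also be read off directly, and base graphics offers boxplot() as an alternative when lattice is not at hand. The following sketch runs on invented values; `power.demo` and `seasons.demo` are hypothetical stand-ins for power.numeric.clean and seasons.

```r
# Invented values standing in for the cleaned power vector and the seasons factor
power.demo   <- c(300, 450, 460, 310, 305, 440, 455, 470)
seasons.demo <- factor(c("Winter", "Winter", "Spring", "Spring",
                         "Summer", "Summer", "Fall", "Fall"),
                       levels = c("Winter", "Spring", "Summer", "Fall"))

# One median per season: the center line of each box
season.medians <- tapply(power.demo, seasons.demo, median)
season.medians

# Base-graphics box plot of the same grouping
boxplot(power.demo ~ seasons.demo, ylab = "Power Consumption")
```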

The intent of this post is to delve deeper into single-variable statistical analysis of the data. To do so we need to import data from the PI System into R, clean it up, and use appropriate analysis and graphics. R proves to be very efficient in enabling the Power of Data!