A couple of months ago I developed a Custom Data Reference (CDR) for R calculations, and on the surface it looked fairly simple. After using it for a while, however, it became clear that this solution does not scale.

 

The main problem when using a lot of R functions is that most of them require evenly spaced time series – the data need to be sampled at a constant interval. Due to exception and compression, PI data are irregularly spaced and therefore need to be interpolated before they can be used in most R packages. In the CDR, I used the interpolation function of the PI system to do the work, which made the workflow easy. In event-driven calculation mode this is (a minimal sketch of the per-event R call follows the list):

         
            New event arrives
            Get n interpolated values during a time span
            Submit the samples to R
            Calculate in R – I used linear regression as an example
            Get the result (slope) back
            Write to PI
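
As an illustration, here is a minimal sketch of the per-event R step, assuming the n interpolated values have already been retrieved from PI (the names slopeForEvent, values, and windowSizeInSeconds are just illustrative, not part of the CDR):

# Illustrative per-event worker, not the CDR code: 'values' holds the n
# interpolated samples and 'windowSizeInSeconds' the time span they cover.
slopeForEvent <- function(values, windowSizeInSeconds) {
  n <- length(values)
  t <- seq(0, windowSizeInSeconds, length.out = n)  # evenly spaced time axis
  fit <- lm(values ~ t)                             # simple linear regression
  unname(coef(fit)[2])                              # return the slope
}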

 

In history recovery mode, a single CDR would slow down my development system severely, so I didn't even try to do the same for, let's say, 100 attributes.

 

So the basic idea to improve performance is to develop an R Analysis service that takes the load off the PI server and performs all the data conditioning and the calculations. The requirements are:

 

                Each calculation\point is independently configurable
                Calculations are event or time triggered (elapsed)
                Interpolations are performed by the service
                Interpolated or recorded data are sent to R
                Results are written back to a PI tag
                Window calculations run in sequence – this is required to carry calculations over between time frames

 

Since R is single-threaded, I thought that ~500 calculations per service would be a good goal.

 

To isolate each step, I created three worker threads:

               Interpolation
               R Calcs
               Write to PI

and used Rx for all event-driven components.

 

When I first tested the application, I measured around 1–2 milliseconds per R call, which I thought was fast enough for most purposes. That quickly turned out to be a wrong assumption …

Let’s say you have a source tag with a 5-second sampling rate and you perform event-based calculations. For a history recovery over a 3-month period this translates to:

5 sec/sample = 720 values/hour ≈ 17,000 values/day ≈ 1.5 million values / 3 months

 

1–2 milliseconds of latency per R call means you can process only about 500–1,000 values per second, so working through the 1.5 million values would take roughly 25–50 minutes.

 

So 500 calculations per server would be unrealistic, and a significant lag (0.5–1 sec) would also be introduced into the real-time application.

 

To speed up the calculation there are only two knobs you can tweak:


               Reduce the number of R calls from .NET
               Optimize the R function itself

 

 

To reduce the number of calls from .NET, I changed the R code to accept a matrix instead of a vector. This makes it possible to perform multiple calculations per call on the R thread, which is especially important in history recovery. The following shows an example of the modified regression function:

 

regression <- function(m, WindowSizeInSeconds, NoOfPoints) {
  # things that can be calculated once for all evaluations
  span <- WindowSizeInSeconds / (NoOfPoints - 1)

  # build the model matrix (intercept column plus time axis)
  x <- matrix(c(rep(1, NoOfPoints), seq(0, WindowSizeInSeconds, span)), nrow = NoOfPoints)
  # x <- matrix(seq(0, WindowSizeInSeconds, span), nrow = NoOfPoints)

  # actual worker function
  calc <- function(y) {
    temp <- lm.fit(x, y)
    # temp <- lm(y ~ x, method = "qr", singular.ok = TRUE)
    return(temp$coefficients[2])
  }

  # calculate the slope for each row of the matrix
  apply(m, 1, calc)
}
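
As a quick check, here is a usage sketch of the function above on a sampled sinusoid; the window size, number of points, and number of windows are illustrative values, not service defaults:

# Illustrative parameters only
WindowSizeInSeconds <- 300
NoOfPoints <- 60
nWindows <- 1000
span <- WindowSizeInSeconds / (NoOfPoints - 1)

# Sample a sinusoid and slice it into overlapping windows, one per matrix row
t <- (0:(nWindows + NoOfPoints - 2)) * span
signal <- sin(2 * pi * t / 1800)
m <- t(sapply(seq_len(nWindows), function(i) signal[i:(i + NoOfPoints - 1)]))

slopes <- regression(m, WindowSizeInSeconds, NoOfPoints)  # one slope per window
length(slopes)  # 1000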


 

R offers a couple of packages for benchmarking functions, and I found that 'microbenchmark' works really well. The following shows the difference between two approaches to calculating the linear regression:
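
A benchmark call along the following lines produces output like the table below; this is only a sketch, where x is the same model matrix as in the regression function above and y is a synthetic sample window:

library(microbenchmark)

# Illustrative single window: x is the model matrix, y a synthetic signal
NoOfPoints <- 60
WindowSizeInSeconds <- 300
span <- WindowSizeInSeconds / (NoOfPoints - 1)
x <- matrix(c(rep(1, NoOfPoints), seq(0, WindowSizeInSeconds, span)), nrow = NoOfPoints)
y <- 2 + 0.5 * x[, 2] + rnorm(NoOfPoints)

microbenchmark(
  lm(y ~ x),     # formula interface: convenient, but a lot of overhead per call
  lm.fit(x, y),  # bare QR fit: same slope, far less overhead
  times = 100
)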

 

Unit: microseconds

         expr     min      lq       mean   median       uq      max neval
    lm(y ~ x) 820.018 852.420 1025.02731 951.4145 1163.563 1796.691   100
 lm.fit(x, y)  47.711  53.835   63.00429  59.3205   66.847  140.071   100

 

 

The first expression, lm(y ~ x), is the recommended approach for linear regression, and each call takes ~950 microseconds, or roughly 1 millisecond. The second is a stripped-down version and, as you can see, much faster at ~60 microseconds. So just by changing the R function you can achieve a 15-fold increase in performance.

 

The following shows the R Service performance using the new approach:

 

The yellow curve shows the interpolation queue, which is able to process 450,000 interpolations in under 10 seconds. The green line shows the R calculation queue, which processes 450,000 linear regressions in R in ~45 seconds. The PI system (blue curve) obviously had no problem writing the results back to the data archive.

 

The next figure shows the linear regression (blue) of a sinusoid (green):

 

In summary, the R service approach is promising. I haven't found many R applications that process this many events, so some of the solutions might be unique to this problem. Benchmarking the R functions will always be a necessary step to evaluate at what sampling rate a calculation can be deployed. As I mentioned, there are large differences in performance between R functions\packages. The main reason is that these functions are often designed not for speed but to return a wide range of statistics. So for interactive sessions these functions are very convenient, but they might not be ideal for real-time applications.
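
As a rough rule of thumb for that evaluation, here is a back-of-the-envelope capacity check using illustrative numbers: the ~60 microsecond lm.fit timing from the benchmark above and the 500-tag, 5-second scenario from earlier.

# Back-of-the-envelope capacity check for one single-threaded R service
medianCallSeconds <- 60e-6                       # ~60 microseconds per lm.fit call
callsPerSecond <- 1 / medianCallSeconds          # roughly 16,000 calls/second
tags <- 500                                      # event-triggered calculations
scanSeconds <- 5                                 # source tag sampling rate
eventsPerSecond <- tags / scanSeconds            # 100 events/second arriving
utilization <- eventsPerSecond / callsPerSecond  # well under 1% of the R thread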

 

The next step will be more testing and some more examples. My focus is on signal processing and forecasting, but let me know if you have other applications.

I'm also looking for beta testers – people who are brave enough to run this on their machines …