A couple of months ago I developed a Custom Data Reference (CDR) for R calculations, and on the surface it looked fairly simple. But after using it for a while it became clear that this solution does not scale.

The main problem when using many R functions is that most of them require equidistant time series – data sampled with equal spacing. Due to exception and compression, PI data are irregularly spaced and therefore need to be interpolated before they can be used in most R packages. In the CDR, I used the interpolation function of the PI System to do the work, which made the workflow easy. In event-driven calculation mode this is:

New event arrives

Get n interpolated values during a time span

Submit the samples to R

Calculate in R – I used linear regression as an example

Get the result (slope) back

Write to PI
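As a minimal sketch of those six steps – in Python rather than the actual service code, with the PI read/write calls left as placeholders:

```python
import numpy as np

def on_new_event(times, values, window_end, window_seconds=60.0, n=13):
    """Hypothetical event handler: 'times'/'values' are the recorded
    (irregularly spaced) samples; the actual PI SDK calls are not shown."""
    # Step 2: n interpolated values, equally spaced over the time span
    grid = np.linspace(window_end - window_seconds, window_end, n)
    samples = np.interp(grid, times, values)
    # Steps 3-5: submit the samples, fit a linear regression, take the slope
    slope = np.polyfit(grid - grid[0], samples, 1)[0]
    # Step 6: the slope would be written back to the output PI tag here
    return slope

# example: an irregularly recorded signal rising at 2 units/sec
t_rec = np.array([0.0, 3.0, 7.0, 12.0, 20.0, 33.0, 47.0, 60.0])
print(on_new_event(t_rec, 2.0 * t_rec + 1.0, 60.0))  # slope ≈ 2.0
```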

In history recovery mode, a single CDR already slowed down my development system severely, so I didn't even try the same for, say, 100 attributes.

So the basic idea to improve performance is to develop an R Analysis service that takes the load off the PI server and performs all the data conditioning and the calculations. The requirements are:

Each calculation/point is independently configurable

Calculations are event-triggered or time-triggered (elapsed)

Interpolations are performed by the service

Interpolated or recorded data are sent to R

Results are written back to a PI tag

Each window calculation runs in sequence – this is required to carry results over between time frames

Since R is single-threaded, I thought ~500 calculations per service would be a good goal.

To isolate each step, I created three worker threads:

Interpolation

R Calcs

Write to PI

and used Rx for all event-driven components.
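The three-stage setup can be sketched as a chain of worker threads connected by queues – a Python illustration of the pattern, with trivial stand-ins for the interpolation, R call, and PI write:

```python
import queue
import threading

def stage(worker, inq, outq=None):
    """One worker thread: consume from inq, apply worker, forward to outq.
    A None item shuts the stage down and is propagated downstream."""
    def run():
        while True:
            item = inq.get()
            if item is None:
                if outq is not None:
                    outq.put(None)
                return
            out = worker(item)
            if outq is not None:
                outq.put(out)
    t = threading.Thread(target=run)
    t.start()
    return t

interp_q, calc_q, out_q = queue.Queue(), queue.Queue(), queue.Queue()
results = []

threads = [
    stage(lambda w: [v * 1.0 for v in w], interp_q, calc_q),        # interpolation placeholder
    stage(lambda w: (w[-1] - w[0]) / (len(w) - 1), calc_q, out_q),  # "R calc": slope of the window
    stage(lambda r: results.append(r), out_q),                      # write-to-PI placeholder
]

for window in ([0, 2, 4], [1, 4, 7]):
    interp_q.put(window)
interp_q.put(None)          # drain and shut the pipeline down
for t in threads:
    t.join()
print(results)              # one slope per window, in input order
```

Because each stage is a single thread reading a FIFO queue, the window calculations stay in sequence, which matches the carry-over requirement above.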

When I first tested the application, I measured around 1–2 milliseconds per R call, which I thought was fast enough for most purposes. That assumption quickly turned out to be wrong…

Let’s say you have a source tag with a 5-second sampling rate and you perform event-based calculations. For a history recovery over a 3-month period this translates to:

5 sec/sample = 720 values/hour = 17,280 values/day ≈ 1.6 million values / 3 months

At 1–2 milliseconds of latency per R call you can process only 500–1,000 values/second, so the 1.6 million values would take somewhere between half an hour and an hour.

So 500 calculations per server would be unrealistic, and a significant lag would also be introduced into the real-time application (0.5–1 sec).
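A few lines of Python make the throughput arithmetic explicit (assuming a 90-day recovery window and the conservative 500 calls/second figure):

```python
# Throughput estimate for event-based history recovery at a 5-second
# sampling rate, over a 90-day (3-month) window, at ~2 ms per R call.
SAMPLE_SECONDS = 5
per_hour = 3600 // SAMPLE_SECONDS          # 720 values/hour
per_day = per_hour * 24                    # 17,280 values/day
total = per_day * 90                       # ~1.56 million values / 3 months
calls_per_second = 500                     # 2 ms latency per R call
hours = total / calls_per_second / 3600    # pure R-call latency, in hours
print(per_hour, per_day, total, round(hours, 2))
```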

To speed up the calculation there are only two knobs you can turn:

Reduce the number of R calls from .NET

Optimize the R function itself

To reduce the number of calls from .NET, I changed the R code to accept a matrix instead of a vector. This allows multiple calculations to be performed per call on the R thread, which is especially important in history recovery. The following shows the modified regression function:

regression <- function(m, WindowSizeInSeconds, NoOfPoints) {
    # things that can be calculated once for all evaluations
    span <- WindowSizeInSeconds / (NoOfPoints - 1)
    # build the model matrix (intercept column plus the time grid)
    x <- matrix(c(rep(1, NoOfPoints), seq(0, WindowSizeInSeconds, span)), nrow = NoOfPoints)
    # x <- matrix(seq(0, WindowSizeInSeconds, span), nrow = NoOfPoints)
    # actual worker function
    calc <- function(y) {
        temp <- lm.fit(x, y)
        # temp <- lm(y ~ x, method = "qr", singular.ok = TRUE)
        return(temp$coefficients[2])
    }
    # calculate for each row of the matrix
    apply(m, 1, calc)
}
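For illustration, the same batching idea – one call, many windows – can be written vectorized in Python/numpy (a sketch, not the service code; a closed-form slope stands in for lm.fit):

```python
import numpy as np

def batch_slopes(m, window_seconds):
    """Least-squares slope for every row of m, where each row is one
    window of equally spaced samples spanning window_seconds."""
    n = m.shape[1]
    t = np.linspace(0.0, window_seconds, n)
    tc = t - t.mean()                       # centered time grid
    # slope = cov(t, y) / var(t), computed for all rows at once
    return (m - m.mean(axis=1, keepdims=True)) @ tc / (tc @ tc)

# two windows with known slopes 2.0 and -0.5
grid = np.linspace(0.0, 60.0, 13)
windows = np.vstack([2.0 * grid + 1.0, -0.5 * grid + 3.0])
slopes = batch_slopes(windows, 60.0)
print(slopes)  # ≈ [ 2.  -0.5]
```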

R offers a couple of packages for benchmarking functions, and I found that 'microbenchmark' works really well. The following shows the difference between two approaches to calculating the linear regression:

Unit: microseconds

         expr     min      lq       mean   median       uq      max neval
    lm(y ~ x) 820.018 852.420 1025.02731 951.4145 1163.563 1796.691   100
 lm.fit(x, y)  47.711  53.835   63.00429  59.3205   66.847  140.071   100

The first expression, lm(y ~ x), is the recommended approach to linear regression; each call takes ~950 microseconds, or roughly 1 millisecond. The second is a stripped-down version and, as you can see, much faster at ~60 microseconds. So just by changing the R function you get a roughly 15-fold increase in performance.
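The same convenience-versus-speed trade-off exists in other environments; as a rough Python analogue (np.polyfit playing the role of lm, a closed-form slope that of lm.fit), timed with the standard timeit module:

```python
import timeit
import numpy as np

t = np.linspace(0.0, 60.0, 13)      # regular time grid
y = 2.0 * t + 1.0                   # one window of samples
tc = t - t.mean()

def full():                         # convenience routine: full polynomial fit
    return np.polyfit(t, y, 1)[0]

def lean():                         # stripped-down closed-form slope
    return ((y - y.mean()) @ tc) / (tc @ tc)

assert abs(full() - lean()) < 1e-9  # both return the same slope (2.0)
print(timeit.timeit(full, number=2000), timeit.timeit(lean, number=2000))
```

As with lm versus lm.fit, the general-purpose routine pays for input checks and extra statistics on every call.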

The following shows the R Service performance using the new approach:

The yellow curve shows the interpolation queue, which is able to process 450,000 interpolations in under 10 seconds. The green line shows the R calculation queue, which processes 450,000 linear regressions in R in ~45 seconds. The PI System (blue curve) obviously had no problems writing the results back to the Data Archive.

The next figure shows the linear regression in blue of a sinusoid in green:

In summary, the R service approach is promising. I haven't found many R applications that process this many events, so some of the solutions might be unique to this problem. Benchmarking the R functions will always be a necessary step to evaluate at what sampling rate a calculation can be deployed. As mentioned, there are large differences in performance between R functions/packages. The main reason is that these functions are often designed not for speed but to return a wide range of statistics. For an interactive session these functions are very convenient, but they might not be ideal for real-time applications.

The next step will be more testing and some more examples. My focus is on signal processing and forecasting, but let me know if you have other applications.

I'm also looking for beta testers – people who are brave enough to run this on their machines…