This post represents my own thoughts and experience on the topic of 'Data Historian vs Data Lake', and is my response to an older post called Discussion: Will the data historian die?, which has recently been floated in the forums again. It's not intended to be an in-depth technical treatise, but rather a record of my own experiences and observations.
Nearly two years ago I first came across a product called Splunk. The sales guy at the company I was working for had started playing around with Splunk as an option for alarm management, which led him to eventually come and ask me the question 'can we get PI data into Splunk?' I did a little research, and my first attempt at this integration was to download a Splunk SDK and write a C# modular input for Splunk, based on the AF SDK (naturally). It was a pretty straightforward exercise, and within a couple of hours I had a means of extracting PI point data and ingesting it into Splunk. Suffice it to say, the sales guy was pretty impressed, and he proceeded to set up a meeting with his contacts at Splunk to show them what we had done.
That first meeting was what started our journey with Splunk and PI together. The folks from Splunk were pretty excited with our simple integration, and then I started talking about the limitations of what I had created. The concept of a modular input is to push data into Splunk, and firstly, I couldn't see this as a scalable option for large volumes of PI data - a high point or attribute count would require multiple instances of the modular input, and these didn't seem to play too well with multithreading or parallel processing. Secondly, there was the data duplication problem: copying data from one system to another in this way just wasn't cool. As we discussed this issue, one of the Splunk guys mentioned a fairly new technology they had developed called virtual indexing. In short, it represented a method of referencing data held in an external system via something they called an External Result Provider (ERP).
Over the next several months we started down this path. I developed a prototype ERP for PI, and then eventually handed it over to a graduate engineer as my billable project load took me away from it. Our first attempt at a PI ERP for Splunk was Windows based (C# and AF SDK), with the core functionality focused on referencing PI Points directly. It worked, it was reasonably fast, and it left the data in PI while we analysed and visualised it in Splunk. The excitement both in our office and at Splunk was palpable, and we started talking about commercialising this as a product.
The native Windows version of this ERP was short-lived, as the first customer that wanted to road test it only had a Linux based implementation of Splunk. As we looked a bit deeper, it appeared that this was the more common approach, and Splunk Cloud was also Linux based. So we redesigned the ERP to be cross platform - rewritten in Java and using the PI Web API for PI data access. Not quite as fast as the native AF SDK version, but we eventually managed to optimise it sufficiently that it was nearly so.
As we were both an OSIsoft partner and a Splunk partner, we found ourselves in the interesting position of brokering some fairly high level conversations between OSIsoft and Splunk due to the nature of our integration technology between the two platforms. There had been some contact between the two companies prior to the development of our ERP, but this bumped things up a level or two. To say things were 'interesting' would be an understatement. Some within both OSIsoft and Splunk saw the two companies as competitors, and others as potential partners. Our own take was that Splunk offered some analytic capability that OSIsoft didn't, and that our ERP was a means of leveraging this, just as OSIsoft's own Integrators make PI data available to external analytics platforms (Esri ArcGIS, BA Integrator, etc). From our perspective, our target market was customers who already had both PI and Splunk, and we found there was no shortage of them. We were not interested in selling Splunk to PI customers.
In October 2017 Splunk went to the EMEA Users Conference in London as a sponsor, and the managing director of my then employer and I went as guests of Splunk. We spent a couple of days in the Partner Expo hall where we demo'd Splunk and our ERP as an integration with PI, with the key focus being integration with Splunk's Machine Learning Toolkit, where we analysed our PI data. Overall we had a fairly positive reception from both the customers we spoke to and many of the OSIsoft folks who came to see.
Data Historian vs Data Lake
To start with, some definitions are in order.
A data historian is typically regarded as "a complementary set of time-series database applications that are developed for operational process data" (Wikipedia). "Data historians are commonly used where reliability and uptime are critical. The programs are used to gather information about the operation of programs in order to diagnose failures. Data historians are most common in datacenters and industrial control systems" (TechTarget).
A data lake is "a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video)" (Wikipedia). A fundamental concept for data lakes is "schema-on-read", where structure is applied to data when it is read out of the repository. In a platform like Splunk, this is done using search processing language (SPL) queries.
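Schema-on-read is easy to illustrate outside Splunk. The Python sketch below stores raw text events as-is and only applies a field structure when the data is read; the log format and regex are made up for illustration, standing in for what an SPL search (e.g. with field extraction) would do inside Splunk.

```python
import re

# Raw events are stored exactly as received; no schema is imposed at write time.
raw_events = [
    "2024-05-01T10:00:00Z pump=P101 flow=42.7 status=RUNNING",
    "2024-05-01T10:01:00Z pump=P101 flow=43.1 status=RUNNING",
]

# Schema-on-read: structure is applied only when the data is queried.
FIELD = re.compile(r"(\w+)=(\S+)")

def read_with_schema(event):
    """Apply a field structure to a raw event at read time."""
    timestamp, _, rest = event.partition(" ")
    fields = dict(FIELD.findall(rest))
    return {"timestamp": timestamp, **fields}

for row in map(read_with_schema, raw_events):
    print(row["timestamp"], row["pump"], float(row["flow"]))
```

The point is that the same raw events could be read with an entirely different "schema" tomorrow without re-ingesting anything - which is both the appeal and, as discussed below, part of the performance cost.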
As a long-term PI guy, my key observation about Splunk is that the majority of data people push into it is text based. Log files are probably the most common form of data ingested into Splunk, and I've even written an earlier blog post on analysing PI Message Log data with Splunk. Having said this, we found that when integrating PI data with Splunk - whether through our original modular input attempt (ingesting data into Splunk) or via the ERP (virtual indexing) - the data is also converted to text. This was one of the key performance issues we encountered with the ERP: all data was converted from the JSON returned by the PI Web API into the CSV format Splunk required, but it was all text. Why is this a problem? Because working with text carries overhead when you want to do numeric work with it - the textual representations of numbers have to be converted back to numbers on the fly before any calculation can be performed. In the 7.x release, Splunk introduced a new type of index called metrics. It's oriented towards time series data and actually stores numeric values, not text representations of numbers. Unfortunately, a virtual metrics index wasn't technologically possible with our ERP, and getting PI data into a metrics index via our ERP would require the data to be permanently ingested into that index, taking us back to the data duplication problem.
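To make the text-conversion point concrete, here is a minimal Python sketch of the kind of JSON-to-CSV flattening our ERP performed. The payload shape is abridged and the tag name is made up - this is not the full PI Web API response schema, just an illustration of how every value ends up as a string.

```python
import json

# Illustrative payload shaped like a PI Web API recorded-values response
# (field names abridged; not the full real schema).
payload = json.dumps({
    "Items": [
        {"Timestamp": "2024-05-01T10:00:00Z", "Value": 42.7},
        {"Timestamp": "2024-05-01T10:01:00Z", "Value": 43.1},
    ]
})

def to_csv_rows(tag, body):
    """Flatten JSON events into the text/CSV shape handed to Splunk."""
    items = json.loads(body)["Items"]
    return [f"{tag},{it['Timestamp']},{it['Value']}" for it in items]

rows = to_csv_rows(r"\\PISERVER\Sinusoid", payload)
# Every value is now text: any numeric work downstream means parsing
# "42.7" back into a float for every single calculation.
print(rows[0], "-", len(rows[0].encode()), "bytes")
```

Note that the tag name and full timestamp are repeated on every row, which is also where the per-event size overhead discussed later comes from.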
One of the 'advantages' I often heard in respect to Splunk was its schema-less storage of data - "It doesn't limit what you can do with the data or force you into viewing it a certain way", "You can easily discover new relationships between data", or "You can create new mashups of data that seem otherwise 'unrelated'". All these things are true. They are also equally true with process data stored in the PI System. One of the best examples I have seen of this is discussed by a power customer here during their San Francisco PI World presentation (AGL Energy’s Real-Time Data Journey Continues). There are many other examples of this across all industries, across the PI customer base.
As an end user, getting data out of Splunk is done by crafting queries in their Search Processing Language (SPL). Before you look at that last link (which is to the SPL documentation), first look at the blurb here. During my 10 years as a trainer with OSIsoft (and in the years since I started working as a system integrator) I've trained PI users of all skill levels, from advanced users through to people for whom simply logging on to the computer was itself a technical challenge. What I have noticed, though, is that most people are able to very quickly start getting data out of PI using tools like PI Vision, PI ProcessBook or PI DataLink and start doing amazing things with it. Now look at the SPL documentation link. This is how you get data out of Splunk, whether you're an IT professional or an end user. Sadly, it seems that many end users of Splunk data will get their information through dashboards that someone else (usually an IT pro) has created for them, because SPL is more complicated than many users care for. Business users don't deal very well with SQL, let alone an SQL-esque query language with pipes between commands. Even the visualisations you create are via SPL - you pipe data to specific chart or visual widget types, and effectively configure these in the SPL. Splunk is still very much geared towards IT users.
Data connectivity should also be addressed here. In the past couple of years Splunk has been moving towards industrial and OT connectivity. A key component of this is a new module called Industrial Asset Intelligence (IAI). At my previous company we saw this from very early in its development, and one reason Splunk were excited to work with us was the possibility of using our connector between PI and Splunk to bring operational asset data into Splunk, and potentially IAI. There is currently very limited connectivity to industrial control systems - an OPC add-on is available to access data from OPC servers, but other industrial protocols just aren't there for Splunk. Compare that to the connectivity available for PI - more than 450 interfaces to almost any data source and protocol you can think of. Now, I have seen some people using Splunk to capture their process data. I met a guy from a company that owns a number of wind farms (which are often operated by others), and they were pulling their data straight into Splunk. They had some very pretty Splunk dashboards to show what was happening at their wind farm assets. But they didn't have granularity into the operating data of the turbines themselves - all they could see was power generation information, some weather related data (wind speed etc) and a very high-level set of turbine performance metrics. The ability to see into what was happening with the actual equipment - to drill into equipment issues - wasn't there. They still had to rely on access to this data from within the onsite control system, and that was only available to the operators. What data they had ultimately came from ingesting data files into Splunk - direct machine-to-machine communication wasn't the integration path used.
Integration of Splunk with other systems such as enterprise asset management systems (EAMS), enterprise resource planning (ERP) systems and similar types of systems is extremely limited from what I have seen. Doing something like integrating PI with Maximo or SAP in a condition based monitoring use case is not unusual, and can be done with OSIsoft's data access technologies. Doing the same thing with Splunk is a different story - some vendors provide their own Splunk add-ons but they tend to be of limited utility in this kind of use case.
Splunk has been positioning themselves as an IoT/IIoT repository in recent years, and it seems they have better connectivity in that area, but I don't have any real exposure to that.
Earlier I mentioned using PI data with Splunk's Machine Learning Toolkit (MLTK). This was a lot of fun, and you can do some cool things with it. It's Python based, and it is possible to create your own scripts with data models and analyses and add them into the Toolkit, but I never looked into how easy or difficult this was. Overall, analytics in Splunk is very different to analytics in the PI System. Creating analyses in Splunk is again done via the SPL: once you have extracted data in the schema you desire, you pipe it to additional SPL commands to create calculations. Depending on what you want to do, there are some GUI based options in the MLTK for generating that part of the SPL query, but you will likely still end up writing a good chunk of the SPL by hand. There are some things you can do in Splunk analytics that you can't do in PI, and vice versa. Time-weighted calculations, however, are not a thing in Splunk - a real limitation when working with time series process data.
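For readers unfamiliar with the concept, here is a minimal Python sketch of a time-weighted average over irregularly sampled events - the kind of calculation a process historian does natively. The stepped (sample-and-hold) weighting shown is one common historian convention; the data is made up.

```python
from datetime import datetime, timedelta

def time_weighted_average(events, end):
    """Time-weighted average with stepped (sample-and-hold) weighting:
    each value counts in proportion to how long it remained current,
    rather than each sample counting equally."""
    total = 0.0
    duration = timedelta(0)
    # Pair each event with the next event (or the end of the window).
    for (t0, v0), (t1, _) in zip(events, events[1:] + [(end, None)]):
        dt = t1 - t0
        total += v0 * dt.total_seconds()
        duration += dt
    return total / duration.total_seconds()

t = datetime(2024, 5, 1, 10, 0)
events = [
    (t, 10.0),                          # 10.0 held for 50 minutes
    (t + timedelta(minutes=50), 20.0),  # 20.0 held for 10 minutes
]
# A naive mean of the two samples is 15.0; the time-weighted average
# reflects that the value was 10.0 for most of the hour.
print(time_weighted_average(events, t + timedelta(hours=1)))  # ≈ 11.67
```

Doing this in SPL over text events means first parsing timestamps and values, then reconstructing the hold durations yourself - there is no built-in equivalent.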
The other area that often came up was licensing costs. Splunk is licensed by the amount of data you ingest, not by datasource (tag) count. Typically you buy a license that allows you to ingest X volume of data per day - 2GB, or 10GB, or whatever you need. To those of us in the PI world (no pun intended), 10GB of data per day might sound like a lot; there are some very large Splunk users that ingest terabytes per day. When we were developing our PI connector for Splunk, we saw that converting data from PI into Splunk's required CSV format expanded the size of the data we pulled from PI. With the tag name, timestamp and value, as well as the required data packet headers for the ERP, a single event from PI would weigh in at over 100 bytes. Really? That much? Wow <sarcasm />. For AF connectivity, adding the AF attribute name (and AF path) increased the size of each ingested event even further. When you are paying by the GB for data going into Splunk, this becomes a big deal. With the virtual indexing technology, data accessed via an ERP such as ours still counted against the daily data ingestion limit in Splunk. Multiply that 100 or so bytes per value by tens or hundreds of thousands (or even millions) of events and you are talking about large quantities of data for every query to PI. Data in the Splunk virtual index has a very short lifespan in the local cache, so most of the time you are going to be making a call back to the PI system and re-ingesting that data into Splunk. Even if you were ingesting directly into a standard Splunk index instead of using the virtual index technology, the data usage (and storage) is much higher than with PI point types. Archive storage in PI is also much more optimised via its record structure - PI point names aren't required to be stored with every data event, nor is the full timestamp.
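A quick back-of-envelope calculation shows how fast that per-event overhead adds up. The tag count and scan rate below are illustrative assumptions, not measurements from any real system.

```python
# Back-of-envelope ingestion math using the ~100 bytes/event figure above.
# Tag count and scan rate are illustrative assumptions, not measurements.
bytes_per_event = 100
tags = 100_000
events_per_tag_per_day = 24 * 60 * 60 // 10   # one event every 10 seconds

daily_bytes = bytes_per_event * tags * events_per_tag_per_day
daily_gb = daily_bytes / 1024**3
print(f"{daily_gb:.1f} GB/day")  # ~80 GB/day against the ingestion licence
```

A modest 100,000-tag system on a 10-second scan would, on these assumptions, blow through a 10GB/day licence roughly eight times over - before any re-ingestion from virtual index cache misses is counted.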
10GB of data stored in PI archives therefore represents a significantly larger dataset than the equivalent volume of point-based process data stored in Splunk.
Ultimately the question arises: how does the cost of Splunk licensing compare to PI licensing? Splunk can look cheaper ($25K per year for 10GB/day of data, anyone?), but it can become remarkably easy to ingest 10GB per day into Splunk. The cost per GB/day goes down when you buy a higher volume license ($2500 per GB for 10GB/day compared to $1500 per GB for 100GB/day), but you can still be looking at some serious coin for 10-50GB per day licensing. Do you buy a perpetual license, or an annual subscription? There is a cost difference per GB there too ($2500 per GB for 10GB/day perpetual compared to $1000 per GB for 10GB/day on an annual subscription), but then on a subscription you are paying that amount every year (unless the subscription cost goes up). With PI you buy a one-off license and then pay 15% annual SRP for support and maintenance.
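Using the list prices quoted above, a rough break-even calculation between the perpetual and subscription options looks like this (support and maintenance costs are ignored, and real quotes will differ):

```python
# Break-even between the perpetual and subscription prices quoted above
# for a 10GB/day licence. Support/maintenance costs ignored for simplicity;
# real quotes will differ.
gb_per_day = 10
perpetual = 2500 * gb_per_day       # $25,000 one-off
subscription = 1000 * gb_per_day    # $10,000 per year

break_even_years = perpetual / subscription
print(break_even_years)  # 2.5 -> subscription costs more from year 3 on
```

In other words, at these list prices a subscription overtakes the perpetual licence cost partway through year three - before you factor in any ongoing support fees on either side.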
Which leads into the next part of this discussion. OSIsoft technical support is great. It was great when I worked there, and it's still great today. The 15% SRP we pay with our PI license gives us 24x7x365 support for all of our PI related issues. Splunk support is very different from what we are used to when dealing with OSIsoft. More often than not you are going to rely on forums such as Splunk Answers or maybe Stack Overflow, and both of these communities are very different places (I've seen some very snarky responses to questions in both forums, but there are also some great community contributors). Paid support for Splunk is very SLA driven - you won't necessarily get a quick resolution to your issues (or even a quick initial response, depending on the assessed criticality of your issue).
I've rambled enough and need to wrap this up. Could Splunk be a suitable alternative to PI? Referring back to Keith Ward's question in the post I mentioned at the very beginning - when an IT sysadmin asks "why do we need PI when we have Splunk" - the answer should be of the 'right tool for the right job' variety. Splunk is great for IT data; this is where Splunk started, and where its strength still lies. It's not so great for time series process data (besides seriously lacking connectivity into OT systems). Splunk is trying to play catchup in this space, and maybe one day it will get there. But for now, trying to use it as a full featured process data historian optimised for time series data is a bit like choosing Microsoft Access for your enterprise data warehouse. It's a database, right? Use the right tool for the right job.