When I first saw Richard's presentation, I thought AMI might be a use case. You will have large amounts of data coming in that need preprocessing before they become accountable data. Imagine a case where you have 20 million meters providing 15-minute data: that would make roughly 2 billion events per day, per measurement that the meter provides, that need to be processed by a calculation engine (for data gaps, implausible data, etc.).
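The arithmetic behind that estimate is easy to check (a back-of-the-envelope sketch; the 20-million-meter count is just the figure assumed in the example above):

```python
# Back-of-the-envelope check of the event volume estimate above.
meters = 20_000_000                # 20 million meters (figure from the example)
readings_per_day = 24 * 60 // 15   # one reading every 15 minutes -> 96 per day

events_per_day = meters * readings_per_day
print(events_per_day)              # 1_920_000_000, i.e. roughly 2 billion per measurement
```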
The traditional approach of collecting all the raw data, retrieving it again afterward, doing the calculation, and writing back the results does not scale well here. There is some elegance in doing the calculation on the incoming data as it arrives. For me this would definitely be one use case for StreamInsight.
I think the ability to use data from all kinds of real-time systems and database systems is the real power of StreamInsight. Being able to use data from an RDBMS or another system in your real-time calculations will give great flexibility, in my opinion.
There is a fundamental difference between a streaming engine and other types of calculation engines. With StreamInsight, one works on a stream of information constituted by a flow of events. It is possible to build something similar with a custom application or PI ACE by signing up for a list of PI tags. But the difference lies in the way end users interact with this stream, or collection of events, when building analytical rules. I'll give a simple example in the context used by Andreas: imagine that I want to detect outages for a set of power meters, knowing that I have a digital signal for each meter reporting the meter status (on, off), and suppose that I want to know in real time the total number of meters that are out of service. To add a little twist, I know that some meters can send false positive outages, so an outage status is only counted as 'real' when it lasts for at least 5 minutes. Using PI ACE, for example, I would need to:
- Sign up for update on all these tags;
- On each update, check which tag has actually been updated;
- Start a time counter to filter out false positive;
- For any update (event), loop through all the meters to check those with a validated outage status;
- Sum them up.
With a streaming engine such as StreamInsight, one only needs to check the stream of data for events that have lasted 5 minutes or more in a given state and sum them up.
Conceptually this might look the same, but practically the coding is quite different and far easier with a streaming engine.
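To make the contrast concrete, here is a minimal sketch of the streaming-style rule in plain Python (an illustration of the logic, not StreamInsight or ACE code; the event shape and the 5-minute threshold are taken from the example above):

```python
# Debounced outage count: an 'off' status only counts as a real outage
# once it has persisted for at least 5 minutes (300 seconds).
OUTAGE_THRESHOLD = 300  # seconds

def count_validated_outages(events, now):
    """events: list of (meter_id, status, timestamp) in arrival order.
    Returns how many meters are currently 'off' and have been so for
    at least OUTAGE_THRESHOLD seconds as of `now`."""
    latest = {}  # meter_id -> (status, timestamp of last status *change*)
    for meter_id, status, ts in events:
        prev = latest.get(meter_id)
        if prev is None or prev[0] != status:
            latest[meter_id] = (status, ts)  # status changed: restart the clock
    return sum(1 for status, ts in latest.values()
               if status == "off" and now - ts >= OUTAGE_THRESHOLD)

events = [
    ("m1", "off", 0), ("m1", "on", 60),  # false positive: off for only 60 s
    ("m2", "off", 0),                    # real outage: off since t = 0
    ("m3", "off", 400),                  # off, but not yet for 5 minutes
]
print(count_validated_outages(events, now=600))  # -> 1 (only m2)
```

The point of the comparison is that the streaming formulation states the rule once (status, duration, sum), while the ACE-style version has to manage subscriptions, timers, and loops around it.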
As a practical example, here is an excerpt of the StreamInsight documentation explaining how a filter operator would work when programming with LINQ.
In the following example, the events in the event stream someStream are limited to events in which the value in field i is greater than 10. Events that do not meet that criterion are removed from the stream.
var queryFilter = from c in someStream
                  where c.i > 10
                  select c;
In the following example, the output of the query is computed by selecting only the events from the stream ratioStream with a value in the id field that is equal to 2.
var filteredStream = from e in ratioStream
                     where e.id == 2
                     select e;
This shows that the query syntax can be drastically simpler than looping through the whole collection of streams, which is what we would have to do with PI ACE, for example.
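The same declarative-versus-imperative contrast can be shown in a few lines of plain Python (a hypothetical event list stands in for the `ratioStream` above): the filter expression says *what* to keep, the loop spells out *how*.

```python
# Hypothetical event list standing in for a stream of events with an `id` field.
ratio_stream = [{"id": 1, "value": 0.5},
                {"id": 2, "value": 0.9},
                {"id": 2, "value": 1.1}]

# Declarative style, analogous to the LINQ `where e.id == 2` query above:
filtered = [e for e in ratio_stream if e["id"] == 2]

# Explicit-loop style, analogous to what one would hand-code in PI ACE:
filtered_loop = []
for e in ratio_stream:
    if e["id"] == 2:
        filtered_loop.append(e)

print(filtered == filtered_loop)  # -> True; both keep the two id == 2 events
```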
I strongly encourage anyone interested in CEP in general and StreamInsight more specifically to take a look at the help file shipped with the CTP2 of StreamInsight. There is a section about the LINQ query templates giving examples about the different operators (projection, filtering, group and apply, joins, aggregations...).
Our job at OSI will be to provide you with input and output adapters that can be used to build PI data streams so you can leverage this very powerful LINQ syntax.
Of course nothing is ever ideal, and engineering-type calculations are usually harder to implement with CEP semantics. And since a CEP engine only keeps data in the stream for the time slice required by the logic to operate, recalculations and queries that need to run against historical data can become problematic.
I hope this helps. Stay tuned, and we'll have more when the PI adapters are ready for the CTP.
There is no doubt that if you go down to an implementation/coding level, you will find differences between the ACE/StreamInsight programming models. LINQ statements are definitely more terse than explicit loops but, as you say, one is conceptually doing the same thing. I am not particularly a fan of ACE; what I am trying to do is position SI relative to existing approaches such as ACE.
StreamInsight is certainly a nice concept, but how much of the hype is real and how much is due to the Microsoft Marketing machine at work? CEP products have been around for a long time and many are much more mature than SI. The best page I found that lists the various solutions is here: http://rulecore.com/CEPblog/?page_id=47.
Here are a couple of paragraphs from an article written by David Luckham, who is considered the father of CEP (and a co-founder of Rational Software). The article compares Event Stream Processing (ESP) and Complex Event Processing (CEP). The article as a whole makes very nice reading:
[ESP] started in the mid-1990s when the database community realized that databases were too slow to do real-time data analysis. They started researching the idea of running continuous queries on streams of incoming data. They used sliding time windows to speed up the queries. An answer to a query would be valid only over the events in the current time window, but as the window slid forward with time so also the answer was updated to include the new events and exclude the old ones. This research was called Data Streams Management (DSM) and led to the event streams processing world of today. The emphasis was on processing the data in lots of events in real-time.
There’s a fundamental difference between a stream and a cloud. An event stream is a sequence of events ordered by time, such as a stock market feed. An event cloud is the result of many event-generating activities going on at different places in an IT system. A cloud might contain many streams; a stream is a special case of a cloud. But the assumption that you are processing a stream of events in their order of arrival has advantages. It lets you design algorithms for processing the data in the events that use very little memory because they don’t have to remember many events. The algorithms can be very fast. They compute on events in the stream as they arrive, pass on the results to the next computation and forget those events. On the other hand, if you’re processing a cloud, you can’t assume that events arrive in a nice order. You may be looking for sets of events that have a complex relationship. For example, events that should be causally related but are actually independent because of an error. They could be the actions and responses of several processes in a management protocol that are supposed to synchronize and execute a transaction, but sometimes fail. You may have to remember lots of events before you find the ones you’re looking for. In such a case it is critical to know which events caused which others. This takes more memory and more time! On the plus side, you can deal with a richer set of problems, not only event data processing, but also business process management, for example.
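The sliding-time-window idea from the passage above is easy to sketch in plain Python (an illustration of the concept, not of any particular engine's API): a continuous query whose answer always covers only the last N seconds of events, with older events evicted and forgotten.

```python
from collections import deque

# Sliding time window: the running answer covers only events from the last
# WINDOW seconds. Expired events are evicted, which is why stream algorithms
# can use so little memory.
WINDOW = 60  # seconds

window = deque()   # (timestamp, value) pairs currently inside the window
total = 0.0        # running aggregate (sum) over the window

def on_event(ts, value):
    """Process one incoming event and return the updated windowed sum."""
    global total
    window.append((ts, value))
    total += value
    while window and window[0][0] <= ts - WINDOW:  # evict expired events
        _, old = window.popleft()
        total -= old
    return total

print(on_event(0, 10.0))   # -> 10.0
print(on_event(30, 5.0))   # -> 15.0
print(on_event(70, 1.0))   # -> 6.0  (the t=0 event has slid out of the window)
```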
Fine. But isn't this what PI has been addressing for a long time? As one learns more about CEP, it is hard to find something that you cannot do with existing PI tools. At this point, SI looks more like Microsoft taking one more step into the real-time space...
I kind of agree with you, Mohamed: CEP has been around for a long time. And while MSFT's decision to jump into this market space may seem surprising, I do think that at the lowest level there are significant differences in the way they handle time constraints with StreamInsight compared to other CEP products. MSFT is also shooting for very high levels of performance with a minimal resource footprint and easy deployment. So I guess time (and customers) will tell whether this product will be successful and widely adopted. StreamInsight is in the early stages of productization, but research and development has been going on for a while. I have no doubt that MSFT will quickly catch up and compete strongly with the other players in the CEP space.
So what’s in it for OSIsoft, and why would MSFT partner with us and vice versa? Well, MSFT first identified manufacturing as their primary target for the first version of StreamInsight. Partnering with us was obvious given our existing, strong relationship, and of course our expertise in both manufacturing and real-time data. We see StreamInsight as another tool in our analytics kit, allowing end users to build queries in a more ‘relational’ way compared to pure .NET programming or PE syntax. It would be unreasonable for us to build a new calculation engine when we can leverage StreamInsight. It would also be a shame not to take this opportunity to leverage a technology and a product that will hit the market with or without us. And having a very close relationship with MSFT early in the development cycle of StreamInsight allows us to give them a lot of feedback and to make the product more attractive to our customers.
For now, our scope is limited to developing PI input and output adapters for StreamInsight, but we are looking at other ways to leverage this engine: for example, standardizing the data processing capabilities of our interfaces (a.k.a. ‘edge processing’) so that analytics can be pushed very close to the data collection level and reused regardless of the data source.
The following article, co-authored by three Microsoft Research people, investigates a compression technique for data collected in data centers. The technique is basically a wavelet method modified to preserve spikes in the data. I would not be surprised if there is a follow-up article where this is implemented within a StreamInsight adapter that compresses data before sending it to another SI block or to an archive.
Part of the article discusses how different queries require different levels of compression; therefore, over time, some advanced users might require flexibility in choosing the compression technique used for their data. So I imagine that StreamInsight could be, as you say, a candidate for continuing the trend of moving data compression to OSI interfaces. And if things have a plug-in architecture, users could gain flexibility in choosing which data processing rules they want to run at the edge. The real-time infrastructure will be much more distributed than it is now.
Just thinking out loud...