2017 November Webinar: Leverage the Right Technology to Avoid a Data Swamp

Document created by pisqwebinars Employee on Nov 15, 2017. Last modified by pisqwebinars Employee on Nov 17, 2017.


Q: What is meant by "schema"?

A: A schema is a definition of how a database is built. It defines the content of tables and the relationships between records in the tables.


Q: Can you give some examples of TSDB?

A: Examples of Time Series Databases are OSIsoft PI System, Honeywell PHD, and Yokogawa Exaquantum.


Q: What is the recommended approach for data cataloging with something like PI AF versus a data lake, where data is collected redundantly? I am looking for clarity around redundancy in data lakes and data catalogs: does a data lake require you to store a copy of time-series data that already exists in a different database?
A: Copying data always has the downside of data synchronization. It is important to have a good understanding of the use case and of the impact when updates made in the historian are not transferred to the data lake. In a real-time environment it is not uncommon for data to come in late (data latency). Example: in the case of a network failure or other technical issue, the connection between the interface and the historian can break. The interface will start buffering the data and, once the issue is fixed, push the data to the historian. This delay could be minutes or, in the case of a serious network failure, days. The result is that the interface between the historian and the data lake will not capture the data from the outage, as it doesn't know it should recover it. An alternative approach is to request the data from the historian at the moment it is needed in the data lake. Some data lake technologies can use pass-through queries and pull data from the historian as soon as a view in the data lake references it.
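
The pass-through idea can be sketched in a few lines. This is a minimal, hypothetical illustration (the `Historian` and `PassThroughView` names are invented, not a real OSIsoft API): because the view reads the historian on demand, late-arriving buffered data is picked up automatically, with no copy to keep in sync.

```python
from datetime import datetime, timedelta

class Historian:
    """Hypothetical in-memory historian: tag -> list of (timestamp, value)."""
    def __init__(self):
        self.archive = {}

    def write(self, tag, ts, value):
        self.archive.setdefault(tag, []).append((ts, value))

    def query(self, tag, start, end):
        # Always reads the archive, so back-filled (buffered) data is included.
        return [(ts, v) for ts, v in self.archive.get(tag, []) if start <= ts <= end]

class PassThroughView:
    """A data-lake 'view' that passes queries through to the historian
    instead of keeping its own copy of the time-series data."""
    def __init__(self, historian, tag):
        self.historian = historian
        self.tag = tag

    def rows(self, start, end):
        return self.historian.query(self.tag, start, end)

h = Historian()
t0 = datetime(2017, 11, 1)
h.write("flow", t0, 10.0)
view = PassThroughView(h, "flow")
# A buffered interface back-fills a sample after an outage is fixed ...
h.write("flow", t0 + timedelta(minutes=5), 12.5)
# ... and the view still sees it, because nothing was copied in advance.
print(len(view.rows(t0, t0 + timedelta(hours=1))))  # 2
```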


Q: Who are the big players in the Data Lake industry right now and do you really need a Data Scientist to start getting information out of one?
A: A few key players in the data lake world are Hadoop, Microsoft Azure, SAP HANA, and Amazon S3. When the right tooling is used, a data scientist is not always needed to do the analysis. A few partners of OSIsoft deliver toolkits that bring big data analytics to business users. Examples are Element Analytics, TrendMiner, and Seeq.


Q: Do you consider GE Predix to be a Data Lake?
A: The conceptual design of GE Predix is based on Open Source technology. The cloud engine of Predix is based on Cloud Foundry. This is an open source toolkit to build a PaaS developer framework. At the backend, data lake technologies like MongoDB are available for the Predix toolkit to store data. As it is open source it allows other applications to access the data.


Q: You mentioned PI AF as a tool for structuring real-time data.  Are there viable standards to base this on - or will each company need to create their own Asset Framework/Data Model?
A: Industry standards for Digital Twins are in progress. An example is ISO 15926. Toolkits like Element Analytics help provide a base set of templates of equipment definitions. Also on PI Square you can find good examples of equipment templates.


Q: When you refer to a data lake for real-time data, are you talking about Spark rather than Hadoop?
A: Apache Spark is more a general engine for large-scale data processing. It is able to connect to and process data from a variety of sources. These data lake sources can be Hadoop, Apache HBase, or Cassandra. In the context of this presentation we refer to technologies like Hadoop as a data lake for real-time data.


Q: Do you support edge computing so that most RT data stays on the facility for operational intelligence?
A: There are many use cases where data processing on the edge is very useful. An example is running data analytics on a remote compressor in a gas distribution network. Instead of pushing all the vibration data to a cloud-based data lake, the vibration data is processed locally, alerts are raised locally, and only the analysis results are forwarded to a central store.
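
As a rough sketch of that edge pattern (the sample values and threshold are invented for illustration), the edge node processes the raw vibration stream locally and forwards only a small summary to the central store:

```python
import statistics

# Hypothetical raw vibration samples collected at the edge (mm/s).
raw_vibration = [0.8, 0.9, 1.1, 4.2, 0.7, 0.9, 1.0, 0.8]
ALERT_THRESHOLD = 3.0  # assumed local alert limit

# Process locally: raise alerts on-site ...
alerts = [v for v in raw_vibration if v > ALERT_THRESHOLD]

# ... and forward only the analysis results, not the full sample stream.
summary = {
    "mean": round(statistics.mean(raw_vibration), 3),
    "max": max(raw_vibration),
    "alert_count": len(alerts),
}
print(summary["alert_count"])  # 1
```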


Q: Could you elaborate on standard definitions of equipment?
A: A standard definition or Digital Twin is a digital representation of physical equipment. For every dynamic parameter there is a sensor, and the sensors are combined into a template. Here's an example template:

Pump
- Power consumption
- Flow
- Inlet pressure
- Outlet pressure
- Filter delta pressure
- Bearing temperature
- Bearing vibration
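
As an illustration, such a template can be represented in code. This is a minimal sketch (the class and function names are hypothetical, not the PI AF object model): the template names the sensor attributes, and instantiating it for a concrete asset maps each attribute to a historian tag.

```python
from dataclasses import dataclass

@dataclass
class EquipmentTemplate:
    """Hypothetical equipment template: the attributes every
    instance of this equipment type must carry."""
    name: str
    attributes: list

PUMP_TEMPLATE = EquipmentTemplate(
    name="Pump",
    attributes=[
        "Power consumption",
        "Flow",
        "Inlet pressure",
        "Outlet pressure",
        "Filter delta pressure",
        "Bearing temperature",
        "Bearing vibration",
    ],
)

def instantiate(template, asset, tag_map):
    """Bind a concrete asset to the template by mapping each
    attribute to its sensor tag; refuse incomplete mappings."""
    missing = [a for a in template.attributes if a not in tag_map]
    if missing:
        raise ValueError(f"{asset}: unmapped attributes {missing}")
    return {a: tag_map[a] for a in template.attributes}
```

Enforcing the template at instantiation time is what makes analytics reusable: every pump instance is guaranteed to expose the same attribute set.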

Q: What do you think of technologies like Seeq for improving the operational intelligence interface?
A: Seeq and other tools like TrendMiner and Element Analytics have simplified data preparation and data analytics in such a way that big data analytics is now possible for non-data scientists. For these tools too, it is important to have the data prepared. They run perfectly on top of a modern real-time infrastructure in combination with a data lake.


Q: Will historical data be stored in data lakes?
A: The raw data will most likely stay in the real-time infrastructure. The cleansed, contextualized, and formatted data will be transferred to the data lake. There, the data can be integrated with data from other types of sources. Use cases like advanced analytics or applications like dashboards and corporate reporting will use the preprocessed data in the data lake.


Q: If you have different data sources, how do you analyze the data in a data lake without the right time stamps or a reference signal?
A: This is indeed a challenge. For this reason it is important to preprocess all data in a real-time system and send time-synchronized data to the data lake.
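
One common form of that preprocessing is resampling every source onto a shared timestamp grid. A minimal sketch in plain Python (no historian API assumed), using last-known-value hold as the interpolation rule:

```python
def resample_hold(samples, grid):
    """Snap irregular (time, value) samples to the grid times,
    holding the last known value (step interpolation)."""
    samples = sorted(samples)
    out, i, last = [], 0, None
    for t in grid:
        while i < len(samples) and samples[i][0] <= t:
            last = samples[i][1]
            i += 1
        out.append((t, last))
    return out

flow = [(0, 10.0), (7, 11.0)]    # source A, irregular timestamps (seconds)
pressure = [(2, 3.1), (5, 3.4)]  # source B, different timestamps
grid = [0, 5, 10]                # common reference grid

aligned = {
    "flow": resample_hold(flow, grid),
    "pressure": resample_hold(pressure, grid),
}
# Both signals now share the same timestamps and can be joined row by row.
print(aligned["flow"])  # [(0, 10.0), (5, 10.0), (10, 11.0)]
```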


Q: There is more and more hype about edge computing and real-time analytics, i.e. data that would be analyzed before it is stored in a data historian, like the PI Server. Do you think that this is possible? What about data contextualization and integration?
A: As in the gas compressor example above, it will indeed be useful in some cases to do the processing on the edge. The infrastructure should enable context synchronization between the central system and the edge system. In the case of OSIsoft, it is possible to run a full environment, including AF, on the edge.


Q: Are there any use cases where PI System data is being pulled into a Data Lake and the business is using the data to improve its production?  If so, where can I find them?
A: There are some good use cases available. In PI Square you can find several presentations from previous Users Conferences. A nice example is the Deschutes Brewery presentation.


Q: There's a lot of hype around data lakes and I'm thinking it may be overkill in many contexts.  What might be some example use cases where one would use a data lake versus using just vanilla PI?
A: Good examples are Machine Learning based on many years of RT data in combination with maintenance records. Other examples are corporate dashboards with data from multiple historians in combination with ERP data.


Q: Does the collection of Alarm and Event data present a problem?
A: For processing A&E data, correct time synchronization is a must, especially when the sequence of events is important for analyzing what happened in the production process. Here too, a hybrid solution will help achieve time synchronization.
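
A toy illustration of why this matters for sequence-of-events analysis (the sources, offsets, and events are all invented): if one source's clock is skewed, the raw timestamps suggest the wrong order, and applying the known per-source offset before sorting restores the true sequence.

```python
# Assumed per-source clock skew in seconds (hypothetical values).
OFFSET_S = {"plc": 0.0, "dcs": -2.0}

# Events as (source, raw_timestamp, message); by raw timestamps the
# "trip" (dcs, 12.0) appears to follow the alarm only by accident of skew.
events = [
    ("dcs", 12.0, "trip"),       # really happened at 10.0
    ("plc", 9.5, "high alarm"),  # really happened at 9.5
]

def synchronized(events):
    # Correct each timestamp by its source offset, then sort
    # to recover the true sequence of events.
    return sorted((t + OFFSET_S[src], src, msg) for src, t, msg in events)

seq = [msg for _, _, msg in synchronized(events)]
print(seq)  # ['high alarm', 'trip']
```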



Watch On-Demand version of this webinar


Download Slide decks


Read White Paper


Watch John speak at the 2017 Houston Regional Seminar







John de Koning