I am going to take a stab at this, although I am not familiar with Apache Spark. The recommended method for interfacing PI System data with business intelligence tools is PI Integrator for Business Analytics.
This is the Reference Architecture for Streaming Analytics that was presented at the EMEA Users Conference 2017 in London.
I noticed this in the Spark documentation, in the "Creating streaming DataFrames and streaming Datasets" section:
Socket source (for testing) - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees.
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
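To illustrate the "atomically placed" requirement of the file source, here is a minimal sketch in Python. It writes a batch of records to a staging directory first and then moves the finished file into the directory that Spark would watch. The function name, the directory names, and the record layout are all made up for illustration, and the sketch assumes both directories live on the same file system so that the move is a single rename.

```python
import json
import os
import tempfile


def write_batch_atomically(records, landing_dir, staging_dir, name):
    """Write records as JSON lines in staging_dir, then move the file into landing_dir.

    All names here are hypothetical. landing_dir stands for the directory a
    Spark file source would be watching; staging_dir must be on the same
    file system so that os.replace is a plain rename.
    """
    os.makedirs(landing_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # Write the complete file outside the watched directory first.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")

    # Move the finished file into place in one step, so a reader
    # never sees a half-written file.
    final_path = os.path.join(landing_dir, name)
    os.replace(tmp_path, final_path)
    return final_path
```

Each call produces one complete JSON-lines file in the landing directory, for example `write_batch_atomically(rows, "landing", "staging", "batch-0001.json")`.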
Having said the above, if you are planning to leverage PI Web API, here is what I would try.
You can use the Python client library to set up a WebSocket connection to receive PI System data. This data (in a chosen format) can then be written to the desired TCP socket and port to be consumed by Spark.
Python3 Socket Programming
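A rough sketch of the socket side of this idea, using only the Python standard library, might look like the following. It does not call PI Web API at all: `sample_pi_events` is a hypothetical stand-in for whatever values you receive from PI, and the tag name, host, and port are assumptions. The script listens on a TCP port, waits for one client to connect, and sends one JSON object per line.

```python
import json
import socket
import time


def sample_pi_events():
    """Hypothetical stand-in for values received from PI Web API."""
    for i in range(3):
        yield {"tag": "SINUSOID", "timestamp": time.time(), "value": float(i)}


def serve_events(events, host="127.0.0.1", port=9999, ready=None):
    """Accept one TCP client and send it one JSON object per line.

    ready is an optional callback that receives the bound port number,
    which is useful when port=0 lets the OS choose a free port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind((host, port))
        server.listen(1)
        if ready is not None:
            ready(server.getsockname()[1])

        # Block until a consumer connects, then stream the events to it.
        conn, _addr = server.accept()
        with conn:
            for event in events:
                line = json.dumps(event) + "\n"
                conn.sendall(line.encode("utf-8"))


if __name__ == "__main__":
    serve_events(sample_pi_events())
```

On the Spark side, the socket source would then be pointed at the same host and port, for example with `spark.readStream.format("socket")` and the `host` and `port` options. As the documentation quoted above notes, that source is meant for testing only.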
Again, this is just a suggestion, and there might be cleaner methods to accomplish your goal. I will let the rest of the PI community help you with that.
I am also tagging Marcos Vainer Loeff, who created the client library, for his input.
Thanks a lot for sharing your thoughts, Thyagarajan Ramachandran!
Having some sort of event queue between PI and Apache Spark is probably the standard way to do it. In the presentation it's Kafka, but it could be Azure Event Hubs as well.
I'm not very familiar with PI Integrator, but it seems to me to be just a data transportation tool, not capable of doing more complex calculations (anything beyond simple Avg, Sum, etc.)?
That's why we landed on PI Web API with Python.
Did you try the option suggested in my reply of reading the data via PI Web API and then writing to a socket?