What is the record format within the PI Data Archive?
Can I ask what you are trying to accomplish and why you need to know the structure of the archive files?
I'm trying to figure out the best method for efficiently and inexpensively externalizing PI data into an on-premises database for some advanced analytics purposes. We collect roughly 300,000 PI Points every second into our PI Data Archive via many PI OPC DA interfaces. So a few questions come to mind:
1) How do I get that much data out of PI, persist it into an external datastore, and keep up?
2) Should I try to store it in a SQL Server 2016 database, or should I use something like Cassandra?
3) How many bytes do I need to pull out of PI each second? (That's why I was asking about the data structure.)
4) How best should I add the AF structure to the data as I store it in the external datastore?
5) I have looked at the PI Integrator for Business Analytics (Basic and Advanced). It looks like I can use the Basic edition, which collects data in batches and stores it in an external database, but am I limited to a SQL database, or can it store data in a NoSQL database? Can that integrator keep up with this kind of data rate? Is there any other way for me to get that much data out of PI efficiently and keep up?
6) Before I persist the data in the external data store, I may need to uncompress the data to make it usable for analytics tools. I don't know whether the Basic integrator has that capability.
7) I'm trying to calculate how much compressed, exception-filtered data has to come out of PI and go into this uncompressed, unfiltered external database (see the back-of-envelope sketch after this list).
I don't think the record format of the archive has anything to do with this. But it is an interesting question.
I think we need to take a few steps back and understand whether this is the best approach for your project.
What advanced analytics tool are you using that can keep up with a 300k events/s streaming input?
Do you really need all of those streams in your tool, or can you perform some initial calculations in PI and/or at the source?
You may also want to take a look at the blog post "Data interpolation, is this really a good idea?"
I'm trying to validate an end goal that:
1) Bulk-loads historical data from PI into an Azure data lake and mashes it up with other non-real-time business information
2) Utilizes Azure Machine Learning to analyze the data, test our hypotheses, and find data patterns, then exposes predictive models as services
3) Then flows data from PI to Azure in real time, through Azure IoT Hub, then to Azure Stream Analytics, which would call the ML service to test real-time data against the predictive model. Based on the predictive model, we would likely downsample the 1-second data.
But for now, I just want to get the historical data out of PI and into some kind of database where we can take some baby steps and analyze the data using R.
Depending on how often you want to get the data out of PI - all at once, once a month, once a day, etc. - and whether your model requires raw data or summary calculations, your best options will differ. I would suggest reading through the following previous posts/blogs:
Extracting large event counts from the PI Data Archive
KB01216 - AF SDK Performance: Serial vs. Parallel vs. Bulk
The fastest way to extract large amount of data to CSV files
Mass Export Archive data of Tag subset
I would also suggest going through Holger Amort's blog - he has consistently posted very high-quality content on interacting with PI from R and on other data science subjects.
I agree with Vincent Kaufmann: what are you looking to accomplish? If you are asking about the general archive/record structure, I would point you to the following Live Library documentation:
Retrieving data ...