
    Extracting really huge amounts of data from PI, outside-the-box ideas?

    Roger Palmen

      Hi All,

       

      We all know there are several ways to extract large amounts of data from PI, using piconfig or a custom tool built on the PI AF SDK. (The options are basically listed here: How to write PI Data to files in XML format?)

      But even these have their limitations. I've built some tools for that, but I typically don't get anywhere near 1M events/second for a history extraction. And with millions of tags and several years of data to process, that is too slow. Setting aside the causes of that, I started thinking outside the box.
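      For reference, the kind of AF SDK bulk call I mean looks roughly like the sketch below. It is a minimal example only (not the actual tool), assuming pythonnet and the PI AF Client are installed; the name filter and the output handling are placeholders.

```python
# Minimal sketch of a bulk history read through the AF SDK from Python (pythonnet).
# Assumes the PI AF Client is installed; "unit1.*" and the print() output are placeholders.
import clr
clr.AddReference("OSIsoft.AFSDK")

from OSIsoft.AF.PI import (PIServers, PIPoint, PIPointList,
                           PIPagingConfiguration, PIPageType)
from OSIsoft.AF.Time import AFTimeRange
from OSIsoft.AF.Data import AFBoundaryType

server = PIServers().DefaultPIServer
points = PIPointList(PIPoint.FindPIPoints(server, "unit1.*", None))  # hypothetical name filter

time_range = AFTimeRange("1-jan-2017", "1-feb-2017")
paging = PIPagingConfiguration(PIPageType.TagCount, 1000)  # points per server round-trip

# One bulk call streams back an AFValues collection per point, page by page.
for values in points.RecordedValues(time_range, AFBoundaryType.Inside,
                                    None, False, paging, 0):
    for v in values:
        print(v.PIPoint.Name, v.Timestamp.UtcTime, v.Value)  # replace with the real writer
```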

       

      We currently have a process in place where we restore the PI archives to a separate PI Server and extract the data from that into other files before sending those off to another platform. To maximize performance, the first idea is to remove as many components as possible from the solution, and here that would be PI itself. Why do I restore files and then use two applications, one to read those files (the PI Data Historian) and one to write them to other files (a custom AF SDK application)?

       

      In other words: would it be possible to read the PI archive files directly and transform the data into a different format?

        • Re: Extracting really huge amounts of data from PI, outside-the-box ideas?
          Roger Palmen

          Thinking a bit further: piarchss and piartool both read the data in the archive files, so they would be great candidates for a transcoding option that simply dumps the contents of an archive file into a public format.

          Not sure if OSI wants to go down that road, but it could be one of the fastest ways to plainly pull all data out of an archive file.

          • Re: Extracting really huge amounts of data from PI, outside-the-box ideas?
            vkaufmann

            Hi Roger,

             

            What's the end goal here? You quote a data rate of 1M events/sec, which seems outrageous for any system. Where does this number come from? In my opinion, no application is going to read the archive files faster than the Archive Subsystem does, since everything there is built and optimized for that specific data flow. I don't think there is any performance to be gained by going down the route of a public archive format. The fastest reads are going to be had by getting your data into memory, which can be prohibitive for obvious reasons.

             

            --Vince

              • Re: Extracting really huge amounts of data from PI, outside-the-box ideas?
                Roger Palmen

                Hi Vincent,

                I think 1M ev/s is not an outrageous scenario.

                For a streaming read from a PI Server, we recently worked on a scenario to continuously receive all updates for >1M PI Points, and that worked fine, reaching speeds of over 20M events/minute on my laptop.
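                For context, that streaming side is built around the snapshot data pipe. Stripped down, the pattern looks roughly like the sketch below (the point filter, polling interval and event handling are placeholders, not the actual implementation).

```python
# Stripped-down sketch of the snapshot streaming pattern (AF SDK PIDataPipe via pythonnet).
# The point filter, polling interval and event handling are placeholders.
import time
import clr
clr.AddReference("OSIsoft.AFSDK")

from OSIsoft.AF.PI import PIServers, PIPoint, PIDataPipe
from OSIsoft.AF.Data import AFDataPipeType

server = PIServers().DefaultPIServer
points = PIPoint.FindPIPoints(server, "unit1.*", None)  # hypothetical name filter

pipe = PIDataPipe(AFDataPipeType.Snapshot)
pipe.AddSignups(points)

try:
    while True:
        for evt in pipe.GetUpdateEvents(50000):          # up to 50k events per poll
            val = evt.Value
            print(val.PIPoint.Name, val.Timestamp.UtcTime, val.Value)  # forward downstream here
        time.sleep(1)
finally:
    pipe.Dispose()
```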

                Now, history is a different beast than snapshot, I agree. But still, these numbers are not that far out. If we have 1M PI Points, 10 years of data, and expect on average 1 event per minute, that equates to roughly 5.25M events per PI Point. To extract that amount of data within, let's say, 1 month, you already need to pull about 2M events per second from the archives.
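                (As a quick sanity check of that arithmetic, using the same assumed density of 1 event per minute:)

```python
# Back-of-the-envelope check of the numbers above (assumes 1 event/minute average density).
points = 1_000_000                               # PI Points
events_per_point = 10 * 365.25 * 24 * 60         # 10 years at 1 ev/min  ~= 5.26M events/point
total_events = points * events_per_point         # ~= 5.26e12 events overall
seconds_in_month = 30 * 24 * 3600                # ~= 2.59e6 s
print(total_events / seconds_in_month)           # ~= 2.0M events/second sustained
```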

                Now, we all agree that the main thing we are doing here is shuffling data from A to B, and that it would be very feasible to restore a PI Server of this size from A to B within a few days at most. So that sets the stage...

                 

                I fully agree that the Archive Subsystem should be the most optimized way to extract data from the archive files. The main question then is how to configure / tune a PI Server to do just that single job very well.

                Ideally I'd like some specialized tooling on the backend, but I am aware that OSI may have its reasons for not wanting to do that. But if I had access to the proprietary binary format of the archive files, I would do just that.

                 

                Now back to the more feasible scenario. I already built a tool that segments the data calls to read all data from the PI Points in sequence of time, so that I read e.g. all data for one day, then the next. That allows PI and Windows to pull one specific archive file into memory and read it in the most optimized way. But there's a lot more tuning to do...
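                Conceptually the segmentation is nothing more than the sketch below (the window size, name filter and writer stub are assumptions, not the actual tool); each pass stays within the archive file(s) covering that window.

```python
# Sketch of time-segmented extraction: walk history in fixed windows (e.g. one day)
# across all points, so only the archive(s) covering that window need to be hot.
# Window size, name filter and the writer stub are placeholders.
from datetime import datetime, timedelta
import clr
clr.AddReference("OSIsoft.AFSDK")

from OSIsoft.AF.PI import (PIServers, PIPoint, PIPointList,
                           PIPagingConfiguration, PIPageType)
from OSIsoft.AF.Time import AFTime, AFTimeRange
from OSIsoft.AF.Data import AFBoundaryType

def write_to_output(values):
    # Placeholder writer; replace with the real transform / file output.
    for v in values:
        print(v.PIPoint.Name, v.Timestamp.UtcTime, v.Value)

server = PIServers().DefaultPIServer
points = PIPointList(PIPoint.FindPIPoints(server, "unit1.*", None))  # hypothetical name filter
paging = PIPagingConfiguration(PIPageType.TagCount, 1000)

start, end = datetime(2017, 1, 1), datetime(2017, 2, 1)
window = timedelta(days=1)

cursor = start
while cursor < end:
    nxt = min(cursor + window, end)
    rng = AFTimeRange(AFTime(cursor.isoformat()), AFTime(nxt.isoformat()))
    for values in points.RecordedValues(rng, AFBoundaryType.Inside, None, False, paging, 0):
        write_to_output(values)
    cursor = nxt
```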

              • Re: Extracting really huge amounts of data from PI, outside-the-box ideas?
                jsorrick

                Not sure how you are using the data, but I started playing around with the "Filter Expression" option and adding multiple tags to the filter. The way our site "validated" performance was to pull 1-minute data (which is interpolated) and then filter it in Excel. Huge sheets, and really slow. I started using PI "filter expressions" and "calculated data" instead of doing the integration inside Excel, and it really sped things up for us.
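                For what it's worth, the same idea can also be pushed server-side through the AF SDK; a hedged sketch (the tag name, filter expression and interval below are made-up examples) of a filtered read plus server-side hourly averages instead of filtering in Excel:

```python
# Sketch: let the PI server do the filtering and summarizing instead of Excel.
# The tag name, filter expression and interval are made-up examples.
import clr
clr.AddReference("OSIsoft.AFSDK")

from OSIsoft.AF.PI import PIServers, PIPoint
from OSIsoft.AF.Time import AFTimeRange, AFTimeSpan
from OSIsoft.AF.Data import (AFBoundaryType, AFSummaryTypes,
                             AFCalculationBasis, AFTimestampCalculation)

server = PIServers().DefaultPIServer
pt = PIPoint.FindPIPoint(server, "sinusoid")        # example tag
rng = AFTimeRange("*-7d", "*")

# Recorded values, filtered on the server with a performance-equation style expression.
filtered = pt.RecordedValues(rng, AFBoundaryType.Inside, "'sinusoid' > 50", False, 0)
print(filtered.Count, "events passed the filter")

# Hourly averages calculated on the server ("calculated data").
summaries = pt.Summaries(rng, AFTimeSpan.Parse("1h"), AFSummaryTypes.Average,
                         AFCalculationBasis.TimeWeighted, AFTimestampCalculation.Auto)
for v in summaries[AFSummaryTypes.Average]:
    print(v.Timestamp.UtcTime, v.Value)
```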

                  • Re: Extracting really huge amounts of data from PI, outside-the-box ideas?
                    Roger Palmen

                    I think in Excel I would quickly hit the 2 billion row limit... Excel is just not an option due to the volume and the required transaction speed.

                      • Re: Extracting really huge amounts of data from PI, outside-the-box ideas?
                        gregor

                        Hi Roger,

                         

                        I have been watching your question since you asked it, and I feel you are hesitant to answer Vincent's question about the use case. We consider the PI Data Archive the system of record for time series data: we treat data loss as the worst case, aim to avoid such situations, and have mechanisms in place to recover data, e.g. from corrupted event queues. Technical Support cases dealing with data loss are treated as priority cases.

                         

                        Still, the question must be asked: for what purpose do you want to extract the data?

                         

                        I second Vincent's assessment that the PI Archive Subsystem is likely the most efficient way to retrieve compressed data, but I believe that cracking the 1M events/sec limit should be possible. I haven't tried it, though.

                         

                        I know you haven't asked, but I feel it is worth mentioning that reverse engineering is not a valid approach.

                         

                        You mentioned that you use a separate machine for data extraction. I hope it is not a virtual machine. Solid State Disks (SSDs) offer great performance. They may not be the ideal choice in production, but for your purpose they are great. If you combine multiple SSDs into a RAID 0, you increase read performance by almost the number of SSDs in the RAID. I also suggest using another physical RAID 0 for the output files and a separate disk for the operating system, which can easily be a rotating disk. RAM should be large enough to easily allow loading a complete archive file into memory.

                         

                        The next thing I would consider is the period covered by each particular archive file. You don't want the PI Archive Subsystem to load archives from disk into RAM multiple times, but ideally only once. I suggest going point by point and, without knowing the data density, breaking the period covered by a particular archive into smaller bites.

                         

                        Parallelism may help, depending on the number of cores the CPU has. The challenge here is to find the optimum number of parallel threads. As soon as PI Archive Subsystem threads or disk operations have to wait for each other to complete, you have passed the optimum and should reduce the number of parallel threads again. Performance Counters can help to identify system limits.
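                        To make that experiment concrete, one way to cap the number of parallel reads from the client side is sketched below (the worker count, windows and name filter are assumptions to tune while watching the Performance Counters; it is not a recommended configuration).

```python
# Sketch of bounded client-side parallelism: a fixed pool of worker threads, each issuing
# one time-window read at a time. The worker count is the knob to tune while watching the
# server's Performance Counters; the name filter, windows and page size are placeholders.
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
import clr
clr.AddReference("OSIsoft.AFSDK")

from OSIsoft.AF.PI import (PIServers, PIPoint, PIPointList,
                           PIPagingConfiguration, PIPageType)
from OSIsoft.AF.Time import AFTime, AFTimeRange
from OSIsoft.AF.Data import AFBoundaryType

server = PIServers().DefaultPIServer
points = PIPointList(PIPoint.FindPIPoints(server, "unit1.*", None))  # hypothetical name filter

def read_window(window):
    start, end = window
    rng = AFTimeRange(AFTime(start.isoformat()), AFTime(end.isoformat()))
    paging = PIPagingConfiguration(PIPageType.TagCount, 1000)
    count = 0
    for values in points.RecordedValues(rng, AFBoundaryType.Inside, None, False, paging, 0):
        count += values.Count
    return count

# One-day windows over one month; 4 parallel readers as an arbitrary starting point.
day0 = datetime(2017, 1, 1)
windows = [(day0 + timedelta(days=i), day0 + timedelta(days=i + 1)) for i in range(31)]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(read_window, windows))
print("extracted", total, "events")
```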

                         

                        For sure it is too late to change how data has been processed into the archives, but it is worth mentioning the importance of having compression enabled. Even compdevpercent = 0 usually reduces the number of events that make it to the archive files dramatically, without losing precision! By reducing the archive size by a factor of 2, one also reduces the time spent on data extraction by the same factor. The point type also has a huge impact, e.g. whenever possible Digital should be preferred over String.
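                        (Checking how an existing server is set up in that respect is quick; the small sketch below reads the standard compression point attributes, with a made-up name filter.)

```python
# Sketch: quick survey of compression settings and point types across a set of tags.
# Attribute names are the standard PI point attributes; the name filter is an example.
import clr
clr.AddReference("OSIsoft.AFSDK")

from OSIsoft.AF.PI import PIServers, PIPoint

server = PIServers().DefaultPIServer

for pt in PIPoint.FindPIPoints(server, "unit1.*", None):     # hypothetical name filter
    pt.LoadAttributes("compressing", "compdev", "compdevpercent", "pointtype")
    print(pt.Name,
          "compressing =", pt.GetAttribute("compressing"),
          "compdevpercent =", pt.GetAttribute("compdevpercent"),
          "type =", pt.GetAttribute("pointtype"))
```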

                          • Re: Extracting really huge amounts of data from PI, outside-the-box ideas?
                            Roger Palmen

                            Hi Gregor,

                             

                            I appreciate you chiming in.

                            Unfortunately I can't disclose too much about the case, and even I don't have all the details on its background. Basically, the customer wants to push all data for certain assets to a cloud platform for deep learning, and for that they say they need all the raw data. I'm not too interested in the justification of that requirement; I chose to limit my involvement to the PI side of things.

                             

                            With regard to the extraction, my current approach is quite simple:

                            • Stick to the AF SDK and design around the storage principles of PI (just as you indicated). In practice, this means avoiding disk I/O as much as possible and retrieving data for all PI tags from one archive file at a time.
                            • There is no theoretical limit on what the PI Server can do with regard to throughput.
                            • Any limitation should therefore be related to one or more bottlenecks in the infrastructure used, or in its configuration.
                            • So if testing reveals a limit on the throughput, you must be able to show where the underlying bottleneck is, and address that.

                             

                            I too agree that reverse engineering is not a valid approach, but I just wanted to see what response I got on that. You never know what pops up!

                            The statement on compression is true, but sometimes you just can't explain that by losing data you don't lose information.

                             

                            Currently this topic is low on my activities list, but I still plan to follow up in the future.