7 Replies Latest reply on Apr 10, 2018 2:34 PM by Lonnie Bowling

    export big data from pi server

    ElsaHuang

      Hi gurus,

       

      I hope you can spare a few minutes to read my question and help us out, please!

       

      Our company has run into a tough problem. We use PI Server to track and store real-time latitude, longitude, speed, height, and direction-angle data sent by GPS devices installed on automobiles.

       

      We have nearly 28.6 TB of real-time data (40 million points/tags, PI version: 2008) stored across 3 groups of PI servers. We need to export it from PI Server in .txt format, do some data cleaning, and then import it into HBase. Otherwise, since the 3rd group is almost full, we will have to add a 4th group of PI servers to support the explosive growth in real-time data. That is not a smart way to work, but we do not currently have a better solution. There were some problems in the original database design that resulted in redundant data; in fact, a single group of PI servers should have been enough.

       

      Our architect estimates that exporting the data will take 4-5 months using a script that reads 100kb/sec per sub-process, with only a 60% success rate. More generally, exporting all of the historical data is expected to take 1-2 years, and it doesn't make sense to spend that long.
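As a rough sanity check on these figures, the sketch below works out how long the export takes for a given number of parallel readers. It assumes that "100kb/sec" means 100 kilobytes per second, that 1 TB is 10^12 bytes, that the volume to read equals the 28.6 TB archive size, and that failed reads are retried so that only 60% of the throughput is useful. None of these assumptions is confirmed in the thread, and the function names are purely illustrative.

```python
import math

# Figures quoted in the post; the interpretation of each is an assumption.
TOTAL_BYTES = 28.6e12        # 28.6 TB, taking 1 TB = 10**12 bytes
RATE_PER_PROCESS = 100e3     # "100kb/sec" read as 100 kilobytes per second
SUCCESS_RATE = 0.60          # failed reads are assumed to be retried
SECONDS_PER_DAY = 86_400


def export_days(processes: int) -> float:
    """Days needed to export everything with `processes` parallel readers."""
    effective_rate = processes * RATE_PER_PROCESS * SUCCESS_RATE
    return TOTAL_BYTES / effective_rate / SECONDS_PER_DAY


def processes_needed(target_days: float) -> int:
    """Smallest number of parallel readers that finishes within target_days."""
    return math.ceil(export_days(1) / target_days)


if __name__ == "__main__":
    print(f"1 reader: {export_days(1):.0f} days")
    print(f"readers needed for 150 days: {processes_needed(150)}")
```

Under these assumptions a single reader would need many years, so the estimate is dominated by how many readers the PI servers can sustain in parallel, not by the script itself.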

       

      I believe there must be a better way to achieve this goal. Could anyone help us with this? Thanks a lot!

        • Re: export big data from pi server
          André Åsheim

          Hi,

          Exporting 28 TB of PI data into a txt file is a really bad idea.

          The PI archives are optimized in size to store large amounts of real-time data; when you export this data into a txt file, you will get much larger files.

          I'd strongly recommend developing/scripting a direct data export to the end system. If a txt approach is the only viable option:

          1. Upgrade the PI servers for better performance

          2. Write a PowerShell script to do the actual data export.

            • Re: export big data from pi server
              ElsaHuang

              Thanks, André, for your answer.

              We would first like to try any built-in command that PI Server provides for data export; if none supports this, we are already on our way to writing a script.

                • Re: export big data from pi server
                  tramachandran

                  Hi Elsa, your version of the PI Server (2008) is too old to make use of modern libraries like AF SDK, which can significantly improve data retrieval. I would recommend upgrading to at least version 3.4.380+.

                  If your scripts use PI API or piconfig, you might want to run them directly on the server for better performance and to avoid overwhelming the network. Note that retrieving a large number of values may result in higher load/resource usage on the server, slowing down overall performance.

                  • Re: export big data from pi server
                    Lonnie Bowling

                    Hi Elsa,

                     

                    I think with the amount of data you are working with, you should look at a few things:

                     

                    #1 What is the fastest way to get data out of PI - quick answer, use the AF SDK + C#

                    #2 How can you scale this operation, such as:

                         Using batch runs, with concurrent threads, running on one or more servers (this can then scale as needed)

                         You will hit a bottleneck when scaling. If the PI System hardware is the bottleneck and the export is taking too long, you may need to expand (i.e. add nodes to a collective)

                         Have your PI data archives on SSDs. It improves the performance of data reads dramatically. This might be expensive given the amount of data you have, but even cheap SSDs might be worth a test

                    #3 Can you do some data cleanup in the PI system before exporting (moving more data than you need takes more time, and the clean-up work has to happen somewhere)

                    #4 Do you have to move all the data? Starting out with a subset, or making sure there is no way to do what you need to do within PI, should be considered; a full copy should be the last option

                    #5 Avoid any copying of data to files that creates a write to a hard drive (or SSD). Do a direct import to HBase: read data from PI into memory, clean up the data while it is in memory, then write the data to HBase

                    #6 If this is not just a one-time thing, but on-going, then I would consider creating a master database (SQL would be a good fit) that can be used to schedule batches, spin off jobs, and track status
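The batch-and-thread idea in #2 can be sketched as follows. This is only an outline under stated assumptions: the split into "tag groups" and time windows is a hypothetical partitioning, and `export_batch` is a placeholder for whatever code actually reads one slice from PI and writes it to the target system. Retries are included because the original post reports that a fair share of reads fail.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime, timedelta
from typing import Callable, Iterable, Iterator, List, NamedTuple


class Batch(NamedTuple):
    tag_group: int     # index of a group of tags (hypothetical partitioning)
    start: datetime    # inclusive start of the time window
    end: datetime      # exclusive end of the time window


def make_batches(n_tag_groups: int, start: datetime, end: datetime,
                 window: timedelta) -> Iterator[Batch]:
    """Split the export into (tag group, time window) units of work."""
    for group in range(n_tag_groups):
        t = start
        while t < end:
            yield Batch(group, t, min(t + window, end))
            t += window


def run_batches(batches: Iterable[Batch],
                export_batch: Callable[[Batch], None],
                workers: int = 8, max_attempts: int = 3) -> List[Batch]:
    """Run export_batch over all batches; return the ones that still failed."""

    def attempt(batch: Batch) -> bool:
        for _ in range(max_attempts):
            try:
                export_batch(batch)
                return True
            except Exception:
                continue  # retry; a real job would log the error and back off
        return False

    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(attempt, b): b for b in batches}
        for future in as_completed(futures):
            if not future.result():
                failed.append(futures[future])
    return failed
```

The same batch list can be divided among several machines, and the returned list of failed batches can be fed back in for a later run.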

                     

                    As the amount of data increases, simple things can become complex, but it is all possible!

                     

                    Hope this helps,

                     

                    Lonnie

                      • Re: export big data from pi server
                        ElsaHuang

                        Hi Lonnie,

                         

                        Greetings! Thanks for your detailed steps and clear explanation. Really appreciated!

                         

                        #1 What is the fastest way to get data out of PI - quick answer, use the AF SDK + C#
                        As per Thyagarajan's comments, AF SDK would become available by upgrading to the latest version. However, we won't take this risk if the upgrade could cause unpredictable loss. We have only one real-time database running on the server and no backup of the data, so it is in quite a vulnerable state.

                         

                        We are planning to move away from PI Server because it doesn't have enough technical support/consultants in the Chinese market. We also found that leading Chinese automotive internet service providers do not use PI servers to store the massive real-time data generated by GPS devices; the usage scenario should have been considered when selecting the database at the very beginning. You could think of our business model as similar to Uber or Didi (the Chinese Uber), but with the GPS devices installed on the vehicle instead of on a mobile phone.

                         

                         

                        #2 How can you scale this operation, such as:

                             Using batch runs, with concurrent threads, running on one or more servers (this can then scale as needed)

                             You will hit a bottleneck when scaling. If the PI System hardware is the bottleneck and the export is taking too long, you may need to expand (i.e. add nodes to a collective)

                             Have your PI data archives on SSDs. It improves the performance of data reads dramatically. This might be expensive given the amount of data you have, but even cheap SSDs might be worth a test

                        “Using batch runs with concurrent threads running on a bunch of servers”: our architect already thought of this and planned to do it this way. However, as I noted in my original question, it is estimated to take 4-5 months or even 1 year, and I believe scaling out would also cost a lot. We have already hit the bottleneck: we keep adding hardware as the number of GPS devices grows, which is not economical, and the current overall architecture has to be refactored because it cannot satisfy the growing business requirements.

                         

                        #3 Can you do some data cleanup in the PI system before exporting (moving more data than you need takes more time, and the clean-up work has to happen somewhere)

                        No, I believe we cannot do this within the PI system. This real-time data is very important business data, and we have only this one copy running on production servers with no backup; we cannot bear such a great loss.

                         

                        #4 Do you have to move all the data? Starting out with a subset, or making sure there is no way to do what you need to do within PI, should be considered; a full copy should be the last option

                        We are asking our CEO whether we really need to move all the data. We already set a cutover date of April 1st, 2018: real-time data after this date has been flowing into HBase, but we still need to retrieve the data from before April 1st, 2018 from the PI servers.

                         

                        #5 Avoid any copying of data to files that creates a write to a hard drive (or SSD). Do a direct import to HBase: read data from PI into memory, clean up the data while it is in memory, then write the data to HBase

                        This could be considered as an alternative, but writing data directly to HBase should also proceed in batches:

                        batch read data from PI Server, and batch import into HBase.
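A minimal sketch of that read, clean, and batch-write flow, kept entirely in memory, might look like the code below. Everything here is an assumption for illustration: the `GpsPoint` fields, the cleaning rules (dropping out-of-range coordinates and consecutive exact duplicates), and the `write_batch` callback, which stands in for whatever HBase client call is actually used. The input iterable stands in for the PI read.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, NamedTuple


class GpsPoint(NamedTuple):
    vehicle_id: str
    timestamp: int   # seconds since epoch
    lat: float
    lon: float


def clean(points: Iterable[GpsPoint]) -> Iterator[GpsPoint]:
    """Drop out-of-range coordinates and consecutive exact duplicates."""
    previous = None
    for p in points:
        if not (-90.0 <= p.lat <= 90.0 and -180.0 <= p.lon <= 180.0):
            continue
        if p == previous:
            continue
        previous = p
        yield p


def chunked(items: Iterable[GpsPoint], size: int) -> Iterator[List[GpsPoint]]:
    """Group a stream into lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def transfer(read_points: Iterable[GpsPoint],
             write_batch: Callable[[List[GpsPoint]], None],
             batch_size: int = 1000) -> int:
    """Stream points through cleaning and write them in batches.

    Returns the number of points written. Nothing is staged on disk.
    """
    written = 0
    for chunk in chunked(clean(read_points), batch_size):
        write_batch(chunk)
        written += len(chunk)
    return written
```

Because every stage is a generator, only one batch is held in memory at a time, regardless of how large the source range is.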

                         

                        #6 If this is not just a one-time thing, but on-going, then I would consider creating a master database (SQL would be a good fit) that can be used to schedule batches, spin off jobs, and track status

                        We have used SQL Server alongside PI Server for daily operations until now. But at the same time, the architect and the big data engineer are building another architecture running on Linux, where SQL Server can be replaced by MySQL or RDS (provided by Aliyun, a leading Chinese cloud service provider).
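Whichever SQL engine ends up being used, the job-tracking idea from #6 could be sketched roughly as below. The example uses Python's built-in `sqlite3` module purely as a stand-in for SQL Server or MySQL. The table layout, column names, and helper functions are all hypothetical, and `claim_next` assumes a single scheduler process (it is not safe for concurrent claimers as written).

```python
import sqlite3
from typing import Dict, Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS export_jobs (
    id        INTEGER PRIMARY KEY,
    tag_group INTEGER NOT NULL,
    start_ts  TEXT    NOT NULL,
    end_ts    TEXT    NOT NULL,
    status    TEXT    NOT NULL DEFAULT 'pending',
    attempts  INTEGER NOT NULL DEFAULT 0
)
"""


def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the tracking database and its table."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_job(conn: sqlite3.Connection, tag_group: int,
            start_ts: str, end_ts: str) -> int:
    """Register one export batch; returns its job id."""
    cur = conn.execute(
        "INSERT INTO export_jobs (tag_group, start_ts, end_ts) VALUES (?, ?, ?)",
        (tag_group, start_ts, end_ts),
    )
    conn.commit()
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[int]:
    """Mark the oldest pending job as running and return its id, if any."""
    row = conn.execute(
        "SELECT id FROM export_jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE export_jobs SET status = 'running', attempts = attempts + 1 "
        "WHERE id = ?",
        (row[0],),
    )
    conn.commit()
    return row[0]


def finish(conn: sqlite3.Connection, job_id: int, ok: bool) -> None:
    """Record the outcome; failed jobs go back to 'pending' for a retry."""
    conn.execute(
        "UPDATE export_jobs SET status = ? WHERE id = ?",
        ("done" if ok else "pending", job_id),
    )
    conn.commit()


def summary(conn: sqlite3.Connection) -> Dict[str, int]:
    """Count jobs per status, e.g. to monitor overall progress."""
    return dict(conn.execute(
        "SELECT status, COUNT(*) FROM export_jobs GROUP BY status"
    ))
```

A table like this makes the export resumable: after a crash or a failed batch, the scheduler only needs to pick up whatever is not yet marked as done.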

                         

                        By the way, our architect finally asked whether we can use the AFExport utility to do this data export, and how to work with it. Can someone offer any ideas?

                         

                        Thanks, all, for your attention on this!

                          • Re: export big data from pi server
                            Lonnie Bowling

                            Hi Elsa,

                             

                            I have not used the AFExport tool, but it just looks like a way to export your AF Database to XML as a backup, so I don't think that is going to help you. So are you still using PI as your production system, sending real-time data to it? If so, I think you are going to have to come up with a migration plan, where moving data and cutting over happen in an orderly manner. If your PI system is fragile (most likely because it is undersized and way behind on updates), then you are going to really challenge it when you start a large export job, regardless of how you actually do the export. It is honestly way too big of a project to discuss in a meaningful way in this forum.

                             

                            Anyway, back to the basic question: how to get data out of your PI system as efficiently as possible. Based on your old version of PI and the inability to upgrade, I think the PI SDK is your best option. It is the lowest-level, fastest possible way to extract data from PI. I almost feel like you should stand up an offline PI Server, do a backup/restore from your production system, copy over a subset of data archives, then run your export scripts (using the PI SDK) to copy the data to HBase. But not knowing your system architecture, I'm just throwing out guesses.

                             

                            You have a very challenging problem, I wish you the best of luck!

                             

                            Lonnie