14 Replies Latest reply on Aug 6, 2018 9:09 AM by Roger Palmen

    Best way to compare 25 years data of 4k around tags

    Paurav Joshi

      Hello folks,

       

New, simpler question:

I already have two PI Data Archives: the old server DA O and the new server DA N. Data has already been copied from DA O to DA N, and I am trying to find a lightweight approach (in terms of memory and resource consumption) to confirm that DA N has all the data of DA O, so that DA O can be decommissioned. By data I mean roughly 25 years of history for around 4,000 tags.

       

Original, more complex question:

We have a case with two PI Data Archives: an old DA that is going to be decommissioned, and a new one that will be used for future reference. They hold 25 years of data across around 5,000 tags, and we want to check that both PI DAs have almost the same data for the roughly 4,000 important tags (so we are talking about lots and lots of data). I'm exploring the best methods to do this comparison with an accuracy of >= 97%.

       

One simple solution is to use AF SDK bulk queries, one per year-long time period, and compare the results, but as far as I can tell this could become a nightmare on production servers.

       

Have you faced this kind of question before? Any suggestions are welcome. Tagging Rhys Kirk, Roger Palmen, Ernst Amort and all others.

       

      Regards,

      Paurav Joshi

       

      Edit:

P.S.: I'm considering the use of statistical comparisons such as the F-test.
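For reference, a minimal sketch of what such an F-test could look like, using NumPy and SciPy on two value samples pulled from the two servers (the sample arrays below are synthetic placeholders; how the values are fetched is left out). Keep in mind an F-test only compares variances, so it can flag gross differences between the two histories but cannot prove they are identical.

```python
import numpy as np
from scipy import stats

def f_test_equal_variance(a, b):
    """Two-sided F-test for equality of variances of two samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    f = a.var(ddof=1) / b.var(ddof=1)        # F statistic
    dfn, dfd = len(a) - 1, len(b) - 1        # degrees of freedom
    # Two-sided p-value: double the smaller tail probability
    p = 2.0 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, min(p, 1.0)

# Synthetic stand-ins for values sampled from DA O and DA N
rng = np.random.default_rng(42)
sample_old = rng.normal(50.0, 2.0, size=1000)
sample_new = rng.normal(50.0, 2.0, size=1000)
f_stat, p_value = f_test_equal_variance(sample_old, sample_new)
```

A large p-value here only means the variances are statistically indistinguishable; a low one would be a cheap signal that a tag or period deserves a closer, event-level look.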

        • Re: Best way to compare 25 years data of 4k around tags
          Rick Davin

I'm guessing this is the post you wanted unmarked as answered. Even as a moderator, I can't see where to do that, but that may be because there were 0 replies. This reply is more than a courtesy: it's also being entered to see if I can change anything.

            • Re: Best way to compare 25 years data of 4k around tags
              gregor

              Hi Paurav & Rick,

               

There is intentionally no direct way to remove the "Assumed answered" status, but this can be accomplished by clicking [Correct Answer] and then immediately [Unmark as Correct]. Of course, there needs to be at least one reply for that.

               

This reply does not address your question on comparing history. Let's start the discussion with some questions to clarify what you would like to accomplish.

               

              Are the 2 PI Data Archive nodes members of a PI Collective?

What value do you gain from the information that both installations are above or below 97% equal? It does not tell you which data is more trustworthy, does it?

              To compare archived events, you can just retrieve RecordedValues from both nodes and compare them. What additional information do you get from the F-Test?
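To illustrate, the client-side part of such a comparison could look like the sketch below. The (timestamp, value) tuples are a generic stand-in for whatever RecordedValues returns; the fetching itself is deliberately left out.

```python
def diff_events(events_old, events_new, tol=1e-6):
    """Compare two lists of (timestamp, value) events from the two servers.

    Returns a list of (timestamp, old_value, new_value) tuples for every
    timestamp that is missing on one side or whose values differ by more
    than tol.
    """
    old = dict(events_old)
    new = dict(events_new)
    mismatches = []
    for ts in sorted(set(old) | set(new)):
        v_old, v_new = old.get(ts), new.get(ts)
        if v_old is None or v_new is None or abs(v_old - v_new) > tol:
            mismatches.append((ts, v_old, v_new))
    return mismatches

# Example with synthetic events (epoch seconds, value)
a = [(0, 1.0), (60, 2.0), (120, 3.0)]
b = [(0, 1.0), (60, 2.5)]
result = diff_events(a, b)
```

A tolerance is included because float round-trips through different archive versions may not be bit-identical even when the data is effectively the same.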

              1 of 1 people found this helpful
                • Re: Best way to compare 25 years data of 4k around tags
                  Paurav Joshi

                  Hey Gregor ,

                   

                  Thanks for the reply.

                   

First, the assumed-answer issue is resolved. One suggestion: when a question has no replies, the "Assumed answered" link should be disabled as well.

                   

                  On the main question:

                   

                  Are the 2 PI Data Archive nodes members of a PI Collective?

                   

They are not a collective; they are two separate standalone PI Data Archives.

                   

                  What's the value you gain with the information that both installation are above or below 97% equal? This does not tell you which data is more trustworthy, is it?

                   

Both PI DAs should hold the same PI data. One is an older server that is going to be decommissioned; the other is a newer server that will be used for future reference. I want to confirm that the PI data on both DAs is almost the same, and that's the reason for the 97% threshold.

                   

                  To compare archived events, you can just retrieve RecordedValues from both nodes and compare them. What additional information do you get from the F-Test?

                   

I want to perform this operation with as little resource usage on the DA side as possible, and I'm assuming an F-test can provide more benefit here than a full AF SDK comparison. This is my hypothesis, which I'm sharing with you so that I can get supporting or contradicting views on it.

                   

                  Thanks,

                  Paurav Joshi

                • Re: Best way to compare 25 years data of 4k around tags
                  Rick Davin

                  Hi Paurav,

                   

I think you have the experience with the AF SDK to understand the steps involved in retrieving data, even massive amounts of it.  To me, the mechanics of data fetching aren't the interesting part.  Rather, as Gregor wants to know: are these two PI DAs members of the same collective or not?  If they are different standalone servers, or even from different collectives, there are a lot of teeny tiny things that could differ between the PI DAs over the years: compression specs per tag, archive sizes and fill rates, etc.  You could try reading recorded values and comparing, but there is a large fuzzy area to that.  For example, one PI DA may return an ArcOffline, which is not a real recorded value, and you certainly want to avoid them.  And if scan rates and compression were ever different, the 2 sets of values for a given time would also be different.  Some may say this never happens, but realistically, over a 25-year span with different PI admins, the odds are better than not that something was out of sync somewhere along the way.
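As a sketch of that filtering concern: before comparing, one could drop any event whose value is not numeric, since digital system states such as ArcOffline or Shutdown are not real recorded values. The state names in the comment below are illustrative examples, not an exhaustive list.

```python
def real_values(events):
    """Keep only events carrying a numeric value.

    Events whose value is a string (e.g. digital system states such as
    'ArcOffline' or 'Shutdown') are dropped before comparison.
    """
    return [(ts, v) for ts, v in events
            if isinstance(v, (int, float)) and not isinstance(v, bool)]

events = [(0, 1.5), (60, "ArcOffline"), (120, 2.5), (180, "Shutdown")]
clean = real_values(events)
```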

                    • Re: Best way to compare 25 years data of 4k around tags
                      Paurav Joshi

                      Hey Rick ,

                       

I should have phrased my query better:

I have an old DA O, which I am going to decommission, and a new DA N, which another team has supposedly filled with all the data of O via archive merging. I want to confirm that both DAs have almost the same data.

                       

                      Why am I looking for the alternative than using AF SDK?

My hypothesis is that if I pull 25 years' worth of data using the AF SDK and compare it value by value, it is going to be a time- and resource-consuming exercise.

So I'm looking at other methods that can somehow reduce the amount of data pulled from both DAs and the comparison work.

                       

                      For example, one PI DA may return an ArcOffline, which is not a real recorded value, and you certainly want to avoid them.  And if scan rates and compression were ever different, the 2 sets of values for a given time would also be different.

                      I think this won't be the case, because data filling has been done via archive merging.

                       

                      I think you have the experience with AF SDK to understand the steps involved to retrieve data, even massive amounts of data.

To get a better idea: which method(s) and which AF SDK version would you propose for this operation?

                       

                      Thanks,

                      Paurav Joshi

                        • Re: Best way to compare 25 years data of 4k around tags
                          André Åsheim

                          Hi,

OSIsoft doesn't guarantee the data transfer, but if all the prep work has been done correctly we rarely check all the data (we take key tags and spot-check them).

If you force OOO* on all the data, no compression will be applied on the destination server. That way you are almost guaranteed that all the data is identical. A second option is to turn off compression on the destination server during the history recovery.

To my knowledge, a PI to PI interface in history-recovery-only mode doesn't skip scans, etc. (which can be an issue on live interfaces), so in HR-only mode you are close to guaranteed the same data on the destination (as long as compression is set correctly or the data is OOO).

                           

Is there any reason why you didn't just run a backup and restore? That process is usually much faster than running history recovery via PI to PI.

                           

                          *OOO = Out of order data (data timestamped before the snapshot value).

                          • Re: Best way to compare 25 years data of 4k around tags
                            gregor

                            Hi Paurav,

                             

What would you like to accomplish? That question remains unanswered for me. Maybe we can get there with the help of some additional questions.

                             

                            What's the worst case / best case outcome you expect from this comparison?

                            Even more important, what's the decision / action depending on the outcome?

                            Are you trying to evaluate which server has the "better" data? What if you find DA O is better for certain tags or periods but DA N is better for other tags or periods?

                             

I concur that PI-to-PI is likely not a good tool for a comparison. You would have to create redundant tags on the destination server, and the process of inserting the redundant history would be pretty slow because PI Archive Subsystem would have to grow the archive files, as they are pretty likely not just ~50% full. Once you finished the insertion, you would have everything in one server, but you still wouldn't know which of each redundant pair of points has the "better" data.

                             

The F-test is an algorithm, but you still need to load the data into memory. Using interpolated values, as suggested by Holger, may help to reduce the amount of data you apply whatever algorithm to, but on both servers PI Archive Subsystem still has to load all events into memory to calculate the interpolated data.

                             

So these servers exist as independent installations. I assume a PI Backup from DA O was restored to DA N when DA N was set up. This means you don't have to compare the complete history, but only the period from when DA N went live with the backup from DA O until today. The servers have likely been maintained independently since then, which means Point IDs will likely not match between the installations. Other configuration may differ too, as mentioned by Rick. If new Digital State Sets have been added independently, the state references in the archives could differ, and the string representation of a state may even be defined in a different State Set. There could be typos that cause a comparison of values to fail because a single character is missing. So there are a lot of potential problems you may face, and you really need to clearly define a plan of action that considers all the what-if cases.

                             

                            I really would start by comparing configurations between both servers, things like:

                             

                            • Before you start, so first and foremost, make sure you have good full backups (PI Backup) of both servers.
                            • How many archives exist, what periods do they cover and what is their size?
• Compare the point tables for which points exist, what their names are, and how they map to Point IDs. Creating a Point ID conversion table will be key.
• Eventually compare the MDB and look out for differences. Is one or both servers receiving data from PI ACE modules, or is PI Batch Subsystem or PI Batch Generator used?
                            • Build a Digital State Set conversion table and make sure states can be translated uniquely in both directions.
                            • Verify if custom Point Classes exist on one of those servers.
                            • Review things like tuning parameters as they may have an impact / you may prefer one over the other for the installation you like to proceed with (DA N).
                            • Make sure both servers are on the same version for file base compatibility.
                            • Many more ..

                             

                            You should do these things before even retrieving any time series data for a comparison to ensure whatever methods you use can be successful.

                             

If DA O is supposed to be taken out of service in favor of DA N, but there is uncertainty about whether DA O may have better data for certain periods and points, it may make sense to make DA O unavailable to normal users but keep it live for a year or so. If after a year there were not many cases that required checking against DA O, take it out of service but keep the backup. DA O should certainly be disconnected from data collection to clearly define the cutoff (EoL).

                            1 of 1 people found this helpful
                              • Re: Best way to compare 25 years data of 4k around tags
                                Paurav Joshi

                                Hey Gregor ,

                                 

                                Really appreciate your detailed reply. I think I have made this question complex, let me try to make it simple:

                                 

I already have two PI Data Archives: the old server DA O and the new server DA N. Data has already been copied from DA O to DA N, and I am trying to find a lightweight approach (in terms of memory and resource consumption) to confirm that DA N has all the data of DA O, so that DA O can be decommissioned. By data I mean roughly 25 years of history for around 4,000 tags.

                                 

I don't have the option of keeping it running for a year, but keeping a backup is a viable alternative.

                                 

                                Thanks,

                                Paurav Joshi

                        • Re: Best way to compare 25 years data of 4k around tags
                          ernstamort

                          Hi,

                           

                          I agree a sampling strategy would be the smart way to go.

                           

I would retrieve interpolated values and check whether the difference is smaller than the calculation precision. That makes the code easier, and you get homogeneous sampling across the entire time span. You can also increase the resolution as you drill down.
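A recursive drill-down over interpolated samples could be sketched like this. Here `get_old` and `get_new` are hypothetical callables standing in for an interpolated-values query against each server; only the control flow is meant to be illustrative.

```python
def drill_down(get_old, get_new, start, end, n=24, tol=1e-3, min_span=60.0):
    """Compare n evenly spaced interpolated samples from two servers over
    [start, end]; recurse into intervals that disagree, and return the
    smallest spans (down to min_span seconds) that still mismatch."""
    bad = []
    step = (end - start) / n
    for i, (v_old, v_new) in enumerate(zip(get_old(start, end, n),
                                           get_new(start, end, n))):
        if abs(v_old - v_new) > tol:
            s, e = start + i * step, start + (i + 1) * step
            if e - s <= min_span:
                bad.append((s, e))           # finest resolution reached
            else:
                bad.extend(drill_down(get_old, get_new, s, e,
                                      n, tol, min_span))
    return bad
```

The appeal of this shape is that matching spans are ruled out with a handful of samples each, and the servers only get hammered where the histories actually disagree.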

                            • Re: Best way to compare 25 years data of 4k around tags
                              cmanhard

Consider combining a sampling strategy with interval summaries. For example, you can count events per day per tag with a single call (PIPoint.AFData.Summaries or PIPointList.ListData.Summaries) that returns the number of events stored per day, and then zero in on areas where the counts do not match. Spot-checking event-weighted averages in that same call could further increase your confidence that the data matches. The summary calls keep the data on the server, saving serialization and memory on the client, and event-weighted calls are lighter on the PI Data Archive's CPU than time-weighted ones.
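The client-side part of that idea could be sketched as below. In a real implementation the per-day counts would come back from a single Summaries call per tag rather than be computed from raw events as here; that is the whole point of keeping the work on the server.

```python
from collections import Counter

SECONDS_PER_DAY = 86400

def daily_counts(events):
    """events: iterable of (epoch_seconds, value); returns {day_index: count}."""
    return Counter(int(ts) // SECONDS_PER_DAY for ts, _ in events)

def mismatched_days(events_old, events_new):
    """Day indices whose per-day event counts differ between the servers."""
    c_old, c_new = daily_counts(events_old), daily_counts(events_new)
    return sorted(d for d in set(c_old) | set(c_new) if c_old[d] != c_new[d])

# Synthetic example: day 0 has an extra event on the old server
a = [(0, 1.0), (100, 2.0), (86400, 3.0)]
b = [(0, 1.0), (86400, 3.0)]
bad_days = mismatched_days(a, b)
```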

                              1 of 1 people found this helpful
                            • Re: Best way to compare 25 years data of 4k around tags
                              Roger Palmen

I think we all agree that the only way to be 100% sure is to do a full comparison of all the data. If that is not feasible (and I'm not really sure why it wouldn't be), how many events are we talking about? At 1 event per minute per tag, 25 years of 4k tags is on the order of 50 billion events, which sounds feasible if you really want to do it.

                               

Any other approach is about the degree of certainty you need to achieve. What is acceptable? If you randomly compare 0.1% of the values and they have a 100% match, is that certain enough?

Did you transfer the data in a particular way that would produce characteristic failure patterns you might want to check for (e.g. a missing month or a missing day)? Then calculating statistics per day could be sufficient to cover those risks.

So I would write down the risks you see, assess their probability and impact, and then decide and agree on a test that gives acceptable proof that the data is correct.
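To put numbers on "certain enough": under the simplifying assumption that defects are independent and spread uniformly through the history, the chance that random spot checks catch at least one of them is easy to compute.

```python
import math

def detection_probability(defect_rate, n_samples):
    """P(at least one defect is found) when sampling n_samples events at
    random, assuming a fraction defect_rate of all events is wrong and
    defects are spread independently through the history."""
    return 1.0 - (1.0 - defect_rate) ** n_samples

def samples_needed(defect_rate, confidence=0.99):
    """Smallest sample size so that detection_probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - defect_rate))

# e.g. how many random events to check to catch a 0.1% defect rate
# with 99% confidence:
n = samples_needed(0.001, confidence=0.99)
```

Note this only bounds the chance of detecting *some* corruption; systematic gaps (a whole missing month, say) are far better caught by the per-day statistics suggested above than by uniform random sampling.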

                              2 of 2 people found this helpful