
Data Backfill Options & Considerations

Question asked by vwitzel on Sep 23, 2020
Latest reply on Sep 30, 2020 by Roger Palmen

Hi all,

We need to do a one-time ingest of a large volume of time-series data, spread across a large number of CSV files (likely one file per tag), into a single PI Data Archive server. While some of the data will flow into new PI tags and therefore arrive in PI in order, other data will flow into existing tags as out-of-order (OOO) data. In planning for this process, we are trying to optimize our approach and configuration so that things run as efficiently as possible. Specifically, we are considering:

 

  1. What ingress application to use?
    Regardless of which application we use, I'll point out that we plan to run it - and place the data to be ingested - locally on the same server as the PI Data Archive to eliminate the need for the data to travel across the network. The primary applications we are considering are the PI UFL Interface, PI Connector for UFL, and the AF-SDK. In comparing these options for data backfilling, the main characteristics that come to mind are:
    • both the UFL interface and the connector process input files one at a time (Martin Hruzik, perhaps you could confirm?)
    • the connector can take advantage of multiple cores (up to four), whereas the UFL interface only uses one (as detailed here)
    • an AF-SDK application could be written to process multiple files in parallel and use multiple cores (a rough sketch of this approach follows this list)
    • while the UFL interface allows multiple instances on the same server to "scale up" its processing abilities, only one instance of the UFL connector can be installed per machine (i.e., we would have to "scale out" the UFL connector by using multiple servers)
      With the above in mind, I don't see a reason to write an AF-SDK application, as we could scale the UFL instances up/out as necessary to make maximum use of the available computing resources. In terms of deciding between the UFL interface and connector, I would lean towards the connector. Are there any other considerations the community would suggest?
  2. What, if any, PI Tuning Parameters can/should be adjusted for this type of operation?
    That may be more of a question for tech support, but I am curious to hear if the community has any input on this.
  3. What, if any, hardware specifications can/should be adjusted for this type of backfilling operation?
    The PI Data Archive is running on an Amazon EC2 virtual machine (VM), so we may have some control over the resources available to the VM. My understanding from KB00717 is that the I/O operations per second (IOPS) of the VM's hard drive (specifically, its random write speed) is the most important factor in determining how quickly data can be written to PI archive files. With that in mind, I am thinking that a VM that is (what Amazon calls) "elastic block storage (EBS) optimized" with dedicated bandwidth, or a VM using a dedicated solid-state drive, would probably be our best bet (a quick way to sanity-check random-write throughput is sketched after this list). That aside, I am curious to hear the community's thoughts on anything else we could adjust in the VM's configuration to optimize the backfill process.
  4. In the long-term, what is the optimum archive type?
    We plan to create dynamic archives for storing the backfilled data. Is there any benefit to reprocessing the dynamic archives into fixed archives once backfilling is complete? KB00848 alludes to this, and I was wondering why that would be beneficial. I assume dynamic archives grow in chunks (similar to SQL databases), so by reprocessing, any unused part of the last chunk would be recovered. Can anybody confirm or deny this?
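
For reference, here is a rough sketch of the kind of AF-SDK-driven backfill I have in mind for question 1, written in Python via pythonnet rather than as a compiled .NET application. The staging folder, chunk size, worker count, and CSV layout (one <tagname>.csv per tag containing timestamp,value rows) are assumptions on my part, and the update/buffer options shown are just one possible choice - the point is only to illustrate processing several per-tag files in parallel, not a finished implementation:

import clr
clr.AddReference("OSIsoft.AFSDK")

import csv
import glob
import os
from concurrent.futures import ThreadPoolExecutor

from System.Collections.Generic import List
from OSIsoft.AF.PI import PIServers, PIPoint
from OSIsoft.AF.Asset import AFValue
from OSIsoft.AF.Time import AFTime
from OSIsoft.AF.Data import AFUpdateOption, AFBufferOption

CSV_DIR = r"D:\backfill"   # assumed staging folder on the Data Archive server
CHUNK_SIZE = 50000         # events sent per UpdateValues call
WORKERS = 4                # number of files processed concurrently

server = PIServers().DefaultPIServer

def backfill_file(path):
    """Read one per-tag CSV (timestamp,value rows) and write it to the matching PI point."""
    tag_name = os.path.splitext(os.path.basename(path))[0]
    point = PIPoint.FindPIPoint(server, tag_name)

    chunk = List[AFValue]()
    with open(path, newline="") as f:
        for timestamp, value in csv.reader(f):
            chunk.Add(AFValue(float(value), AFTime(timestamp)))
            if chunk.Count >= CHUNK_SIZE:
                # InsertNoCompression keeps every raw event; Replace/NoReplace may be
                # more appropriate for out-of-order data landing in existing tags.
                point.UpdateValues(chunk, AFUpdateOption.InsertNoCompression,
                                   AFBufferOption.DoNotBuffer)
                chunk = List[AFValue]()
    if chunk.Count > 0:
        point.UpdateValues(chunk, AFUpdateOption.InsertNoCompression,
                           AFBufferOption.DoNotBuffer)
    return tag_name

# Process several per-tag files concurrently; adjust WORKERS to the available cores.
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    for tag in pool.map(backfill_file, glob.glob(os.path.join(CSV_DIR, "*.csv"))):
        print("finished", tag)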
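
Similarly, for question 3, here is a crude way I could sanity-check random-write throughput on the archive volume before starting the backfill. The test path, file size, and block size are arbitrary, and a proper benchmarking tool such as fio or CrystalDiskMark would give more reliable numbers; this is just a quick first pass:

import os
import random
import time

TEST_FILE = r"E:\iops_test.bin"   # assumed path on the archive volume
FILE_SIZE = 1024**3               # 1 GiB test file
BLOCK_SIZE = 64 * 1024            # 64 KiB random writes
NUM_WRITES = 2000

# Pre-allocate the test file so the random writes land inside an existing file.
with open(TEST_FILE, "wb") as f:
    f.truncate(FILE_SIZE)

block = os.urandom(BLOCK_SIZE)
fd = os.open(TEST_FILE, os.O_RDWR | getattr(os, "O_BINARY", 0))
start = time.perf_counter()
for _ in range(NUM_WRITES):
    os.lseek(fd, random.randrange(0, FILE_SIZE - BLOCK_SIZE), os.SEEK_SET)
    os.write(fd, block)
    os.fsync(fd)                  # force each write to disk so the OS cache doesn't hide it
elapsed = time.perf_counter() - start
os.close(fd)
os.remove(TEST_FILE)

print("%.0f random %d KiB writes/sec (%.1f MiB/s)" %
      (NUM_WRITES / elapsed, BLOCK_SIZE // 1024, NUM_WRITES * BLOCK_SIZE / elapsed / 1024**2))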

Thanks for your input!
