
    AF SDK performance, serial vs. parallel vs. bulk

    PetterBrodin

      AF SDK version: 2.6.1.6238

      PI Server version: Latest 2012 (3.4.390.16 patched to 3.4.390.18)

       

      After seeing some unexpected performance behaviour in an application we're developing, partially outlined in this thread, I decided to do some more extensive testing of how bulk calls actually perform, in particular RecordedValues and Snapshot.

       

      According to Ryan Gilbert in the other thread, "The bulk calls are all enabled on 390.16 and greater, except for the summaries calls. Due to an issue with error reporting, they were disabled on the client side until version 390.19."


      This matches what can be found in the AF SDK reference (2.6.0.5843) about PIPointList.RecordedValues:

      "Important

      This method will use a single bulk Remote Procedure Call if the PI Server supports it, otherwise it will issue individual RPCs in parallel. Results are available for enumeration as they returned from the PI Server."


      Test setup

      I created a small C# program using the AF SDK that does the following:

      1. Connects to a PIServer
      2. If they don't already exist, creates 100 sequentially numbered PIPoints
      3. After creating the PIPoints, inserts a total of 6.7 million events into the points. The data for each point simulates a sine wave, where the sample rate, frequency, number of bad values and duration vary from point to point
      4. Reads the config to see whether it should use a bulk call, parallel calls or serial calls
      5. Reads the config to see how many times it should execute each call
      6. Queries the snapshot value for each of the points, using the call type from step 4, the number of times from step 5. To make sure there's nothing left to be enumerated, each snapshot value is appended to a StringBuilder. Logs the time each execution takes in milliseconds
      7. Queries all the data from 2013-01-01 to 2014-01-01 for all the points, using RecordedValues with the call type from step 4, the number of times from step 5. To make sure there's nothing left to be enumerated, the name of each PIPoint and the number of values is appended to a StringBuilder. Logs the time each execution takes in milliseconds

       

      I've run the program six times: once for each call type on my local machine, connecting to the PI Server across a network with some latency, and once for each call type with the program running directly on the PI Server.

       

      Results

      Snapshot on server

      (All times in milliseconds.)

      Run      Parallel   Serial   Bulk with parallel reading
      1        259        334      283
      2        20         100      9
      3        23         113      8
      4        21         90       8
      5        20         90       11
      Total    343        727      319
      Average  68.6       145.4    63.8

       

      This behaves as I'd expect: with low latency due to the program running directly on the server, parallel isn't lagging a lot behind bulk, with serial being somewhat slower. There also appears to be some caching/query optimization going on with all three methods, as the first call is consistently slower.

       

      Snapshot over network

      Run      Parallel   Serial   Bulk with parallel reading
      1        437        402      96
      2        250        259      11
      3        249        231      13
      4        259        234      12
      5        272        247      11
      Total    1467       1373     143
      Average  293.3      274.6    28.6


      Here we can really see the value of bulk calls. The parallel and serial calls behave similarly, since both need expensive round-trips (though I expected parallel to outperform serial), while bulk leaves them both in the dust. As with the previous example, there seems to be some magic happening behind the scenes, making subsequent calls faster. I guess with further testing one could query just a subset of the tags in each call to narrow down what's being cached.

       

      RecordedValues on server

      Run      Parallel   Serial    Bulk with parallel reading
      1        7004       21663     21605
      2        7084       20460     21454
      3        6651       21250     21718
      4        7654       21265     21838
      5        8355       19634     21536
      Total    36748      104272    108151
      Average  7349.6     20854.4   21630.2

       

       

      This is not expected at all. Not only does the bulk call get its ass kicked by parallel calls, it performs as badly as serial calls, which calls into question what the documentation says about issuing a single bulk RPC and only falling back to individual parallel RPCs when the server doesn't support bulk.

       

      RecordedValues on client

      Run      Parallel   Serial    Bulk with parallel reading
      1        11805      23259     20388
      2        11575      22754     20113
      3        11144      21303     20053
      4        11731      21246     19091
      5        10867      26302     19662
      Total    57122      114864    99307
      Average  11424.4    22972.8   19861.4

       

      This is pretty much the same result as above, with a few seconds tacked on due to network latency.

       

      So, what is going on here? Aren't bulk RPCs enabled for RecordedValues calls, or is there some other voodoo magic at work?


      Thanks in advance for all help!

        • Re: AF SDK performance, serial vs. parallel vs. bulk
          Marcos Vainer Loeff

          Hello Petter,

           

          Could you please share the code snippet of your test with us? I see two advantages in this request:

           

          • Make sure the community understands exactly what you are doing with the Parallel, Serial and "Bulk with parallel reading" calls.
          • Some members could use it to run the same tests and post their results here. This might help us verify that the behavior you are seeing is not specific to your environment.
            • Re: AF SDK performance, serial vs. parallel vs. bulk
              PetterBrodin

              I'll clean the code up a little to make sure everyone is able to run it, then post it here.

              • Re: AF SDK performance, serial vs. parallel vs. bulk
                PetterBrodin

                I've attached both the full Visual Studio solution and a build of it for those who just want to run it.

                 

                Before running, open the Settings.Settings (in the solution) or PIPerformanceTest.exe.config (in the built version) file and set it to match your setup. Here's an explanation of the various parameters.

                • bulk - whether to use bulk calls
                • serial - if bulk is set to false, this chooses whether to use serial or parallel calls
                • createData - makes the program delete old tags (if they exist) and create new ones with data before performance testing the queries
                • numPoints - how many tags to create and query
                • pointNameRoot - The first part of the name of the tags to be created and queried. They'll be named "PerformanceTestTag0", "PerformanceTestTag1" etc.
                • cleanup - whether to delete the tags after the program executes
                • retries - the number of times to run each query
                • server - network address to the server
                • username - the username to connect with
                • password - the user's password

                 

                • usePerformanceTags - a leftover I forgot to delete

                 

                The project consists of three classes:

                 

                AFValueGenerator.cs


                Two static methods for creating one or more AFValues objects containing values that simulate a sine wave.

                using System;
                using System.Collections.Generic;
                using System.Diagnostics;
                using System.Text;
                using OSIsoft.AF.Asset;
                using OSIsoft.AF.Time;
                
                namespace PIPerformanceTest
                {
                    class AFValueGenerator
                    {
                        public static Dictionary<string, AFValues> GetPerformanceTestTags(List<string> tagNames)
                        {
                            Dictionary<string, AFValues> results = new Dictionary<string, AFValues>();
                
                            StringBuilder sb = new StringBuilder();
                
                            for(int i = 0; i < tagNames.Count; i++)
                            {
                                var piPointName = tagNames[i];
                
                                int startYear = 2013;
                                DateTime start = new DateTime(startYear, 1, 1);
                
                                int endMonth = 1 + i % 12;
                                DateTime end = new DateTime(2013, endMonth, 2);
                
                                //Every third Tag will have a sampling rate of every minute, every hour and every day
                                double sampleRateInSeconds = 0;
                                int j = i % 3;
                                switch (j)
                                {
                                    case 0:
                                        sampleRateInSeconds = 60;
                                        break;
                                    case 1:
                                        sampleRateInSeconds = 60 * 60;
                                        break;
                                    case 2:
                                        sampleRateInSeconds = 60 * 60 * 24;
                                        break;
                                }
                
                                //Every fourth Tag will have a frequency of every minute, every hour, every day and every year
                                double frequency = 0;
                                int k = i % 4;
                                switch (k)
                                {
                                    case 0:
                                        frequency = 60;
                                        break;
                                    case 1:
                                        frequency = 60 * 60;
                                        break;
                                    case 2:
                                        frequency = 60 * 60 * 24;
                                        break;
                                    case 3:
                                        frequency = 60 * 60 * 24 * 365;
                                        break;
                                }
                
                                double amplitude = 1 + i % 5;
                
                                //No bad TagRecords, or one bad TagRecord every 5, 10, 15 or 20 records
                                int badAtEvery = (i % 5) * 5;
                
                                AFValues afValues = GenerateSineTag(start, end, sampleRateInSeconds, frequency, amplitude, badAtEvery);
                           
                                results[piPointName] = afValues;
                
                                sb.Append(piPointName + "\t" + start + "\t" + end + "\t" + sampleRateInSeconds + "\t" + frequency + "\t" + afValues.Count + "\t" + amplitude + "\tOne bad tag at every " + badAtEvery + " records\t\r\n");
                            }
                
                            Debug.Write(sb.ToString());
                
                            return results;
                        }
                
                        public static AFValues GenerateSineTag(DateTime start, DateTime end, double sampleRate, double frequency, double amplitude, int badAtEvery)
                        {
                            AFValues afValues = new AFValues();
                
                
                            //Number of events equals the number of seconds from start to finish, divided by the sampleRate
                            double numberOfEvents = (end.Subtract(start).TotalSeconds) / sampleRate;
                
                            //This is possibly a little off in the case where the sampleRate doesn't resolve perfectly 
                            TimeSpan valueInterval = new TimeSpan(0, 0, 0, 0, (int)(sampleRate * 1000));
                
                            //Keeping track of the timestamp of the current TagRecord
                            TimeSpan timeSinceStart = new TimeSpan(0);
                
                            for (int i = 0; i <= (int)numberOfEvents; i++)
                            {
                                AFValue afv = new AFValue();
                
                                afv.Value = amplitude * Math.Sin(2 * Math.PI * frequency * timeSinceStart.TotalSeconds);
                                afv.Timestamp = new AFTime(start.Add(timeSinceStart));
                                afv.Status = ((badAtEvery > 0) && (i + 1) % badAtEvery == 0) ? AFValueStatus.Bad : AFValueStatus.Good;
                
                                afValues.Add(afv);
                
                                timeSinceStart = timeSinceStart.Add(valueInterval);
                            }
                
                            return afValues;
                        }
                    }
                }
                
                
                
                

                 

                PIServerImplementation.cs


                The class that's responsible for doing all the communication with PI. There's very little in the way of error handling here, so you'll need to debug in order to look closer at errors that occur.

                using System;
                using System.Collections.Generic;
                using System.Diagnostics;
                using System.Net;
                using System.Text;
                using System.Threading.Tasks;
                using OSIsoft.AF;
                using OSIsoft.AF.Asset;
                using OSIsoft.AF.Data;
                using OSIsoft.AF.PI;
                using OSIsoft.AF.Time;
                
                namespace PIPerformanceTest
                {
                    class PIServerImplementation
                    {
                        public PIServer Server;
                
                        private bool IsConnected
                        {
                            get
                            {
                                return Server != null && Server.ConnectionInfo.IsConnected;
                            }
                        }
                
                        public void Connect(string serverName, string username, string password)
                        {
                            if (!IsConnected)
                            {
                                try
                                {
                                    PIServers piServers = new PIServers();
                                    Server = piServers[serverName];
                
                                    NetworkCredential netCred = new NetworkCredential(username, password);
                
                                    Server.Connect(netCred);
                                }
                                catch (Exception ex)
                                {
                                    string errormessage = "Could not connect to PI server " + serverName + " Error message: " + ex.Message;
                
                                    Debug.WriteLine(errormessage);
                                    throw new Exception(errormessage);
                                }
                            }
                            else
                            {
                                Debug.WriteLine("Already connected to " + serverName);
                            }
                        }
                
                        public void DeleteTags(List<string> tagNames)
                        {
                            Server.DeletePIPoints(tagNames);
                        }
                
                        public void CreateTags(List<string> tagNames)
                        {
                            Server.CreatePIPoints(tagNames);
                        }
                
                        public void SetTags(List<String> tagNames)
                        {
                            PIPointList piPoints = new PIPointList(PIPoint.FindPIPoints(Server, tagNames));
                
                            Dictionary<string, AFValues> tags = AFValueGenerator.GetPerformanceTestTags(tagNames);
                
                            List<AFValue> results = new List<AFValue>();
                
                            foreach(var piPoint in piPoints)
                            {
                                AFValues afValues = tags[piPoint.Name];
                
                                foreach (var afv in afValues)
                                {
                                    afv.PIPoint = piPoint;
                
                                    results.Add(afv);
                                }
                            }
                
                            var errors = Server.UpdateValues(results, AFUpdateOption.InsertNoCompression);
                        }
                
                        public void Snapshots(List<string> tagNames, QueryMethod queryMethod)
                        {
                            PIPointList piPoints = new PIPointList(PIPoint.FindPIPoints(Server, tagNames));
                
                            StringBuilder sb = new StringBuilder();
                
                            if (queryMethod == QueryMethod.Bulk)
                            {
                                AFListResults<PIPoint, AFValue> snapshots = piPoints.Snapshot();
                
                                //TODO: does this affect performance WRT enumeration?
                                Parallel.ForEach(snapshots, snapshot =>
                                {
                                    sb.Append(snapshot.Value + ", ");
                                });
                            }
                            else if(queryMethod == QueryMethod.Parallel)
                            {
                                Parallel.ForEach(piPoints, piPoint =>
                                {
                                    var snapshot = piPoint.Snapshot();
                                    sb.Append(snapshot.Value + ", ");
                                });
                
                            }
                            else if (queryMethod == QueryMethod.Serial)
                            {
                                foreach(var piPoint in piPoints)
                                {
                                    var snapshot = piPoint.Snapshot();
                                    sb.Append(snapshot.Value + ", ");
                                }
                
                            }
                
                
                            //Console.WriteLine(sb.ToString());
                        }
                
                        public void Count(List<string> tagNames, QueryMethod queryMethod)
                        {
                            PIPointList piPoints = new PIPointList(PIPoint.FindPIPoints(Server, tagNames));
                
                            StringBuilder sb = new StringBuilder();
                
                            AFTimeRange timeRange = new AFTimeRange(
                                new AFTime(new DateTime(2013, 1, 1)),
                                new AFTime(new DateTime(2014, 1, 1))
                                );
                            PIPagingConfiguration pagingConfiguration = new PIPagingConfiguration(PIPageType.TagCount, 100);
                
                            if (queryMethod == QueryMethod.Bulk)
                            {
                                var summaries = piPoints.Summary(
                                    timeRange,
                                    AFSummaryTypes.Count,
                                    AFCalculationBasis.EventWeighted,
                                    AFTimestampCalculation.Auto,
                                    pagingConfiguration
                                    );
                
                                foreach (IDictionary<AFSummaryTypes, AFValue> summaryDict in summaries)
                                {
                                    sb.Append(summaryDict[AFSummaryTypes.Count].PIPoint.Name + ": " + summaryDict[AFSummaryTypes.Count].Value + ", ");
                                }
                
                            }
                            else if(queryMethod == QueryMethod.Parallel)
                            {
                                Parallel.ForEach(piPoints, piPoint =>
                                {
                                    var summary = piPoint.Summary(
                                        timeRange,
                                        AFSummaryTypes.Count,
                                        AFCalculationBasis.EventWeighted,
                                        AFTimestampCalculation.Auto);
                                    sb.Append(summary[AFSummaryTypes.Count].PIPoint.Name + ": " + summary[AFSummaryTypes.Count].Value + ", ");
                                });
                
                            }
                            else if (queryMethod == QueryMethod.Serial)
                            {
                                foreach(var piPoint in piPoints)
                                {
                                    var summary = piPoint.Summary(
                                        timeRange,
                                        AFSummaryTypes.Count,
                                        AFCalculationBasis.EventWeighted,
                                        AFTimestampCalculation.Auto);
                                    sb.Append(summary[AFSummaryTypes.Count].PIPoint.Name + ": " + summary[AFSummaryTypes.Count].Value + ", ");
                                }
                
                            }
                
                            //Console.WriteLine(sb.ToString());
                        }
                
                        public void RecordedValues(List<string> tagNames, QueryMethod queryMethod)
                        {
                            PIPointList piPoints = new PIPointList(PIPoint.FindPIPoints(Server, tagNames));
                
                            AFTimeRange timeRange = new AFTimeRange(
                                new AFTime(new DateTime(2013, 1, 1)),
                                new AFTime(new DateTime(2014, 1, 1))
                                );
                
                            PIPagingConfiguration pagingConfiguration = new PIPagingConfiguration(PIPageType.TagCount, 100);
                
                            StringBuilder sb = new StringBuilder();
                
                            if (queryMethod == QueryMethod.Bulk)
                            {
                                var afValuesList = piPoints.RecordedValues(
                                    timeRange,
                                    AFBoundaryType.Interpolated,
                                    null,
                                    false,
                                    pagingConfiguration
                                    );
                
                                Parallel.ForEach(afValuesList, afValues =>
                                {
                                    sb.Append(afValues.Count + ", ");
                                });
                            }
                            else if(queryMethod == QueryMethod.Parallel)
                            {
                                Parallel.ForEach(piPoints, piPoint =>
                                {
                                    var afValues = piPoint.RecordedValues(
                                        timeRange,
                                        AFBoundaryType.Interpolated,
                                        null,
                                        false
                                    );
                                    sb.Append(afValues.Count + ", ");
                                });
                            }
                       
                            else if (queryMethod == QueryMethod.Serial)
                            {
                                foreach (var piPoint in piPoints)
                                {
                                    var afValues = piPoint.RecordedValues(
                                        timeRange,
                                        AFBoundaryType.Interpolated,
                                        null,
                                        false
                                    );
                                    sb.Append(afValues.Count + ", ");
                                }
                            }
                
                            //Console.WriteLine(sb.ToString());
                        }
                    }
                }
                
                
                
                

                 

                Program.cs


                The console application. Runs through once based on the settings read from the config file.

                 

                using System;
                using System.Collections.Generic;
                using System.Diagnostics;
                
                namespace PIPerformanceTest
                {
                    class Program
                    {
                        private static PIServerImplementation _server;
                
                        static void Main(string[] args)
                        {
                            bool createData = Properties.Settings.Default.createData;
                            bool cleanup = Properties.Settings.Default.cleanup;
                            int retries = Properties.Settings.Default.retries;
                
                            //Decide the method of querying
                            QueryMethod queryMethod;
                            if (Properties.Settings.Default.bulk)
                            {
                                queryMethod = QueryMethod.Bulk;
                            }
                            else
                            {
                                queryMethod = Properties.Settings.Default.serial ? QueryMethod.Serial : QueryMethod.Parallel;
                            }
                
                            var stopwatch = new Stopwatch();
                
                            _server = new PIServerImplementation();
                
                            _server.Connect(
                                Properties.Settings.Default.server,
                                Properties.Settings.Default.username,
                                Properties.Settings.Default.password
                            );
                
                            List<string> tagNames = GettagNames();
                
                            if (createData)
                            {
                                Console.WriteLine("Deleting old tag(s)");
                                _server.DeleteTags(tagNames);
                
                                Console.WriteLine("Creating tag(s)");
                                _server.CreateTags(tagNames);
                
                                //Insert values to tags, measure performance
                                Console.WriteLine("Inserting values to tag(s)");
                                stopwatch.Start();
                                _server.SetTags(tagNames);
                                stopwatch.Stop();
                                Console.WriteLine("Set tags took {0}", stopwatch.ElapsedMilliseconds);
                                stopwatch.Reset();
                            }
                
                            switch (queryMethod)
                            {
                                case QueryMethod.Bulk:
                                    Console.WriteLine("Running performance test with bulk calls");
                                    break;
                                case QueryMethod.Parallel:
                                    Console.WriteLine("Running performance test with parallel calls");
                                    break;
                                case QueryMethod.Serial:
                                    Console.WriteLine("Running performance test with serial calls");
                                    break;
                            }
                
                            //Query snapshots, measure performance
                            for (int i = 0; i < retries; i++)
                            {
                                Snapshots(stopwatch, tagNames, queryMethod);
                            }
                
                            //Query summaries to count how many values are available
                            //for (int i = 0; i < retries; i++)
                            //{
                            //    Summaries(stopwatch, tagNames, queryMethod);
                            //}
                
                            //Query recordedValues, measure performance
                            for (int i = 0; i < retries; i++)
                            {
                                RecordedValues(stopwatch, tagNames, queryMethod);
                            }
                
                
                            if (cleanup)
                            {
                                //Delete tags
                                _server.DeleteTags(tagNames);
                            }
                
                            Console.WriteLine(Environment.NewLine + Environment.NewLine + "Press Enter key to exit");
                            Console.ReadLine();
                        }
                
                        private static void RecordedValues(Stopwatch stopwatch, List<string> tagNames, QueryMethod queryMethod)
                        {
                            stopwatch.Start();
                            _server.RecordedValues(tagNames, queryMethod);
                            stopwatch.Stop();
                            double elapsed = stopwatch.ElapsedMilliseconds;
                            Console.WriteLine("Querying RecordedValues took {0}", elapsed);
                            stopwatch.Reset();
                        }
                
                        private static void Summaries(Stopwatch stopwatch, List<string> tagNames, QueryMethod queryMethod)
                        {
                            stopwatch.Start();
                            _server.Count(tagNames,queryMethod);
                            stopwatch.Stop();
                            double elapsed = stopwatch.ElapsedMilliseconds;
                            Console.WriteLine("Querying summaries took {0}", elapsed);
                            stopwatch.Reset();
                        }
                
                        private static void Snapshots(Stopwatch stopwatch, List<string> tagNames, QueryMethod queryMethod)
                        {
                            stopwatch.Start();
                            _server.Snapshots(tagNames, queryMethod);
                            stopwatch.Stop();
                            double elapsed = stopwatch.ElapsedMilliseconds;
                            Console.WriteLine("Querying snapshots took {0}", elapsed);
                            stopwatch.Reset();
                        }
                
                        private static List<string> GettagNames()
                        {
                            var results = new List<string>();
                
                            for (int i = 0; i < Properties.Settings.Default.numPoints; i++)
                            {
                                results.Add(Properties.Settings.Default.pointNameRoot + i);
                            }
                
                            return results;
                        }
                    }
                
                    enum QueryMethod
                    {
                        Bulk,
                        Parallel,
                        Serial
                    }
                }
                
              • Re: AF SDK performance, serial vs. parallel vs. bulk
                Rhys Kirk

                What are the resources of your PI Server, and how do they behave with each test? I am thinking along the lines of memory here.

                 

                Point initialization could explain the increase in time for the 1st call. Are you disconnecting between tests? Connection overhead could also explain the 1st call increase.

                 

                There could be a trade-off with bulk calls and large event retrieval (for example recorded values) in a bulk call. You could imagine that a bulk call is going to assemble everything (all the events) in a single thread for all the passed PI Points, and if that event count is large it will take longer to process. With the parallel calls you are spreading the load across multiple subsystem threads and returning the data in batches, so as long as your latency isn't too large you will likely see quicker results across the multiple threads offered by parallelism rather than the single bulk thread. Certain operations will definitely perform better than others over higher latency connections; I think RecordedValues is likely one of the poorer performing calls, especially depending on the density of data that needs to be assembled. There is probably a magic formula needed (I am sure I've requested that formula on another thread) to know, based on conditions, what the optimal way to retrieve data is... the answer always seems to be "it depends".

                 

                Hope this helps...

                  • Re: AF SDK performance, serial vs. parallel vs. bulk
                    PetterBrodin

                    I just ran the program again while looking at the Performance tab of the Task Manager, and nothing really sticks out there:

                     

                    Call type   CPU                                                   Memory
                    Bulk        Goes from idle to around 16%; a few short spikes      No huge change
                    Parallel    Goes from idle to fluctuations between 40% and 90%    No huge change
                    Serial      Goes from idle to around 16%                          No huge change

                     

                    "Point initialization could explain the increase in time for the 1st call. Are you disconnecting between tests? Connection overhead could also explain the 1st call increase."

                    It could be the point initialization, but I'm making a new call to FindPIPoints for each iteration of the test. I'm connecting at the very start of the program, and that's not counted as part of the timer.


                    I get what you're saying WRT the cost of bulk calls, but the combination of the time it takes to run and how similarly it behaves to a serial call leads me to believe there's something else going on here.

                      • Re: AF SDK performance, serial vs. parallel vs. bulk
                        Rhys Kirk

                        Bulk and Serial will behave similarly on the server end, though, unless of course I misunderstand the implementation of bulk in this respect. The "bulk" is really about stopping network latency from dominating the end-to-end time of the data retrieval; the key thing is that on the server side the way the data is retrieved is probably the same across all three types of calls. What bulk lets you do is queue up the PI Points for which data is to be retrieved on the PI Server side of the RPC, instead of keeping that list on the client and serially requesting the data.

                        If bulk RPC only gets executed by a single thread then you are in effect running in serial on the PI Server - unless there is some optimisation within the bulk RPC on the receiver - without the network latency in between the data retrieval for each PI Point.

                         

                        The initialization overhead is skewing your results too: for bulk calls you would expect snapshots on the server to be faster than snapshots over the network, but you can see the first call takes 283 ms on the server versus 96 ms over the network.

                         

                        The way the PI Server likely executes serial requests for data and the bulk data request is the same; the difference is the RPC that is called. You see evidence of this in your RecordedValues results: they both perform similarly. The additional overhead for the bulk RPC may in fact be in how the procedure is handled by the receiver, in that it has to handle the complete dataset, which means the list of values to enumerate is larger (if, for example, any list methods are performed before returning the resulting data set) than the smaller per-PI-Point data sets.

                        Your Recorded Values on the client then shows the effect of the network latency, around 22 seconds of latency. If you have near 0 latency then you probably will see closer comparisons.

                      • Re: AF SDK performance, serial vs. parallel vs. bulk
                        PetterBrodin

                        Looking at the CPU use a bit closer, there's a spike at the very start of every iteration when doing bulk calls, then it reduces drastically until the call is finished and the next call starts. With serial calls, there's no such spike.

                        [Attached image: Performance.png]

                      • Re: AF SDK performance, serial vs. parallel vs. bulk
                        Dan Fishman

                        Thanks for sharing! The first run is typically the slowest since you are likely loading the data from disk into memory. When testing, I've been ignoring my first run, since the next run is faster if I haven't tested in a while.

                         

                        I'm running several similar tests using bulk, serial, parallel, and parallel with bulk (using a Partitioner class that Rick Davin used). I've found that Parallel has been giving me the fastest results as well. When querying for smaller amounts of data (300K values), bulk, parallel and parallel with bulk are relatively close. When I return ~2.8 million values using 350 tags, I've noticed parallel is the winner by far.

                         

                        I'm curious what your results are if you query for only 1 month of data. Based on my tests, I would expect bulk and parallel to be closer. What is your paging configuration set to?

                         

                        Regards,

                         

                        Dan

                          • Re: AF SDK performance, serial vs. parallel vs. bulk
                            PetterBrodin

                            I guess I should add a config to ignore the first run when doing the same query repeatedly in order to avoid muddying the data.
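
                            For what it's worth, here's a minimal sketch of what that could look like in Program.cs; the ignoreFirstRun flag is hypothetical and not part of the attached code:

                            bool ignoreFirstRun = true; //hypothetical flag, would come from Properties.Settings in practice

                            if (ignoreFirstRun)
                            {
                                //One untimed warm-up query so caching doesn't skew the first measured run
                                _server.RecordedValues(tagNames, queryMethod);
                            }

                            //Timed runs, same as before
                            for (int i = 0; i < retries; i++)
                            {
                                RecordedValues(stopwatch, tagNames, queryMethod);
                            }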

                             

                            If I query a month worth of data, I get the following results:

                                

                            On client

                            Run      Parallel (just reading)   Serial (just reading)   Bulk + parallel reading, paging 100   Bulk + parallel reading, paging 10
                            1        2464                      4638                    3362                                  3552
                            2        2043                      4231                    2931                                  3491
                            3        1844                      4101                    3001                                  3106
                            4        1803                      3757                    2916                                  3149
                            5        1739                      4235                    3046                                  3430
                            Total    9893                      20962                   15256                                 16728
                            Average  1978.6                    4192.4                  3051.2                                3345.6

                             

                            On server

                                

                            Run      Parallel (just reading)   Serial (just reading)   Bulk with parallel reading
                            1        1281                      3091                    3193
                            2        880                       2865                    2903
                            3        941                       2881                    3035
                            4        945                       2756                    3023
                            5        969                       2806                    3056
                            Total    5016                      14399                   15210
                            Average  1003.2                    2879.8                  3042

                             

                            I've normally had the paging set to TagCount with 100, but I ran a quick test with it set to 10 as well, which was about 10% slower than 100.

                             

                            So you're right that they're closer, with bulk now being somewhere in between parallel and serial in terms of performance.

                          • Re: AF SDK performance, serial vs. parallel vs. bulk
                            Rick Davin

                            I've given your code a cursory review.  I think the label of "Bulk with parallel reading" is confusing, because it's really just "Bulk".  I say this because on the now defunct vCampus I had recently posted some code that did indeed perform (bulk+parallel), which is to say it would chop up a large number of PIPoints into buckets of 100, and then parallel fetch data for the buckets using a bulk call.  If I had 5000 points, then I would have 50 buckets to be fetched in parallel.  But with your code, you are fetching all tags in 1 call so it truly is bulk only.
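
                            To make the distinction concrete, here's a rough sketch of the bucketed bulk+parallel approach described above (this is only an illustration, not the original vCampus code): split the point list into buckets of 100 and issue one bulk RecordedValues call per bucket, with the buckets fetched in parallel. It assumes the usings already present in PIServerImplementation.cs plus System.Linq, and reuses the Server field and the 2013 time range from the test program; the lock is there because StringBuilder is not thread safe, as noted below.

                            public void RecordedValuesBulkParallel(List<string> tagNames)
                            {
                                //Find all the points, then chop them into buckets of 100
                                var allPoints = PIPoint.FindPIPoints(Server, tagNames).ToList();
                                var buckets = new List<PIPointList>();
                                for (int i = 0; i < allPoints.Count; i += 100)
                                {
                                    buckets.Add(new PIPointList(allPoints.Skip(i).Take(100)));
                                }

                                AFTimeRange timeRange = new AFTimeRange(
                                    new AFTime(new DateTime(2013, 1, 1)),
                                    new AFTime(new DateTime(2014, 1, 1)));

                                StringBuilder sb = new StringBuilder();
                                object sbLock = new object();

                                //One bulk RPC per bucket; buckets are fetched concurrently
                                Parallel.ForEach(buckets, bucket =>
                                {
                                    var paging = new PIPagingConfiguration(PIPageType.TagCount, 100);
                                    var results = bucket.RecordedValues(
                                        timeRange, AFBoundaryType.Interpolated, null, false, paging);

                                    foreach (AFValues afValues in results)
                                    {
                                        lock (sbLock) //StringBuilder is not thread safe
                                        {
                                            sb.Append(afValues.Count + ", ");
                                        }
                                    }
                                });
                            }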

                             

                            Note that StringBuilder does not play nice across parallel calls, whether your query method is Bulk, Parallel, or Serial.  To use it properly you must issue a lock, which could degrade performance.  It also muddies the waters as to what you are measuring: fetching data via different methods, OR locking and appending to StringBuilder repeatedly.

                             

                            (Since PISquare is so new, I had to try to find the old-but-recent vCampus post.) The link below has some sample code on using bulk+parallel, and also demonstrates locking inside parallel calls:

                             

                            https://pisquare.osisoft.com/message/31456#31456

                              • Re: AF SDK performance, serial vs. parallel vs. bulk
                                PetterBrodin

                                I see what you mean about "parallel reading". I referred to it that way because I enumerate the results in parallel, in case that has any effect on performance.

                                 

                                And I know the StringBuilder isn't thread safe, but in this case I simply wanted a cheap object I could use to make sure the collection is actually enumerated. As you can see I'm not doing anything with the SB after I'm done iterating the AFValues objects.

                              • Re: AF SDK performance, serial vs. parallel vs. bulk
                                Mike Zboray

                                Note that using Parallel.ForEach without constraining the parallelism (use ParallelOptions.MaxDegreeOfParallelism) can be dangerous. If one server is slow or timing out, the TPL can exhaust all threads in the .NET thread pool, using a huge amount of memory. There was a defect reported recently on memory usage/load time of the analysis engine at startup that we attributed to this. We were trying to load analyses in parallel on multiple threads using Parallel.ForEach, and if the AF server was slow to respond, a large number of threads were created, actually slowing down loading. This is fixed in the latest internal builds, so I would expect to see it in AF 2.7.
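
                                As an illustration (not part of the posted test code), the parallel branch in PIServerImplementation.RecordedValues could be capped along these lines; the value 8 is an arbitrary example, the Interlocked tally (requires System.Threading) is just a stand-in for the shared StringBuilder, and piPoints/timeRange are the variables from the posted method:

                                long totalCount = 0; //thread-safe tally instead of a shared StringBuilder
                                var options = new ParallelOptions { MaxDegreeOfParallelism = 8 }; //8 is an arbitrary example cap

                                Parallel.ForEach(piPoints, options, piPoint =>
                                {
                                    var afValues = piPoint.RecordedValues(
                                        timeRange,
                                        AFBoundaryType.Interpolated,
                                        null,
                                        false);
                                    Interlocked.Add(ref totalCount, afValues.Count);
                                });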

                                  • Re: AF SDK performance, serial vs. parallel vs. bulk
                                    Rick Davin

                                    TPL seems so promising but there's a lot about it that scares me. Funny thing is the one thing I'm not scared about is the "out of order" processing nature of it all. I'm cool with that.

                                     

                                    But there's the whole thread-safe concurrency thing, setting MaxDegreeOfParallelism (which isn't hard to set but is just one of those extra things you must do), setting a loopState object, deciding when to lock, and finally it's so easy to create a parallel loop that is slower than a simple serial loop. Each loop spawns new threads, which have their own overhead, which can degrade performance. So then you try to improve it using a Partitioner. Suddenly, simple code that was easy enough to follow is now very complex and tough to debug. And the bad news is you don't know if the effort is worth it until you actually code it and measure it.

                                     

                                    And I'm speaking in very general terms here that have nothing to do with AFSDK and everything to do with .NET's TPL.

                                    • Re: AF SDK performance, serial vs. parallel vs. bulk
                                      Rhys Kirk

                                      "There was a defect reported on memory usage of the analysis engine a while back that we attributed to this."

                                       

                                      There's also a memory leak (patched for 2.7+) in PI.Net that shows up under a high number of threads.

                                    • Re: AF SDK performance, serial vs. parallel vs. bulk
                                      rgilbert

                                      Just curious. What is the latency between the client and the server?

                                       

                                      Also can you re-run your test changing this:

                                                      var afValuesList = piPoints.RecordedValues(
                                                          timeRange,
                                                          AFBoundaryType.Interpolated,
                                                          null,
                                                          false,
                                                          pagingConfiguration
                                                          );

                                                      Parallel.ForEach(afValuesList, afValues =>
                                                      {
                                                          sb.Append(afValues.Count + ", ");
                                                      });
                                       

                                      To this:

                                                      var afValuesList = piPoints.RecordedValues(
                                                          timeRange,
                                                          AFBoundaryType.Interpolated,
                                                          null,
                                                          false,
                                                          pagingConfiguration
                                                          );

                                                      foreach (AFValues afValues in afValuesList)
                                                      {
                                                          sb.Append(afValues.Count + ", ");
                                                      }
                                       

                                      As Mike Zboray mentioned, I'm concerned that the TPL is spawning a ton of unnecessary threads. The IEnumerable returned wraps a BlockingCollection which blocks until results come back. I'm concerned the waiting is causing a bunch of threads to be spawned. Also you might want to try it again with several different page sizes: 10, 20, 50, etc.

                                        • Re: AF SDK performance, serial vs. parallel vs. bulk
                                          PetterBrodin

                                          The latency is just a handful of milliseconds, so I think the majority of the time difference in RecordedValues is simply due to the time it takes to transfer the data over the network.

                                           

                                          I ran the test again, both with the parallel and the serial reading, and at different page sizes. Each data cell is the average time across 5 runs:

                                           

                                          Page size   Parallel.ForEach (average of 5 runs)   foreach (average of 5 runs)
                                          10          23879                                  22926.2
                                          20          22708.4                                20621
                                          50          20216.2                                19615.8
                                          100         21144                                  20877.8
                                          Average     21986.9                                21010.2

                                           

                                          As you can see, the parallel performs ever so slightly worse than the serial reading, but not by a huge amount.

                                        • Re: AF SDK performance, serial vs. parallel vs. bulk
                                          Mike Zboray

                                          Following on what @Rhys was saying in his earlier comments: you've got a lot of data. I suspect internally we can't actually do much better with a bulk call since there is so much data in each tag. Remember, bulk calls help you when they reduce the total number of server calls. With 6.7M values, a bulk call doesn't help so much because we will have to transfer the data in reasonable sized chunks anyway (say 10K values at a time, I don't know what the actual size is). The bulk call would probably help if each tag was somewhat sparsely populated (say 100 values per year). Of course, knowing whether a tag is dense or sparse in any particular time range is somewhat cumbersome, requiring a summary call to the PI server.
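
                                          To illustrate that last point, here's a rough sketch (not something from the thread) of using the bulk Count summary already shown in the posted Count() method to gauge density before picking a strategy. The 1,000-events-per-tag threshold is invented purely for illustration, and per the quote at the top of the thread the SDK won't use a bulk RPC for summaries against servers older than 3.4.390.19:

                                          var paging = new PIPagingConfiguration(PIPageType.TagCount, 100);
                                          var summaries = piPoints.Summary(
                                              timeRange,
                                              AFSummaryTypes.Count,
                                              AFCalculationBasis.EventWeighted,
                                              AFTimestampCalculation.Auto,
                                              paging);

                                          //Add up the event counts for all the points in the list
                                          long totalEvents = 0;
                                          foreach (IDictionary<AFSummaryTypes, AFValue> summaryDict in summaries)
                                          {
                                              totalEvents += Convert.ToInt64(summaryDict[AFSummaryTypes.Count].Value);
                                          }

                                          //Made-up threshold: treat fewer than ~1000 events per tag as "sparse"
                                          bool sparse = totalEvents < 1000L * piPoints.Count;
                                          Console.WriteLine(sparse
                                              ? "Sparse data: a single bulk call should do well"
                                              : "Dense data: per-point parallel calls may be faster");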

                                          • Re: AF SDK performance, serial vs. parallel vs. bulk
                                            awoodall

                                            I think the Bulk call bottleneck is that it only seems to construct the resulting IEnumerable<AFValues> on a single background thread. The PI server RPC archive bulk call seems to be relatively fast to return results for processing, but I think the AF SDK can get bogged down marshaling a large quantity of results into AFValue objects. With multiple threads doing the calls in parallel, it seems it can get through that work faster by taking advantage of the whole CPU. For instance, on my 4 CPU system I saw the Bulk call only get up to 25% CPU utilization max, whereas with Parallel I saw 100% CPU utilization while processing.

                                              • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                rgilbert

                                                You get a task for each bulk query, and each PI Data Archive in the point list gets its own bulk query. So if you had a point list with points from four different PI Data Archives, you would have four tasks, each processing the returned pages from their respective PI Data Archive in parallel. Each task executes a page processor to process the incoming pages of values and does any post-processing on them, such as UOM conversion or interpolated boundary calculation. As the values are processed they are immediately added to a ConcurrentQueue, which is wrapped by a .NET BlockingCollection. The BlockingCollection exposes an enumerable that allows you to enumerate the entries in the queue as they are added. As new entries in the queue become available, your foreach loop continues to iterate. So the results do not need to be completely processed before you are able to enumerate them. Also, the thread that enumerates the results is not the same one that does the post-processing, so the results are being generated independently. The underlying native layer has also been optimized so that we can now pass in a delegate to generate our AFValue objects directly from the primitives on the PI events; we no longer have to do an extra marshaling step as of AF SDK 2.6. The partial results being available wouldn't matter in this case, since the stopwatch is timing the process from start to end.

                                                 

                                                In Petter Brodin's example, he has 6.7 million events, and they all come back in one page or logical chunk, so we don't get the benefit of processing partial results and making them available while additional pages are fetched. I imagine that the entire sequence of TCP packets containing those 6.7 million events has to be received before the native layer verifies the RPC result is valid and hands it over to us for processing. I agree with Arnold that you would see better results with parallel, especially if you have little latency. You get the benefit of more operations going in parallel and more threads processing results. The PI Data Archive by default uses 16 threads to process bulk queries, so those 16 threads would be used to collect the 6.7 million events spread across the 100 tags. If you were using a Parallel.ForEach loop without clamping the MaxDegreeOfParallelism option (Petter is not restricting it), then you could easily utilize more than 16 threads on the PI Data Archive to fetch results. Because of the way the .NET task scheduler works, I imagine that is indeed happening: any time a task makes a blocking call, the scheduler will see the task is blocked and kick off another one, so you'll end up with a lot of concurrent operations going. As I mentioned in a previous thread, this isn't always good, since you can hit the max transaction limit on the PI Net Manager subsystem.

                                                 

                                                The bulk calls really shine when there is latency. Slide 7 of the presentation Petter linked shows why: you reduce the number of latency hits when you batch up the calls. If your network has no latency, then you don't really get that benefit. We used DummyNet to test a bunch of latencies and hold them constant while the tests ran.

                                                 

                                                It would be interesting to know what the latency is between Petter's client and server. It would also be interesting to see if he gets better results by increasing the number of threads in the thread pool that handles bulk queries, and it would be interesting to see the results with several different page sizes.

                                                  • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                    PetterBrodin

                                                    Thanks for a detailed reply!

                                                     

                                                    I've tried testing the parallel approach with several different values for MaxDegreeOfParallelism, and here are the results:

                                                     

                                                    Max threads   Average time (ms)
                                                    1             27860.8
                                                    2             18988.2
                                                    3             16467.8
                                                    4             14964.6
                                                    6             13492.6
                                                    8             13221.2
                                                    10            13032.2
                                                    16            12990.4
                                                    24            11878.4
                                                    32            10531.8
                                                    10000         12080

                                                     

                                                    "It would also be interesting to see if he gets better results by increasing the number of threads in the thread pool that handles bulk queries"

                                                     

                                                    Is this a setting in the PI Server, or are you thinking of doing bulk calls in parallel from the client?

                                                     

                                                    "The bulk calls really shine when there is latency."

                                                     Yeah, I understand that's where the main performance boost can be seen, but as I said earlier in the thread, the presentation led me to believe that there at least wouldn't be such a large performance gap between bulk calls and parallel calls.

                                                    • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                      PetterBrodin

                                                       I just ran a few tests with different settings (2, 4, 8, 16, 32) for Archive_BulkQueryThreadNumber, and the differences were small and inconsistent enough that they just looked like random fluctuations.

                                                        • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                          rgilbert

                                                          What is the latency between the client and the server? I agree with what Arnold Woodall and Rhys Kirk said earlier. I talked a little with Ling Fung Mok today as well. When doing the calls in parallel, you do get the benefit of having multiple threads doing the post-processing and executing concurrent queries. The downside is that you have many more RPCs, or out-of-process calls, executing, and you take the latency hit on each one. If your latency is small, then it might not have much of an impact. If the latency is not negligible, then the bulk RPC is better since you are batching your data access calls and reducing the number of RPCs. In your case the payload is quite large, so the packets likely get fragmented. As Arnold mentioned, we only have one thread per bulk query doing the post-processing, while you get many more by doing a Parallel.ForEach loop yourself.

                                                           

                                                          What do you get when you ping the server from the client? If you use something like DummyNet to introduce a 50ms latency, do your results change?

                                                            • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                              PetterBrodin

                                                              In my non-caffeinated state yesterday morning I'd managed to write that the latency is "just a handful of seconds", when I meant milliseconds. Testing with ping, it's between under 1 ms and 3 ms. The network speed is 100 Mbit.

                                                               

                                                              I'll try DummyNet and get back to you.

                                                              • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                                PetterBrodin

                                                                I couldn't get DummyNet to work, but I used NetBalancer to set a 50ms delay on the pinetmgr process. Here are my results, with and without the 50ms latency:

                                                                 

                                                                All times are in milliseconds.

                                                                Run type            | Snapshot, low latency | RecordedValues, low latency | Snapshot, 50ms latency | RecordedValues, 50ms latency
                                                                Bulk, page size 10  | 23                    | 22926                       | 217                    | 82296
                                                                Bulk, page size 20  | 24                    | 20621                       | 216                    | 72985
                                                                Bulk, page size 50  | 27                    | 19616                       | 219                    | 62093
                                                                Bulk, page size 100 | 24                    | 20878                       | 218                    | [-10722] PINET: Timeout on PI RPC or System Call.
                                                                Parallel            | 570                   | 12486                       | 4140                   | 54375
                                                                Serial              | 1014                  | 25542                       | 10160                  | 97165

                                                                 

                                                                I also did another test run with just a month's worth of data, where not all the tags have data, but they're still being queried:

                                                                 

                                                                All times are in milliseconds.

                                                                Run type            | Snapshot, low latency    | RecordedValues, low latency | Snapshot, 50ms latency | RecordedValues, 50ms latency
                                                                Bulk, page size 10  | 24                       | 1150                        | 267                    | 2412
                                                                Bulk, page size 20  | 23                       | 1100                        | 384                    | 2012
                                                                Bulk, page size 50  | 24                       | 1175                        | 378                    | 1900
                                                                Bulk, page size 100 | 60 (one run took 208 ms) | 1055                        | 362                    | 1812
                                                                Parallel            | 572                      | 878                         | 1120                   | 2000
                                                                Serial              | 1015                     | 2032                        | 1646                   | 3400

                                                                 

                                                                So it looks like Snapshot is behaving as you'd expect, with latency having a very large impact, and the amount of data being queried not mattering.

                                                                 

                                                                With RecordedValues, parallel is still as fast as or faster than bulk at 50ms latency.

                                                                 

                                                                Is this in line with what you'd expect to see?

                                                                  • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                                    rgilbert

                                                                    [Attached screenshot: BulkRecValues.png]

                                                                    No, but our benchmark setup is definitely different.

                                                                     

                                                                    The screenshot above shows the results with a 50ms latency and 10Mbit of bandwidth. The bulk portion of the chart shows the results for the different page sizes, from 25 to 1000 (Tag Count).

                                                                    In this test, we are retrieving 3 million events instead of the 6.7 million in your test, and our events are spread across 25,000 tags rather than just 100. So in our test each tag has an average of 120 events being collected, while your test has an average of 67,000 per tag. It also means our parallel implementation makes a lot more RPCs than yours; a lot more.

                                                                     

                                                                    There are a lot of variables that have changed in your test from our benchmark, and obviously our test can't reflect everyone's scenario. Your test makes a total of 100 parallel calls while ours makes 25,000, so our test issues far more RPCs: 25k versus 100. That isn't an exact comparison, though, since many of the RPCs run in parallel. I have noticed that if you leave MaxDegreeOfParallelism unclamped, the variance goes way up with the 25,000 RPCs, so there were some outliers with parallel in that environment.

                                                                     

                                                                    The goal of the bulk calls is to reduce the number of times you incur the latency hit. In most production environments, latency is not zero, so this should be a benefit in many situations. This shows up clearly in our test with a high tag count, but not in your case, where the tag count is lower (fewer RPCs) and the archive density is higher (large payloads that probably get fragmented at the IP layer).

                                                                     

                                                                    I like this discussion, and I'm glad you posted it. It is great to see where one approach works better than others; unfortunately you have to do a lot of experimenting. As Rhys Kirk has pointed out, we are typically asked for a "magic formula" of when to do what, but it just isn't possible. Latency, bandwidth, memory, Disk IOP/S, CPUs / cores, archive density, tag count, etc. just cannot be held constant. When these variables are known, it is easier to do some auto-tuning.

                                                                     

                                                                    In addition we may use your results to investigate further and possibly change the behavior of PIPointList. If your list count is small, we can always just trigger it to use the parallel fallback logic that is already used when the PI Server version is less than 390.16.

                                                          • Re: AF SDK performance, serial vs. parallel vs. bulk
                                                            PetterBrodin

                                                            Since this thread is getting pretty long, and the pieces of the puzzle are spread across different posts, I figured I'd try to summarize what I've learned so far to give an overview for anyone who stumbles across this thread later. The whole thread is worth reading, but this post will try to act as a tl;dr.

                                                             

                                                            Serial

                                                            Let's get this out of the way first. I can't think of a use case where you'd want to use serial calls. At best it will perform the same as bulk calls, at worst it will get its ass kicked by both bulk and parallel calls. This is especially true for cases where you query many points for little data, like when getting snapshots. If you query only one point, it doesn't make a meaningful difference which method you choose, and the more data you need returned, the more time will be spent sending data over the network.

                                                             

                                                            Parallel vs. bulk

                                                            This is not, as I initially thought, a clear-cut case of "bulk will always win". There are four main factors that act in unison to determine whether bulk or parallel calls will perform best:

                                                            • Latency - The higher the latency, the higher the toll each RPC takes on your performance. If your client sits on top of your server, you'll get completely different results than if you're on an oil rig querying a server on land via satellite.
                                                            • Bandwidth - If you exhaust your bandwidth, the difference between call methods will be smaller than if latency is the bottleneck, since a large amount of execution time will be spent sending data over the network.
                                                            • Number of points - More points mean more round trips, and more round trips mean more chances for latency to make a difference.
                                                            • Amount of data - The more data needs to be sent, the more time will be spent getting data from the server. Furthermore, with large amounts of data per point, the network traffic needs to be broken into smaller chunks anyway, which means that latency will play a larger role for bulk calls too.

                                                             

                                                            Lessons learned

                                                            Here are a few things that are worth taking into consideration:

                                                            • If you're using parallel RPCs, querying lots of points has the potential to spawn a huge number of threads, which could have catastrophic amounts of overhead. Try using MaxDegreeOfParallelism to constrain the number of threads being created.
                                                            • If your client runs on top of the server or in other environments with very low latency, parallel is likely to perform better than bulk calls. If I understand it correctly, this depends on the number of threads you spawn in parallel vs. the number of threads available on the server to process your bulk query. This means that, for better or worse, parallel calls have the potential to utilize much more of the server's resources than bulk calls.
                                                            • If you make a bulk call, it might seem tempting to read the results in parallel, but the data is being returned from the server serially, so parallel reading will actually be slightly slower due to the overhead of doing things in parallel (see the sketch after this list).
                                                            • Page size will affect performance, but how large an effect it has depends on the four factors mentioned above.
                                                            • When testing, point initialization has the potential to make the first call significantly slower, so it makes sense to ignore the first run.
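
                                                            For reference, here is a minimal sketch of what such a bulk call with an explicit page size and plain serial enumeration might look like. The page size, time range and query parameters are illustrative only, not the ones used in my tests:

                                                            // Sketch: one bulk RecordedValues call on a PIPointList, consumed with a
                                                            // plain (serial) foreach as the pages stream back. Parameters are illustrative.
                                                            using System;
                                                            using OSIsoft.AF.Asset;
                                                            using OSIsoft.AF.Data;
                                                            using OSIsoft.AF.PI;
                                                            using OSIsoft.AF.Time;

                                                            static class BulkReadSketch
                                                            {
                                                                public static void Read(PIPointList pointList)
                                                                {
                                                                    var timeRange = new AFTimeRange("2013-01-01", "2014-01-01");
                                                                    var paging = new PIPagingConfiguration(PIPageType.TagCount, 50);  // 50 tags per page

                                                                    // One bulk RPC per PI Data Archive in the list; pages are post-processed
                                                                    // in the background and handed to this enumeration as they arrive.
                                                                    foreach (AFValues values in pointList.RecordedValues(
                                                                        timeRange, AFBoundaryType.Inside, null, false, paging))
                                                                    {
                                                                        Console.WriteLine("{0}: {1} events", values.PIPoint.Name, values.Count);
                                                                    }
                                                                }
                                                            }

                                                            Whether a TagCount or EventCount page type works better will again depend on the factors listed above.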

                                                             

                                                            When developing your own client applications, you will have to experiment with your own servers, data and use cases to figure out which approach works the best for you. There's no "one size fits all" approach that will always be best.

                                                            Multiple PI Servers

                                                            All the tests I've done have been run against a single PI Server, since that's the kind of use case we're looking at. It's worth keeping in mind that if you are querying multiple servers for points, the above considerations have to be made for each separate server. If you have a list of 1000 points spread across 5 servers, you need to look at the network performance between the client and each server, how many points are on each server, and how much data you're getting from each point. A single bulk call made on the whole PIPointList will definitely result in the nicest-looking and most maintainable code, but it has the potential to result in sub-optimal performance.
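
                                                            If you do need per-server control, one option (a sketch only; this is not something the SDK does for you) is to split the points by server and then pick a strategy for each group separately:

                                                            // Sketch: group PIPoints by their PI Data Archive so each server can be
                                                            // queried with whichever strategy (bulk, parallel, serial) suits it best.
                                                            using System.Collections.Generic;
                                                            using System.Linq;
                                                            using OSIsoft.AF.PI;

                                                            static class SplitByServerSketch
                                                            {
                                                                public static Dictionary<PIServer, PIPointList> Split(IEnumerable<PIPoint> points)
                                                                {
                                                                    return points
                                                                        .GroupBy(p => p.Server)
                                                                        .ToDictionary(g => g.Key, g => new PIPointList(g));
                                                                }
                                                            }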

                                                             

                                                            In conclusion

                                                            • Don't use serial unless you have a very good reason to do so.
                                                            • The more latency you have, the more bulk calls will shine.
                                                            • The less bandwidth you have, the less parallel vs. bulk will matter.
                                                            • There's no magic bullet! Test and experiment with real-world data.
                                                            • If you use a bulk call, it will probably make sense to read your results serially.
                                                            • If you use parallel calls, you'll probably want to limit the number of threads you spin up.

                                                             

                                                            Thanks to Rhys Kirk, Rick Davin, Mike Zboray, Ryan Gilbert and everyone else who helped out with this! Please correct me if I got something wrong or if there's something that should be added.
