I've only got questions for you. First, are you using Parallel LINQ, System.Threading.Tasks.Parallel, or your own custom threading? Second, when you say that a single thread is faster, are you issuing a bulk call for the 5 attributes (probably not available in 2.5) or processing them serially?
Here are some vague and extremely general comments that may be of little truth or help to you ...
Sorry, but parallel processing for 2, 5, 10, or 20 items isn't strongly justified. Creating and managing threads carries a cost that may not be warranted by your application.
I'm under the impression that a PI server can handle a maximum of 8 concurrent RPC calls (I don't know if that's true, or whether it's 8 max for everyone or 8 max per user). So when I use System.Threading.Tasks.Parallel, I also set ParallelOptions.MaxDegreeOfParallelism to around 3-4 threads, depending on what I am doing.
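To make the idea concrete outside of .NET: capping the degree of parallelism just means bounding how many calls are in flight at once, regardless of how many items you have. Here is a minimal sketch in Python using a bounded thread pool; `fetch_plot_values` and the attribute names are stand-ins, not real AF SDK calls.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_plot_values(attribute):
    # Stand-in for a per-attribute RPC to the server;
    # returns a dummy result here instead of real data.
    return f"values for {attribute}"

attributes = ["Temp", "Pressure", "Flow", "Level", "pH"]

# Cap in-flight calls at 4 no matter how many attributes there are,
# mirroring what ParallelOptions.MaxDegreeOfParallelism does in .NET.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_plot_values, attributes))

print(results)
```

`pool.map` preserves input order, so the results line up with the attribute list even though calls overlap.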
Retrieving PlotValues shouldn't be too big of a burden per attribute. Sure, one attribute may only have 2 values, but even an attribute with a large number of values should return a much smaller subset using PlotValues. Therefore, one can almost, sort of, kind of consider the processing requirements per attribute to be relatively consistent.
That said, if you ever want to process a lot of things in parallel, with a count in the 100's or 1000's, and the processing of each individual thing is fairly consistent with any other thing in your list, then straight-up parallel processing of each individual thing can be quite inefficient, since each thing incurs the overhead of creating and managing a thread. If you feel that is your case, I suggest you look into System.Collections.Concurrent.Partitioner. It divides your collection of attributes into partitions or ranges. You then parallel process a range in the outer loop but serially process each attribute in an inner loop. This is more efficient because maybe only 4 threads will ever be created, instead of, say, 1000.
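The outer-parallel/inner-serial shape that Partitioner gives you in .NET can be sketched in any language. Below is a hand-rolled Python version of the same idea; `process_attribute` is a hypothetical per-item workload, and the chunking is a simplified stand-in for what Partitioner.Create does for you automatically.

```python
from concurrent.futures import ThreadPoolExecutor

def process_attribute(attr):
    # Placeholder for the per-item work (e.g., a PlotValues-style fetch).
    return attr * 2

def process_partitioned(items, workers=4):
    # Split the list into `workers` contiguous ranges; each worker
    # serially processes its whole range, so only `workers` threads
    # run instead of one thread per item.
    size = -(-len(items) // workers)  # ceiling division
    ranges = [items[i:i + size] for i in range(0, len(items), size)]

    def process_range(chunk):
        # Inner loop: plain serial processing within one partition.
        return [process_attribute(a) for a in chunk]

    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk_result in pool.map(process_range, ranges):
            results.extend(chunk_result)
    return results

print(process_partitioned(list(range(10))))
```

With 1000 items and 4 workers you pay thread overhead 4 times, not 1000 times, which is the whole point of the partitioning approach.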
Again, that depends on the application. If you are fetching full archives and not just PlotValues, then this may not be a good fit. I have instances where one tag may have only 10 values over a given time period and the very next tag in my list has 500,000. That would be a poor fit for partitioning.
I don't know if this will add anything to the solution, but why use threads when you obviously don't need them? The real question, though: have you been able to find a break-even point, when you scale this up, where the multithreading is faster?
Really like Rick's suggestion of using the Partitioner class!
Rick, thanks for the questions. I've been chained to my desk in a dark room in an attempt to solve this.
The programming was sound; there were some areas for ongoing optimization, but it should have functioned without such delays. In fact, as it turned out, in some environments it did function as expected, but in others (the important, production ones) horrendous latency was being introduced.
As a side note, the subsystems in the PI Server have a configurable number of threads to serve RPCs; the default is 8 for piarchss.
I won't bore you with the long list of checks and tests we had to go through, but it ultimately came down to Nagle's Algorithm and delayed ACK in the TCP stack. It just so happens that the particular data pattern being requested was just right to trigger the algorithms to apply their delayed-ACK timer. The repetition of data requests from the multiple threads was also a trigger point; i.e., a single thread did not trigger the delays, because data was not being requested repeatedly in parallel.
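For anyone who wants to see the knob involved: Nagle's algorithm can be disabled per socket with the standard TCP_NODELAY option, shown below in Python as a general illustration. Note this is not necessarily the fix that was applied here (delayed-ACK behavior is often tuned at the OS level instead), just the commonly cited way to stop small writes from waiting on a delayed ACK.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# TCP_NODELAY disables Nagle's algorithm, so small writes go out
# immediately instead of being held back until the previous segment
# is ACKed (which a delayed-ACK timer on the other end can stall).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print(nodelay)
sock.close()
```

Whether this is appropriate depends on the traffic pattern; Nagle exists to keep chatty small-packet traffic from flooding the network, so disabling it is a trade-off, not a free win.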
In case anyone wants a read, this has the same fix that we ultimately ended up applying:
Have to mention my new best buddies from TechSupport and the Development team...Arnold, Eddy, Ryan and Alexander!
Nice job. I'm glad you guys figured it out. Here I thought we missed something in the AF SDK test suites.
Guess I should keep these guys around, they seem useful