Load balancing AF for both High Availability (HA) and Performance with the PI Integrator


    At Peak we currently use Windows NLB clustering to achieve HA of AF services.

    https://technet.microsoft.com/en-us/library/cc725691(v=ws.11).aspx

     

    We have 2 AF nodes configured as members of a cluster with the same cluster IP address/hostname. In addition to providing HA, we also wanted to improve AF performance by truly load balancing application sessions to the AF cluster so that the workload is serviced by both nodes simultaneously. This was particularly important for our ESRI-based visualization solution, which generates a very high transaction load against AF via the PI Integrator for ESRI.

     

    In this specific case the NLB cluster for AF is depicted here:

    p1.jpg

    Problem: Load was not being balanced across both nodes
    We observed that even though we have 2 AF nodes in the cluster, all of the connections from the PI Integrator were always being directed only to node 03 instead of being distributed across both nodes. Getting the behavior we wanted required resolving two problems.

     

    Problem 1:  Incorrect NLB node “affinity” setting

    Reference: https://technet.microsoft.com/en-us/library/cc771709(v=ws.11).aspx

     

    Originally the NLB node affinity setting had been set to “single.” The “single” option specifies that NLB should direct multiple requests from the same client IP address to the same cluster host. This was not the desired behavior as it would force all connections from the PI Integrator to a single node. The solution to this problem was to change the affinity setting to “none.” The “none” option specifies that multiple connections from the same client IP address can be handled by different cluster hosts (there is no client affinity).
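
    To make the difference concrete, here is a rough Python sketch of the idea, as noted in the TechNet affinity reference above. It is a simplified model, not Microsoft's actual NLB hashing, and the node names and IP address are made up: with "single" affinity only the client IP feeds the hash, so a single chatty client like the PI Integrator is pinned to one node, while "none" also hashes the source port, so each new connection from that client can land on a different node.

        import hashlib

        NODES = ["afnode03", "afnode04"]   # hypothetical names for the two AF cluster members

        def pick_node(src_ip, src_port, affinity):
            # "single": hash only the client IP, so one client always lands on one node.
            # "none":   hash the client IP and source port, so each new connection from
            #           the same client can land on a different node.
            key = src_ip if affinity == "single" else f"{src_ip}:{src_port}"
            digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return NODES[digest % len(NODES)]

        # The PI Integrator opens many connections from a single IP address:
        for port in range(50000, 50006):
            print("single ->", pick_node("10.0.0.25", port, "single"),
                  "| none ->", pick_node("10.0.0.25", port, "none"))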

     

    Using PI System Explorer before and after the change, I was able to verify that prior to the change all of my PI System Explorer sessions connected only to node 03, while after the change they were distributed across both the 03 and 04 nodes.

     

    Problem 2: Connections from the PI Integrator continued to go only to node 03, even after restarting the PI Integrator instance.

    I further confirmed this behavior by removing the 03 node, the one the Integrator was always connecting to, from the NLB cluster. Even then, Integrator connections still went only to the 03 node. It was as though the Integrator was ignoring NLB entirely, even though my sessions from PI System Explorer were now being handled by NLB as desired.

     

    Problem 2 Analysis and Resolution:

    The PI Integrator uses the PI AF Client stack as its means to connect to an instance of AF on port 5457.

    https://techsupport.osisoft.com/Products/PI-Integrators/PI-Integrator-for-Esri-ArcGIS/System-Requirements
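
    As a quick sanity check, a short Python sketch like the one below can confirm that port 5457 is reachable through the cluster name and on each node directly. Only "af.vancouver.tems.int" comes from our environment; the individual node host names shown are placeholders.

        import socket

        AF_PORT = 5457
        HOSTS = [
            "af.vancouver.tems.int",        # NLB cluster name
            "afnode03.vancouver.tems.int",  # placeholder for the 03 node
            "afnode04.vancouver.tems.int",  # placeholder for the 04 node
        ]

        for host in HOSTS:
            try:
                with socket.create_connection((host, AF_PORT), timeout=5):
                    print(f"{host}:{AF_PORT} reachable")
            except OSError as err:
                print(f"{host}:{AF_PORT} NOT reachable ({err})")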

     

    On the PI Integrator server I looked at the registry key in which the connection configuration of the PI AF Client is stored. It showed only the AF NLB cluster name of “af.vancouver.tems.int”, which is correct and what I expected to find, as opposed to the hostname of the 03 node, which might have explained the behavior.

    p2.jpg

    NOTE: the PISystem name attribute in the registry comes from the Name: field of the PI AF Server Properties. It does not come from the Host: field.

    zz.jpg
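
    For anyone who wants to script this check instead of browsing regedit, here is a rough Python sketch that walks the PI client registry tree and prints every value so the connection entries stand out. It assumes the settings live under HKLM\SOFTWARE\PISystem; adjust the root key to match whatever key you see in p2.jpg for your AF Client version.

        import winreg

        ROOT = r"SOFTWARE\PISystem"   # assumed root for PI client registry settings

        def dump(path):
            try:
                key = winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path)
            except OSError:
                return
            with key:
                subkeys, values, _ = winreg.QueryInfoKey(key)
                for i in range(values):                       # values at this level,
                    name, data, _ = winreg.EnumValue(key, i)  # e.g. the PISystem Name entry
                    print(f"{path}\\{name} = {data}")
                for i in range(subkeys):                      # recurse into subkeys
                    dump(path + "\\" + winreg.EnumKey(key, i))

        dump(ROOT)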

     

    By starting a PI System Explorer session on the PI Integrator server, I was able to examine the connection configuration of the PI AF Client stack in the Explorer UI as opposed to the registry. I found this:

    p3.jpg

    This would explain why, even though the Name: property was the cluster name, the connections from the Integrator were all being forced to node 03.

    • HOWEVER... THIS IS WHERE IT GETS WEIRD!... (remember the Host: property value appears nowhere in the registry. It is stored somewhere else???)

     

    CONFIG CHANGE 1:

    I then changed the properties to this:

    p4.jpg

    Just for good measure I also rebooted the Integrator server. After the server came back up and began to reconnect to AF, it was still connecting only to node 03, seemingly ignoring the NLB configuration.

     

    CONFIG CHANGE 2:

    I then changed the properties to this and restarted the Integrator:

    p5.jpg

    After the server came back up and began to reconnect to AF, it was now connecting only to node 04, as would be expected given the Host: property setting.

     

    CONFIG CHANGE 3:

    Finally, I changed the properties back to this and restarted the Integrator:

    p6.jpg

     

    NOW, FINALLY, the connections from the Integrator are being distributed across both the 03 and 04 nodes in a roughly round-robin manner.
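
    To put a number on that distribution, a sketch like the following can be run on each AF node to count the established AF (port 5457) connections coming from the Integrator server; comparing the counts on node 03 and node 04 shows how evenly the sessions are being spread. The Integrator IP below is a placeholder.

        import subprocess

        AF_PORT = ":5457"
        INTEGRATOR_IP = "10.0.0.25"   # placeholder for the PI Integrator server's IP

        netstat = subprocess.run(["netstat", "-n"], capture_output=True, text=True).stdout
        count = sum(1 for line in netstat.splitlines()
                    if AF_PORT in line and INTEGRATOR_IP in line and "ESTABLISHED" in line)
        print(f"Established AF connections from {INTEGRATOR_IP}: {count}")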

     

    To Summarize: The client configurations used in CONFIG CHANGE 1 and CONFIG CHANGE 3 were identical, but connectivity from the PI Integrator to AF did not behave as expected until CONFIG CHANGE 2 was performed as an intermediate step.

     

    Observations and Hypothesis

    It seems to me that some other factor was in play here. It was only after forcing the SDK to connect to both nodes in the cluster by alternating the properties, and then finally switching back to just the cluster name, that the SDK directed session connections to the cluster name “af.vancouver.tems.int” instead of bypassing the cluster in favor of a specific node.

     

    This is just me hypothesizing. I do not pretend to know all the intricacies of the SDK stacks, how they interoperate with the services, or what other mechanisms beyond those I have described here may be involved. There may be other possible explanations for what I observed, but here is one hypothesis:

     

    Running AF in a Collective has now been deprecated in favor of load balancing or Windows Clustering. It makes me wonder if logic in the SDK still expects a Collective as a means to discover available AF nodes for HA. 

     

    Outcomes:

    • We now have HA of AF services for the PI Integrator. In the original state, had the 03 node failed, the Integrator would not have connected to the surviving 04 node and we would have had an application outage.
    • Load balancing the PI Integrator connections across the 2 AF nodes has improved transactional performance somewhere in the neighborhood of 35-45 percent.

     

    Regards, Dayna.