10 Replies Latest reply on Jan 31, 2012 2:41 PM by Asle Frantzen

    Monitoring the ACE Scheduler health status?

    Asle Frantzen

      Hi

       

      Because of miscellaneous problems with ACE at one of our clients, we started restarting the ACE Scheduler every night. Now that the ACE Scheduler has moved to a new server and been upgraded to the latest version, I don't want to do this anymore.

       

      Instead, I want to monitor the activity level / health status of the service. Ideally I would want something like the I/O Rate tag we can use for interfaces, and just monitor that throughout the day.

       

      Anyone got a solution for this?

       

      (I know it is possible to restart the ACE Windows service upon failure, or even to start another program or restart the computer. But in the problems I've seen before, the service itself didn't fail; it just stopped processing data.)

        • Re: Monitoring the ACE Scheduler health status?
          mhamel

          There are many solutions to your problem. First of all, PI ACE has a series of performance counters that keep track of the overall execution of calculations; they are listed below.

           

          - PI Advanced Computing Engine\Last Calculation ExeTime
          - PI Advanced Computing Engine\Number of aborted calculations
          - PI Advanced Computing Engine\Number of calculations executed
          - PI Advanced Computing Engine\Number of calculations on queue
          - PI Advanced Computing Engine\Number of calculations with errors
          - PI Advanced Computing Engine\Number of skipped calculations
          - PI Advanced Computing Engine\Time to complete calculation
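
          As a minimal sketch of how these counters could be turned into a health check, the Python function below flags a stall when the executed-calculations count has not increased at any point within a time window. It assumes the counter is cumulative and is being sampled into (timestamp, value) pairs, for example by the PI Performance Monitor interface writing to a PI tag; the function name and the sampling setup are illustrative, not part of PI ACE.

```python
from datetime import datetime, timedelta


def executions_stalled(samples, window, now):
    """Return True if the executed-calculations count did not increase
    at any point during the last `window`.

    samples: (timestamp, value) pairs for the 'Number of calculations executed'
    counter of the 'PI Advanced Computing Engine' category (_Total instance),
    assumed to be a cumulative count sampled periodically.
    """
    recent = sorted((t, v) for t, v in samples if now - window <= t <= now)
    values = [v for _, v in recent]
    if len(values) < 2:
        # Too few samples to judge; a missing counter is itself suspicious.
        return True
    # Look for any step up between consecutive samples, so that a counter
    # reset (e.g. after a scheduler restart) is not mistaken for a stall.
    return not any(b > a for a, b in zip(values, values[1:]))


t0 = datetime(2012, 1, 31, 12, 0)
busy = [(t0 + timedelta(minutes=i), 100 + i) for i in range(10)]
idle = [(t0 + timedelta(minutes=i), 100) for i in range(10)]
print(executions_stalled(busy, timedelta(minutes=5), t0 + timedelta(minutes=9)))  # False
print(executions_stalled(idle, timedelta(minutes=5), t0 + timedelta(minutes=9)))  # True
```

          Keep in mind that a clock-scheduled calculation keeps this total rising even if naturally triggered calculations have stopped, so this check alone cannot detect that case.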

           

          You can also monitor the % Processor Time, Private Bytes, Virtual Bytes and Working Set performance counters of the PIACENetScheduler process on the PI ACE Server machine. However, if the problem lies in one of the PI ACE calculations, it may not show up clearly in these counters.

           

          Each executable/library defined in a PI ACE system spawns an instance of PIACEClassLibraryHost.exe when its calculations are started. One instance of the host may instantiate many instances of the classes contained in the executable/DLL file; each context is an instance of its class. PIACENetScheduler is responsible for spawning the host to load the DLL, but the host itself takes care of creating the class instances. PIACENetScheduler actually does very little: it does not monitor the host instances. If an instance of the host crashes, all calculations associated with that host stop.

           

          To detect this failure scenario, you can monitor the process hosting your calculation with counters such as Elapsed Time and Virtual Bytes:

           

          - Process(PIACEClassLibraryHost)\Elapsed Time
          - Process(PIACEClassLibraryHost)\Virtual Bytes

           

          Both counters stop reporting values if the process is terminated. It is difficult to tell which calculation has failed: you will need to correlate the process IDs reported in the PI ACE logs with the process IDs of the remaining PIACEClassLibraryHost instances, and restart the failed executable.
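
          The correlation step can be sketched as a set difference. This is a Python illustration only: `find_dead_hosts` is a hypothetical helper, and it assumes you have already extracted a module-to-PID mapping from the PI ACE log and collected the PIDs of the running PIACEClassLibraryHost processes.

```python
def find_dead_hosts(logged_pids, running_pids):
    """Return the modules whose PIACEClassLibraryHost process is gone.

    logged_pids:  hypothetical mapping of module name -> host PID, as you
                  would extract it from the PI ACE log (log parsing not shown).
    running_pids: PIDs of the PIACEClassLibraryHost processes still alive.
    """
    alive = set(running_pids)
    return sorted(module for module, pid in logged_pids.items() if pid not in alive)


# Example: the host for ModuleB has crashed, so its calculations have stopped.
print(find_dead_hosts({"ModuleA": 1204, "ModuleB": 2316, "ModuleC": 3408}, [1204, 3408]))
# ['ModuleB']
```

          The running PIDs could be collected with the third-party psutil package, e.g. `[p.pid for p in psutil.process_iter(["name"]) if p.info["name"] == "PIACEClassLibraryHost.exe"]`. Windows can reuse a PID, so a match is a strong hint rather than proof that the original host is still alive.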

           

          Another option is to create custom performance counters inside your calculations, one of which acts as a watchdog: it is incremented by 1 each time the calculation is called. You can then monitor all your calculations with the performance monitor interface, writing their status into PI tags. This can help identify the source or sources of the problems so you can address them.
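
          The watchdog idea boils down to "did the counter move since the last poll?". The sketch below models it in Python under stated assumptions: the `Watchdog` class is a hypothetical stand-in for a custom performance counter that a calculation would increment on every call (in PI ACE the calculation itself is .NET code), and `stale_calculations` compares two polls of those counters.

```python
class Watchdog:
    """Hypothetical per-calculation heartbeat counter.

    In PI ACE this would be a custom Windows performance counter that the
    calculation increments by 1 on every call; here it is a plain integer.
    """

    def __init__(self):
        self.count = 0

    def tick(self):
        self.count += 1


def stale_calculations(previous, current):
    """Compare two polls (name -> watchdog count) and return the names whose
    counter did not advance, including counters that disappeared."""
    return sorted(
        name for name, old in previous.items() if current.get(name, old) <= old
    )


wd = Watchdog()
for _ in range(3):
    wd.tick()  # three calls of the calculation
print(wd.count)  # 3
print(stale_calculations({"CalcA": 5, "CalcB": 7, "CalcC": 1}, {"CalcA": 6, "CalcB": 7}))
# ['CalcB', 'CalcC']
```

          A counter that does not advance is only suspicious if the calculation was expected to run in that interval; a rarely triggered calculation will legitimately look stale between triggers, so the poll interval has to match each calculation's expected execution rate.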

           

          Finally, there is a way to programmatically start and stop a module, a class or a context whenever necessary, but for a long-term solution it is better to understand the cause(s) of the crashes. This approach is more like a Band-Aid on a scrape.

          • Re: Monitoring the ACE Scheduler health status?
            andreas

            Asle, have you had a look at the performance counters? There is one for the last calculation ExeTime and one for the number of calculations executed. Monitoring those with the _Total instance should give you some indication of the health.

             

            P.S. Mathieu was faster.

              • Re: Monitoring the ACE Scheduler health status?
                Asle Frantzen

                Thanks guys.

                 

                I'm already logging some of the performance counters for ACE. I haven't spent much time on it, but I'm sure I can cook something up.

                 

                The problem I encountered before was most likely that ACE didn't sign up for eventpipes after a network problem had occurred, even though the log clearly stated "connection to pi server is no longer in error" when the connection was re-established. As a result, ACE didn't react to trigger tags getting new events.

                 

                Let's add a new scenario here: if this were to happen again, it would affect 2 of the 3 ACE calcs running (the naturally triggered ones), but the third one is clock scheduled. Wouldn't there be a situation where the clock-scheduled ACE calc continues running while the triggered calcs are dead?

                 

                This means that even if I monitored the number of executions, it would be difficult to detect that the triggering was off. (The two triggered ACE calcs might not get new trigger values more than once per month.)

                  • Re: Monitoring the ACE Scheduler health status?
                    Lonnie Bowling

                    Hi Asle,

                     

                    I have a similar issue with one of my customer's ACE installs. We set up PI Notifications, and I monitored the ACE calculation by comparing the timestamp of a value used in the calculation (Meter_Reading in this case) with the timestamp of the output value (Usage):

                     

                    Round(((PrevEvent('Meter_Reading','*') - PrevEvent('Usage','*'))/60))   >  'ACETimeOutSP'

                     

                    Note that I used the Round function; otherwise it would not work as a notification trigger. I suspect it has to do with data typing, and that Round forces the result from a time value to a numeric value. ACETimeOutSP is an AF attribute with a static value, in minutes. In this case the calculation is triggered by a change in value.
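
                    For readers less familiar with PE syntax, here is the same condition restated in Python as a sketch. It assumes, as the expression implies, that PrevEvent(tag, '*') returns the timestamp of the tag's latest event and that the difference of two timestamps is in seconds; the function name is hypothetical. Python's round() rounds halves to even, which may differ from PE's Round exactly at .5.

```python
from datetime import datetime


def ace_output_lagging(last_input_time, last_output_time, timeout_minutes):
    """Python rendering of the notification condition
        Round((PrevEvent('Meter_Reading','*') - PrevEvent('Usage','*'))/60) > 'ACETimeOutSP'
    i.e. (newest input timestamp - newest output timestamp), converted from
    seconds to minutes, rounded, and compared with the setpoint in minutes."""
    lag_minutes = round((last_input_time - last_output_time).total_seconds() / 60)
    return lag_minutes > timeout_minutes


print(ace_output_lagging(datetime(2012, 1, 31, 12, 30), datetime(2012, 1, 31, 12, 0), 15))  # True
print(ace_output_lagging(datetime(2012, 1, 31, 12, 5), datetime(2012, 1, 31, 12, 0), 15))  # False
```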

                     

                    Lonnie

                      • Re: Monitoring the ACE Scheduler health status?
                        hanyong

                        Nice suggestion, Lonnie.

                         

                        There is also a knowledge base article on our techsupport website that talks a bit about how to detect situations where calculation results are not updated. It can be found here, in the last part of the article. The article covers the performance monitoring of PI ACE in much more detail.

                          • Re: Monitoring the ACE Scheduler health status?
                            Asle Frantzen

                            Han Yong

                            There is also a knowledge base article on our techsupport website....

                             

                            I see this was created yesterday, but I've seen a lot of the information before. Is it just a compilation of older articles updated for ACE 2010?

                             

                            Anyway, the article discusses monitoring certain information for each of the PIACEClassLibraryHost executables. If we turn this around and go to the PI server, I see there are performance counters under the "PI Update-Consumer" category that monitor certain parameters for the sign-ups from these PIACEClassLibraryHost executables.

                             

                            Maybe there's information here that I could combine with other factors into a rather complicated PI Notification setup. I have up to 100 trigger tags for some of my calculation contexts, so I would have to add these as factors to the notification logic as well. But there are still challenges: even when a trigger tag activates my notification, ACE requires some time to calculate and write the output. Hmm, I don't know...
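
                            One way to handle the many trigger tags and the calculation delay, sketched in Python under assumptions: only the newest trigger event matters, an alarm is raised only once that event is older than a grace period (an assumed allowance for ACE to calculate and write), and the output is assumed to be written at the trigger's timestamp or later. `missed_trigger` is a hypothetical name.

```python
from datetime import datetime, timedelta


def missed_trigger(trigger_times, last_output_time, now, grace):
    """Flag a context whose output was not updated after its newest trigger.

    trigger_times: last event timestamp of each trigger tag (there may be ~100).
    grace: assumed time allowed for ACE to calculate and write the output.
    Assumes the output is written at the trigger's timestamp (or later).
    """
    latest = max(trigger_times)
    return now - latest > grace and last_output_time < latest


trig = [datetime(2012, 1, 31, 9, 0), datetime(2012, 1, 31, 10, 20), datetime(2012, 1, 31, 10, 0)]
grace = timedelta(minutes=5)
print(missed_trigger(trig, datetime(2012, 1, 31, 10, 0), datetime(2012, 1, 31, 10, 30), grace))  # True
print(missed_trigger(trig, datetime(2012, 1, 31, 10, 20), datetime(2012, 1, 31, 10, 30), grace))  # False
```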

                          • Re: Monitoring the ACE Scheduler health status?
                            Asle Frantzen

                            Lonnie Bowling

                            Hi Asle,

                             

                            I have a similar issue with one of my customer's ACE installs.  We setup PI Notifications and I monitored the ACE calculation by comparing the time stamp of a value used in the calculation (Meter_Reading in this case) with the time stamp of the output value (Usage):

                             

                            Hey Lonnie

                             

                            Sorry for the late reply.

                             

                            I've been thinking about something similar myself, but I haven't landed on a solution yet. I was thinking of just scheduling a simple ACE calc periodically and then monitoring the number of events per time period. But as I tried to explain in my last post, that wouldn't really cover the naturally scheduled calcs very well, since one problem I suspect was present was that ACE didn't sign up for eventpipes after network errors caused it to disconnect from the PI server.

                             

                            Even though the logs said that ACE had successfully reestablished the connection to the PI server, nothing happened when trigger tags had activity.

                             

                            I've come up with this simple setup which might be a solution:

                            1. Time scheduled ACE calc, writes random value to output tag every XX minutes
                            2. ACE calc triggered by the output tag from the other calc, writes random value to an output tag of its own

                            This setup should probably work: the first calc covers whether the ACE Scheduler is up and running, and the second covers my other concern of not being properly signed up for eventpipes after a network problem. The PI Notification would then check that 1. has events within the last XX minutes, and that the last event time for 2. is the same as for 1.
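
                            The two checks for the PI Notification can be written out as a Python sketch. The names and the `grace` allowance are assumptions; the sketch also assumes calc 2 writes its output at the timestamp of the triggering event, so in a healthy system both output tags share the same last-event time, except for the short moment after calc 1 fires and before calc 2 has run.

```python
from datetime import datetime, timedelta


def heartbeat_status(hb1_last, hb2_last, now, period, grace):
    """Evaluate the two-calculation heartbeat (hypothetical helper).

    hb1_last: last event time of calc 1's output (clock scheduled every `period`).
    hb2_last: last event time of calc 2's output (triggered by calc 1's output tag).
    grace:    assumed allowance for scheduling jitter and for calc 2 to run.
    Returns (scheduler_alive, trigger_alive).
    """
    scheduler_alive = now - hb1_last <= period + grace
    # Assumes calc 2 writes at the triggering event's timestamp, so a healthy
    # pair has identical last-event times (except just after calc 1 fires).
    trigger_alive = hb2_last == hb1_last or now - hb1_last <= grace
    return scheduler_alive, trigger_alive


now = datetime(2012, 1, 31, 12, 0)
period, grace = timedelta(minutes=10), timedelta(minutes=2)
print(heartbeat_status(datetime(2012, 1, 31, 11, 55), datetime(2012, 1, 31, 11, 55), now, period, grace))  # (True, True)
print(heartbeat_status(datetime(2012, 1, 31, 11, 55), datetime(2012, 1, 31, 11, 45), now, period, grace))  # (True, False)
```

                            Returning the two results separately tells you whether the scheduler itself has stopped (check 1) or only the trigger sign-ups (check 2).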

                              • Re: Monitoring the ACE Scheduler health status?
                                Lonnie Bowling

                                Asle Frantzen @ Amitec

                                I've come up with this simple setup which might be a solution:

                                1. Time scheduled ACE calc, writes random value to output tag every XX minutes
                                2. ACE calc triggered by the output tag from the other calc, writes random value to an output tag of its own

                                This setup probably works, as the first one covers if the ACE Scheduler is up and running - and the second one covers my other concern of not being properly signed up for eventpipes after a network problem. The PI Notification would then make sure 1. has events within the last XX minutes, and also that the last event time for 2. is the same as for 1.

                                 

                                Hi Asle,

                                 

                                I like this idea; I think it should work. One of my concerns, though, is why naturally triggered calculations stop working in the first place. You believe that tags are not signed up for eventpipes. Were you able to verify this? How can we be sure all triggers for all calculations are working? I have also seen similar issues with the first release of PI Notifications, where snapshot values were not triggering notifications. Everything looked fine, but nothing happened. I have not seen it happen with the new 2010 release so far.

                                 

                                Lonnie

                                  • Re: Monitoring the ACE Scheduler health status?
                                    hanyong

                                    I agree with Lonnie, that looks like a good idea.

                                     

                                    @Lonnie: The way to verify this is to set up a health-monitoring mechanism like the ones you and Asle have been suggesting, and to check the event queue sign-ups on the PI Update Manager if something amiss is detected. This is somewhat along the lines of Asle's idea of looking at the PI Update Manager performance counters.

                                     

                                    @Asle: The article has existed for a while as an internal draft. Right before this, I was talking to Noah (the author of the article) about it, and someone mentioned in the email exchange that the article had been published; I didn't realize it had only just been published.

                                     

                                    Anyway, I would say that the article's contents are an accumulation of knowledge about how to monitor the performance of PI ACE and troubleshoot it across its different versions.

                                    • Re: Monitoring the ACE Scheduler health status?
                                      Asle Frantzen

                                      Lonnie Bowling

                                      You believe that tags are not signed up for eventpipes.  Was there a way you were able to verify this?  How can we be sure all triggers for all calculations are working?

                                       

                                      I haven't verified this, and I haven't seen this problem after moving ACE to better (virtual) hardware and also upgrading to ACE 2010 R2 SP1.

                                       

                                      Hopefully this issue only affected 2.1.32 (if it actually was an issue), but since I know that ACE once reported that it had reconnected to PI while triggered calculations did not react, I would like to monitor this to be on the safe side.