5 Replies Latest reply on Jun 16, 2017 1:57 PM by acote

    Monitoring SQL deadlock

    acote

      Hello PISquare,

       

      I have multiple real-time applications (mainly C# Windows Services) built over AF and last Sunday, they all went down because of the "OSIsoft AF Maintenance" SQL job on the PIFD database that reached a deadlock state. I will take care of this one but this remind me that we aint fully protected against this kind of global failure and what I would like to know is:

       

      • Is there a way to monitor deadlocks on PIFD?
      • Is there a AFSDK class that monitor AF connection to SQL database?
        • If not, have you ever implemented some kind of watchdog that would continuously check the connection state to the SQL database.
      • Plan C, I thought of using the System.TimeoutException I received when playing with EFs (SetEndtime(), CheckIn(), etc.) but I don't like this as it require to reach a long timeout and I need quick response time.
        • Ex: System.TimeoutException: This request operation sent to net.tcp://piafserver.{domain}:5457/AFServer/Service did not receive a reply within the configured timeout (00:05:00). The time allotted to this operation may have been a portion of a longer timeout.  This may be because the service is still processing the operation or because the service was unable to send a reply message.  Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client.

       

      My objective is have my apps know when they loose comm with the SQL server / database and act accordingly (Generate alarm, disable processing, send email, etc)

       

      Thanks for your help

       

      Alex

        • Re: Monitoring SQL deadlock
          gachen

          Hi Alex,

           

          I believe you should be able to monitor for SQL deadlocks using the included performance counters with MS SQL. Unfortunately, you probably won't get granularity to see specifically whether the deadlock is with PIFD. You could also look at the Health performance counter included with the PI AF service. Both of these can be monitored for changes with a perfmon interface, or some other perfmon counter monitoring application.

          • Re: Monitoring SQL deadlock
            scheung

            You mention deadlock, but also the AFSDK timeout. Is this a deadlock or a severe blocking issue? A deadlock is what happens when tasks are permanently blocking each other. In this situation the SQL engine picks a victim and terminates it. This is different from a severe blocking issue where tasks wait behind another long running transaction. Can you be more specific about the behaviors/messages/errors you saw?

             

            We try to prevent deadlocks, but since this isn't 100% foolproof, the PI AF Server will retry deadlock victims (as defined by deadlockRetryCount, default = 10). If you are seeing recurring deadlocks then you may want to generate a deadlock graph to get more information. One approach to get this information is with SQL Extended Events. There are quite a few resources online that help explain how to set this up (How to monitor deadlock using extended events in SQL Server 2008 and later – Microsoft SQL Server Tips & Tricks ).

             

            Based on your opening post, I suspect what is happening is not deadlocking, but severe blocking. Specifically, your AF task was stuck waiting behind something the maintenance job was doing and the client timed out waiting for a response from the server. You will want to take a closer look at the blocking transaction. Tools to do this include things like sp_who2, sp_whoisactive, and the All Blocking Transactions database report. Do these implicate the maintenance job? If so, which part of the job? How long does it take the blocker to complete (this might be tracked with additional event tracing)? In situations where you have two heavy workloads butting heads with each other it may also be worthwhile to see if you can offset their timing (typically, the maintenance job is scheduled during periods of inactivity).

             

            For recurring problems, you may want to open a case with OSIsoft Technical Support; supply your version of PI AF Server, version and edition of SQL Server, and any supporting information (stuff like description of behavior/messages/errors, deadlock graphs, SQL event tracing, etc.).

            1 of 1 people found this helpful
              • Re: Monitoring SQL deadlock
                acote

                Thanks for your replies,

                 

                I agree it's more likely to be a severe blocking issue directly related to the AF Maintenance Job being called when my other apps were actively doing Event Frames related transaction... the PIFD is installed on an AlwaysOn cluster where reside +20 other application database and the DBA is already monitoring / managing deadlocks and transaction log growth.

                The thing is my apps handle multiple 24/7 process units in multiple factories, there is simply no timeframe where I can be assured to avoid conflicts... To what I have seen, this AF Maintenance Job, mostly the REINDEX part, needs to run at least every month or the system becomes slow to non-responsive. Hopefully the next version will improve / fix this performance issue.

                 

                You were asking about the system behavior where this happened. Here is a screenshot of the RPC metrics when they noticed me about the issue (3h after the event)

                RPC Metrics.png

                 

                Alex

                  • Re: Monitoring SQL deadlock
                    scheung

                    The RPC deltas per call looks pretty bad for several RPCs. If this reoccurs we should take a closer look with sp_who2 or sp_whoisactive to get more information on what is slow and why it is slow.

                     

                    Since the AF SQL database resides within an AlwaysOn Availability Group, you should be running Enterprise Edition of SQL Server 2012 or later. Assuming you are running AF 2015 (2.7) or later, this suggests the maintenance job should be doing an online index operation. This reduces the need for long-term table locks, but may still cause blocking during finalization. Use of the aforementioned tools might help identify this.

                     

                    What frequency do you run the maintenance job?

                      • Re: Monitoring SQL deadlock
                        acote

                        Hi Sam,

                         

                        The Maintenance Job was scheduled once a week and it is now fully disabled because the problem mentioned above happened once more last week. it seems that my c# windows services conflicts with the re indexing job and everything related to EF gets blocked "forever" or until I stop the job manually. Unfortunately, this issue occurred both time in production and there is no time to investigate, only to resolve and apologize ☺


                        Any idea if future version of AF will avoid the maintenance job?

                         

                        Thanks

                         

                        Alex