I believe you should be able to monitor for SQL deadlocks using the included performance counters with MS SQL. Unfortunately, you probably won't get granularity to see specifically whether the deadlock is with PIFD. You could also look at the Health performance counter included with the PI AF service. Both of these can be monitored for changes with a perfmon interface, or some other perfmon counter monitoring application.
1 of 1 people found this helpful
You mention deadlock, but also the AFSDK timeout. Is this a deadlock or a severe blocking issue? A deadlock is what happens when tasks are permanently blocking each other. In this situation the SQL engine picks a victim and terminates it. This is different from a severe blocking issue where tasks wait behind another long running transaction. Can you be more specific about the behaviors/messages/errors you saw?
We try to prevent deadlocks, but since this isn't 100% foolproof, the PI AF Server will retry deadlock victims (as defined by deadlockRetryCount, default = 10). If you are seeing recurring deadlocks then you may want to generate a deadlock graph to get more information. One approach to get this information is with SQL Extended Events. There are quite a few resources online that help explain how to set this up (How to monitor deadlock using extended events in SQL Server 2008 and later – Microsoft SQL Server Tips & Tricks ).
Based on your opening post, I suspect what is happening is not deadlocking, but severe blocking. Specifically, your AF task was stuck waiting behind something the maintenance job was doing and the client timed out waiting for a response from the server. You will want to take a closer look at the blocking transaction. Tools to do this include things like sp_who2, sp_whoisactive, and the All Blocking Transactions database report. Do these implicate the maintenance job? If so, which part of the job? How long does it take the blocker to complete (this might be tracked with additional event tracing)? In situations where you have two heavy workloads butting heads with each other it may also be worthwhile to see if you can offset their timing (typically, the maintenance job is scheduled during periods of inactivity).
For recurring problems, you may want to open a case with OSIsoft Technical Support; supply your version of PI AF Server, version and edition of SQL Server, and any supporting information (stuff like description of behavior/messages/errors, deadlock graphs, SQL event tracing, etc.).
Thanks for your replies,
I agree it's more likely to be a severe blocking issue directly related to the AF Maintenance Job being called when my other apps were actively doing Event Frames related transaction... the PIFD is installed on an AlwaysOn cluster where reside +20 other application database and the DBA is already monitoring / managing deadlocks and transaction log growth.
The thing is my apps handle multiple 24/7 process units in multiple factories, there is simply no timeframe where I can be assured to avoid conflicts... To what I have seen, this AF Maintenance Job, mostly the REINDEX part, needs to run at least every month or the system becomes slow to non-responsive. Hopefully the next version will improve / fix this performance issue.
You were asking about the system behavior where this happened. Here is a screenshot of the RPC metrics when they noticed me about the issue (3h after the event)
The RPC deltas per call looks pretty bad for several RPCs. If this reoccurs we should take a closer look with sp_who2 or sp_whoisactive to get more information on what is slow and why it is slow.
Since the AF SQL database resides within an AlwaysOn Availability Group, you should be running Enterprise Edition of SQL Server 2012 or later. Assuming you are running AF 2015 (2.7) or later, this suggests the maintenance job should be doing an online index operation. This reduces the need for long-term table locks, but may still cause blocking during finalization. Use of the aforementioned tools might help identify this.
What frequency do you run the maintenance job?
The Maintenance Job was scheduled once a week and it is now fully disabled because the problem mentioned above happened once more last week. it seems that my c# windows services conflicts with the re indexing job and everything related to EF gets blocked "forever" or until I stop the job manually. Unfortunately, this issue occurred both time in production and there is no time to investigate, only to resolve and apologize ☺
Any idea if future version of AF will avoid the maintenance job?