This is a tough question. I prefer that people troubleshooting PI have some sort of PI training (you know, as someone who is infected with the PI 'virus', you always want to make sure your software gets the best treatment available).
Anyhow, I see the requirement here, but nevertheless there is not much you can do besides restarting a service that is not running (and making sure the guys don't complain that the shutdown subsystem is not running).
What the Server Support Team clearly can do is monitor the system and make sure they call somebody if there is more to fix than starting a service. In some cases starting the service, or even rebooting the system, will not help you at all (corrupted files etc.).
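Just to make the "restart-only" role concrete, here is a rough Python sketch of what such a support script might look like on a Windows PI Server. It is an illustration only: the service name is a placeholder, and the parsing assumes the usual `STATE` line that `sc query` prints.

```python
import subprocess

def needs_restart(sc_query_output: str) -> bool:
    """Return True if the STATE line of `sc query` output shows the
    service is not RUNNING (e.g. STOPPED). If no STATE line is found,
    assume something is wrong and report a restart is needed."""
    for line in sc_query_output.splitlines():
        line = line.strip()
        if line.startswith("STATE"):
            return "RUNNING" not in line
    return True

def restart_if_stopped(service_name: str) -> None:
    """Query the service with `sc` and start it if it is not running.
    (service_name is a placeholder; would run on the PI Server itself.)"""
    out = subprocess.run(["sc", "query", service_name],
                         capture_output=True, text=True).stdout
    if needs_restart(out):
        subprocess.run(["sc", "start", service_name], check=False)
```

Anything beyond this (service keeps stopping, won't start, corrupted archives) is exactly the point where the script should give up and the team should call somebody.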
As guidance for the performance counters, I suggest using the template provided with SMT. Set up the PI Performance Monitor Interface on the PI Server (use the ICU) and then use the PI SMT 3 IT Points > Performance Monitor Points plug-in to create the tags and record the data to PI. (If you use SMT 3, you can load a template for a PI Server that shows you all the tags we consider important for general monitoring.) Monitor your system so that you have a reasonable baseline for the values. This would give your support team a basic idea of the health of the PI System.
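The "reasonable baseline" part can be done very simply once the counter history is in PI: record the values for a while, then flag readings that sit far outside what the counter normally does. A minimal sketch of that idea (plain Python on a list of sampled values; nothing PI-specific):

```python
from statistics import mean, stdev

def baseline(samples):
    """Summarise recorded counter values (e.g. a week of history)
    as a mean and standard deviation."""
    return mean(samples), stdev(samples)

def is_abnormal(value, samples, n_sigma=3.0):
    """Flag a reading that sits more than n_sigma standard deviations
    away from the recorded baseline."""
    m, s = baseline(samples)
    return abs(value - m) > n_sigma * s
```

The 3-sigma cutoff is just a starting point; tune it per counter once you have seen what "normal" looks like on your system.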
You might want to take a look at:
To read about performance monitoring and increasing the reliability of the PI System.
The training material is available as well:
and can provide you with some material to train your support team.
I prefer that people troubleshooting PI have some sort of PI training (you know, as someone who is infected with the PI 'virus', you always want to make sure your software gets the best treatment available).
I couldn't agree with this more, especially if the PI system is critical to operations.
Maybe all the IT people could join vCampus and take advantage of the PI System Manager I online CBT in the Training Center. That would give them enough knowledge of PI to understand what they are stopping/starting.
Thanks for the input, guys.
I am under no illusions about the use of untrained IT support in this context, hence my plan to limit them to restarting services and rebooting, outwith office hours. This is pretty much just on the hope that it may work, because we don't have PI 24x7 support, and I ain't volunteering...
I already have the PerfMon stuff from DevNet and various other things available in ProcessBook screens.
In terms of surveillance of performance counters, I am thinking about queue size for PI and I/O rates, on the basis that my most likely failure is loss/disconnection of the data sources. Most of the failure modes I've seen recently have been around buffering. Do I just have a weak spot there, or are there other specific failure modes that are common?
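For what it's worth, the two counters you mention can be combined into a crude diagnosis: an I/O rate near zero suggests the source has gone away, while a healthy rate with a growing queue suggests a buffering/backlog problem. A rough sketch of that logic (thresholds and labels are made up for illustration):

```python
def diagnose(io_rate_events_per_s, queue_sizes):
    """Crude classification of the two failure modes discussed above.
    `queue_sizes` is a recent series of queue-length samples."""
    queue_growing = len(queue_sizes) >= 2 and queue_sizes[-1] > queue_sizes[0]
    if io_rate_events_per_s < 0.1:
        return "source disconnected or flat-lined"
    if queue_growing:
        return "queue growing: possible buffering/backlog problem"
    return "ok"
```

In practice you would run this per interface node, since one source can be down while the rest are fine.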
The next step I am thinking about is how the PI team can monitor for flat-line data.
- Right now we have a page of trends (noisy tags), so that it's easy to see what is / isn't updating
- I am thinking about adding PEs which look at the number of archive values in the last few hours, so that I can generate alarm states for animating a network diagram.
- Anybody tried this, or got other ideas?
- (I need to look at tags for this because each data source is a DCS, and has multiple data sources internally.)
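The archive-count test the PE would perform is simple enough to sketch outside PI. This is illustrative Python, not PE syntax: count the events whose timestamps fall in a recent window, and call the tag stale if there are too few.

```python
from datetime import datetime, timedelta

def is_stale(timestamps, now, window_hours=4, min_events=1):
    """Flag a tag as flat-lined if fewer than `min_events` archive
    values landed in the last `window_hours` hours. `timestamps` is
    the list of archive event times for the tag."""
    cutoff = now - timedelta(hours=window_hours)
    recent = [t for t in timestamps if t >= cutoff]
    return len(recent) < min_events
```

For noisy tags `min_events` could be much higher than 1, which also catches a source that is limping along rather than completely dead.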
Is AF an option for you?
What I did recently is create a monitoring system using PI AF, PI OLEDB and Perfmon. PI AF was installed with PI Notifications (linked through to Active Directory) for the alerting side of the monitoring. Notifications were based on behaviour changes in a whole host of Perfmon tags, NetworkManager in particular.
For data sources I took 2 approaches.
1) PointSource-based quality (bad tags) and updates (stale tags). For each point source, create tags to store the bad and stale quality percentages, and pull these into AF to be monitored, with notifications sent accordingly. What to look for here is behaviour changes, such as the quality dipping by a certain percentage. What we initially noticed after implementing this is that some systems' update rates are lower than others, so there is no point saying all data sources have to have >= 90% updates within the last 5 minutes. You also get to see some nice trends of how one DCS system connection failure has a knock-on effect on the other point source (performance equations!).
2) Overall system quality and updates. Similar to 1), but when you add in multiple servers across an enterprise you get a nice high-level overview as a starting point.
I used PI OLEDB to pull out the stale and bad tags (there are examples of this in the PI OLEDB user guide) and wrote the percentages back to tags... very simple but very effective.
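The "behaviour change" check in approach 1) boils down to comparing each point source against its own recent history instead of a fixed threshold. A minimal Python sketch of that idea (percentages and the 10-point drop are made-up example numbers):

```python
def quality_dip(history, current, drop_points=10.0):
    """Flag a point source whose current good-quality percentage has
    dropped more than `drop_points` below that source's own recent
    average. Avoids a one-size-fits-all threshold like 90%."""
    if not history:
        return False
    usual = sum(history) / len(history)
    return (usual - current) > drop_points
```

A source that normally runs at 60% updates stays quiet at 55%, while a source that normally runs at 95% raises an alert at 80%, which is exactly the behaviour-change semantics described above.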
Don't forget about ISU too for interface status.
Is an Enterprise Agreement an option? What I am getting at here is ManagedPI, so you and/or your PI support engineers won't have to be up all night; let OSI do that.
Right now my options are limited to what I can do internally with standard PI functionality.
The client has been using PI for some time in an enclosed context, but things are changing very quickly now, so EA and 24x7 Support are under consideration.
We have one site which plans to use AF / Notifications primarily to monitor a very large / complex data collection network, so the outcome of that will probably help shape the course for the other sites in the longer term.