Does anyone have any recommendations for what surveillance and corrective actions can be passed to an organisation's IT Server Support Team?
We have a 24x7 Server Support Team that use BMC Patrol to monitor servers, and I am investigating whether we can get this team to provide some level of monitoring of PI as well. (We only have PI Support available during office hours.) This team has no PI knowledge, so I need to keep things as simple as possible.
My initial thinking is that we will get this Team to monitor the list of PI Services I give them and notify us if anything stops; however, I would like to be able to give them more detailed instructions for actions they could take when a problem occurs outwith office hours. Obviously this is a bit more complicated, because while some services could just be restarted if they have stopped, others are interconnected and may require more complex actions.
- All actions are manually initiated, i.e. we need to supply a help file that says what is to be done for each trigger condition detected.
- I am thinking of simply classifying the services into two groups: those that can be restarted on their own, and those that require a reboot (as the simplest way to do a full PI restart). If anyone has more detailed recommendations, I would be interested.
- The Server Support Team can monitor Performance Counters too. I am thinking about looking at I/O rates, queue sizes, etc., but any other suggestions would be appreciated.
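To make the "help file" idea concrete, here is a minimal sketch of the two-group classification plus counter thresholds as a lookup table that can be printed as a plain-text runbook. Everything in it is an assumption for illustration: the service names (`ExampleInterfaceService`, `ExampleCoreService`) are hypothetical placeholders for your actual PI services, the action wording is illustrative, and the threshold values are made-up starting points that would need tuning against your own baseline. The counter paths are standard Windows Performance Monitor counters.

```python
from dataclasses import dataclass

# Manual actions the Server Support Team can take (wording is illustrative only).
RESTART = "Restart the stopped service; if it stops again within 15 minutes, escalate."
REBOOT = "Reboot the PI server (full PI restart); confirm all monitored services start."
ESCALATE = "No corrective action; contact the on-call PI person or raise a ticket."

# HYPOTHETICAL service names: replace with the PI services actually installed.
SERVICE_GROUPS = {
    "ExampleInterfaceService": RESTART,  # standalone: safe to restart on its own
    "ExampleCoreService": REBOOT,        # interconnected: needs a full restart
}


@dataclass(frozen=True)
class CounterRule:
    counter: str      # Windows performance counter path
    threshold: float  # HYPOTHETICAL upper limit; tune against your own baseline
    action: str


COUNTER_RULES = [
    CounterRule(r"\PhysicalDisk(_Total)\Avg. Disk Queue Length", 2.0, ESCALATE),
    CounterRule(r"\Processor(_Total)\% Processor Time", 90.0, ESCALATE),
]


def action_for_stopped_service(name: str) -> str:
    # Services not in the table default to escalation rather than a guessed fix.
    return SERVICE_GROUPS.get(name, ESCALATE)


def actions_for_counters(readings: dict[str, float]) -> list[str]:
    # Return one instruction per counter whose reading exceeds its threshold.
    return [
        f"{r.counter} = {readings[r.counter]} > {r.threshold}: {r.action}"
        for r in COUNTER_RULES
        if r.counter in readings and readings[r.counter] > r.threshold
    ]


def help_file() -> str:
    """Render the runbook as plain text for the Server Support Team."""
    lines = ["If a service stops:"]
    for name, action in SERVICE_GROUPS.items():
        lines.append(f"  {name}: {action}")
    lines.append("If a counter exceeds its threshold:")
    for r in COUNTER_RULES:
        lines.append(f"  {r.counter} > {r.threshold}: {r.action}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(help_file())
```

Defaulting unknown services to escalation keeps a team with no PI knowledge from improvising; only services someone has explicitly classified get a restart or reboot instruction.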