Eugene Lee

AF Server container health check

Blog post created by Eugene Lee on Aug 23, 2018

Note: Development and Testing purposes only. Not supported in production environments.

 

Link to other containerization articles

Containerization Hub

 

Introduction

In a complex infrastructure that spans several data centers and has multiple dependencies with minimum service uptime requirements, services will inevitably still fail occasionally. The question then is how we can manage such failures so that we maintain a high-availability environment and keep downtime as low as possible. In this blog post, we will talk about how to implement a health check in the AF Server container to help with that goal.

 

What is a health check?

A running container is not necessarily a working one, i.e. one that is performing the service it is supposed to provide. In Docker Engine 1.12, a new HEALTHCHECK instruction was added to the Dockerfile so that we can define a command that verifies the state of health of the container. It is the same concept as a health check for humans: making sure that your liver or kidneys are working properly so that you can take preventative measures before things get worse. In the container scenario, the exit code of the command determines whether the container is operational and doing what it is meant to do.

 

In the AF Server context, we need to decide what it means for the AF Server to be 'healthy'. Luckily, AF Server already exposes a suitable indicator: a Windows PerfMon counter called AF Health Check. If both the AF application service and the SQL Server are running and responding, this counter returns a value of 1. Another way to check for health is to verify that a service is listening on port 5457, since AF Server uses that port. We can also test whether the AF service itself is running. Including all of these tests will make our health check more robust.

 

Define health tests

For the first measure of health, we will use the Get-Counter PowerShell cmdlet to read the value of the performance counter.

A value of 1 indicates that the AF Server and SQL Server are healthy while 0 means otherwise.
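As a minimal sketch, the counter can be read interactively like this. The counter path is the one used by check.ps1 later in this post; run it inside the AF Server container or on a machine where AF Server is installed.

```powershell
# Read the AF Server health counter once.
# Counter path taken from check.ps1 in this post.
$sample = Get-Counter "\PI AF Server\Health"

# CounterSamples holds the readings; CookedValue is the calculated value.
$value = $sample.CounterSamples | Select-Object -ExpandProperty CookedValue

Write-Host "AF Server health counter: $value"
```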

 

The second measure of health is to test for a service listening on port 5457. We will use the PowerShell cmdlet Get-NetTCPConnection to do so.

When there is no listener on port 5457, we will get an error.
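A short sketch of the port test follows. The first command is the interactive form; the second shows one way to suppress the error and branch on the result, which is handier inside a script. The message strings are illustrative only.

```powershell
# Interactive form: lists listeners on port 5457, or writes an error if none exist.
Get-NetTCPConnection -LocalPort 5457 -State Listen

# Script-friendly form: suppress the error and test whether anything came back.
$listener = Get-NetTCPConnection -LocalPort 5457 -State Listen -ErrorAction SilentlyContinue
if ($listener) {
    Write-Host "Port 5457 has a listener"
} else {
    Write-Host "Nothing is listening on port 5457"
}
```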

 

The third measure of health is to check whether the service is running by using the Get-Service PowerShell cmdlet.
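For illustration, the service check might look like this, assuming the service name afservice that check.ps1 uses later in this post:

```powershell
# Show the AF application service and its current state.
Get-Service afservice | Select-Object Name, DisplayName, Status

# Just the status value, which is what a script would compare against "Running".
Get-Service afservice | Select-Object -ExpandProperty Status
```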

 

Integrate into Docker

With the health tests in hand, how can we ask Docker to perform them? The answer is the HEALTHCHECK instruction in the Dockerfile, which tells the Docker Engine to carry out the tests at regular intervals that can be defined by the image builder or the user. The syntax of the instruction is:

 

HEALTHCHECK [OPTIONS] CMD command

 

The options that can appear before CMD are:

  • --interval=DURATION (default: 30s)
  • --timeout=DURATION (default: 30s)
  • --start-period=DURATION (default: 0s)
  • --retries=N (default: 3)

 

For more information on what the options mean, see the HEALTHCHECK section of the Dockerfile reference.

I will be using a start-period of 10s to allow the AF Server some time to initialize before the health checks start counting. The other options I will leave at their defaults. The user of the image can still override these options with docker run.
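As a sketch of such an override: docker run accepts --health-interval, --health-timeout, --health-start-period and --health-retries flags that mirror the HEALTHCHECK options, plus --no-healthcheck to turn the check off. The container names af18test and af18nocheck and the durations below are arbitrary example values, not part of the original setup.

```powershell
# Override the health check timing baked into the image (example values only).
docker run -d -h af18test --name af18test `
    --health-interval=1m `
    --health-start-period=30s `
    --health-retries=5 `
    elee3/afserver:18x

# Alternatively, disable the health check defined in the image altogether.
docker run -d -h af18nocheck --name af18nocheck --no-healthcheck elee3/afserver:18x
```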

 

The command’s exit status indicates the health status of the container. The possible values are:

  • 0: success - the container is healthy and ready for use
  • 1: unhealthy - the container is not working correctly
  • 2: reserved - do not use this exit code

 

The command will be a PowerShell script that runs the aforementioned tests. The instruction will therefore look like this:

HEALTHCHECK --start-period=10s CMD powershell .\check.ps1

 

Here are the contents of check.ps1

# Test 1: is a service listening on port 5457?
$listener = Get-NetTCPConnection -LocalPort 5457 -State Listen -ErrorAction SilentlyContinue
if (-not $listener)
{
    Write-Host "No one listening on 5457"
    exit 1
}

# Test 2: is the AF application service running?
$status = Get-Service afservice | Select-Object -ExpandProperty Status
if ($status -ne "Running")
{
    Write-Host "PI AF Application Service (afservice) is not running (status: $status)."
    exit 1
}

# Test 3: AF Server health counter (1 = healthy, 0 = unhealthy).
# Comparing against 1 also treats a failed counter read ($null) as unhealthy.
$counter = Get-Counter "\PI AF Server\Health" | Select-Object -ExpandProperty CounterSamples | Select-Object -ExpandProperty CookedValue
if ($counter -ne 1)
{
    Write-Host "The health counter is $counter. This might mean either"
    Write-Host "1. SQL Server is non-responsive"
    Write-Host "2. SQL Server is responding with errors"
    exit 1
}

# All tests passed: report the container as healthy
exit 0

 

Usage

The container image elee3/afserver:18x has been updated with the health check. You can pull it from the Docker repository with

docker pull elee3/afserver:18x

 

You can have some fun with it. Let me spin up a new AF Server container based on the new image.

docker run -d -h af18 --name af18 elee3/afserver:18x

 

Now, let's do a

docker ps

 

Notice that my other container af17, which is based on the elee3/afserver:17R2 image, doesn't have any health status next to its status because a health check was not implemented for it, while container af18 indicates "(health: starting)". Let's run docker ps again after waiting for a little while.

Notice that the health status has changed from 'starting' to 'healthy' after the first test passes. The first test runs one interval (as configured in the options) after the container is started.

 

We can also do

docker inspect af18 -f "{{json .State.Health}}"|ConvertFrom-Json|select -expandproperty log

to see the health logs.
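The same JSON also carries a summary that can be read without walking the whole log. A minimal sketch, assuming the af18 container from above; Status, FailingStreak and Log (with Start, End, ExitCode and Output per entry) are the fields Docker reports under .State.Health:

```powershell
# Parse the health section of the container state once.
$health = docker inspect af18 -f "{{json .State.Health}}" | ConvertFrom-Json

# Current status: starting, healthy or unhealthy.
$health.Status

# Number of consecutive failed checks.
$health.FailingStreak

# Most recent check result, including whatever check.ps1 wrote to the console.
$health.Log | Select-Object -Last 1 | Format-List Start, End, ExitCode, Output
```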

 

Health event

When the health status of a container changes, a health_status event is generated with the new status. We can observe this using docker events. We will now intentionally break the container by stopping the SQL Server service and then try to connect with PSE (PI System Explorer).

The connection fails, which is expected. Now let us check using docker events, which is a tool for getting real-time events from the Docker Engine.

 

We can do a filter on docker events to only grab the health_status events for a certain time range so that we do not need to be concerned with irrelevant events. Let us grab those health_status events for the past hour for my container af18.

(docker events --format "{{json .}}" --filter event=health_status --filter container=af18 --since 1h --until 1s) | ConvertFrom-Json|ForEach-Object -Process {$_.time = (New-Object -Type DateTime -ArgumentList 1970, 1, 1, 0, 0, 0, 0).addSeconds($_.time).tolocaltime();$_}|select status,from,time
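The ForEach-Object step in that pipeline converts the time field, which docker events reports as seconds since the Unix epoch (1970-01-01 00:00:00 UTC), into local time. As a sketch, the conversion can be pulled out into a small helper to make the one-liner easier to read. ConvertFrom-UnixSeconds is a hypothetical name introduced here for illustration, not a built-in cmdlet.

```powershell
# Hypothetical helper (not part of the original post): convert Unix epoch
# seconds, as found in the "time" field of docker events JSON, to local time.
function ConvertFrom-UnixSeconds {
    param(
        [Parameter(Mandatory = $true)]
        [long]$Seconds
    )
    # The epoch is defined in UTC, so build it with an explicit UTC kind.
    $epoch = New-Object DateTime -ArgumentList 1970, 1, 1, 0, 0, 0, ([DateTimeKind]::Utc)
    return $epoch.AddSeconds($Seconds).ToLocalTime()
}
```

With the helper loaded, the same query could be written as:

```powershell
(docker events --format "{{json .}}" --filter event=health_status --filter container=af18 --since 1h --until 1s) |
    ConvertFrom-Json |
    ForEach-Object { $_.time = ConvertFrom-UnixSeconds $_.time; $_ } |
    Select-Object status, from, time
```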

 

We can also check with

docker ps

 

and with docker inspect, which can give us clues about what went wrong.

docker inspect af18 -f "{{json .State.Health}}"|ConvertFrom-Json|select -expand log|fl

 

With the health check, it is now obvious that even though the container is running, it doesn't work when we try to connect to it with PSE.

We will now restart the SQL Server service and try connecting with PSE again. We can check whether the container becomes healthy again by running

 

docker ps

and

(docker events --format "{{json .}}" --filter event=health_status --filter container=af18 --since 1h --until 1s) | ConvertFrom-Json|ForEach-Object -Process {$_.time = (New-Object -Type DateTime -ArgumentList 1970, 1, 1, 0, 0, 0, 0).addSeconds($_.time).tolocaltime();$_}|select status,from,time

As expected, a new health_status event is generated which indicates healthy.

 

Conclusion

We can take the health check mechanism further by using a container orchestrator such as Docker Swarm, which can detect the unhealthy state of a container and automatically replace it with a new, working container. This will be discussed in a future blog post, so stay tuned!
