Health Monitoring of Azure Web Apps with Sitecore

Recently when diagnosing an issue with a Sitecore website hosted on Azure Web Apps, I noticed an intermittent issue where we were seeing the application going up and down frequently. i.e. up for a minute, down for two, up, down, up, down.

Investigation led me to see that it was one instance of the Web App that was having issues. With Azure Web Apps it is not easy to see quickly that you have a problem with just one instance, but after a bit of digging I could see that occasionally an instance was exceeding the 90% memory threshold for a prolonged period of time which in turn initiates an application recycle. Every so often the site would not come back up after the restart. The stats showed that the w3wp process is running but at near 0% utilisation.

I initially thought that Azure would automatically detect that the instance was unhealthy and stop sending traffic to it. However after several times of this happening it was clear that this was not the case as we would always see the same symptoms (frequent switching between up and down on our monitoring).

In looking into monitoring health of Web Apps I found that Azure do have a feature to handle this and it has only been around since August last year and is currently in preview. It is called Health Check. AT the time of writing it is still being refined but it currently pings a given URL every 2 minutes and if it doesn’t receive a healthy response after 5 concurrent attempts (i.e. 10 minutes) it will remove the instance from the load balancer. If it still is unhealthy after an hour it will restart the instance. This is exactly what we needed and quickly I had this in place. For a quick win I simply pointed this at a search result page as this tests that pages are working along with Azure Search queries. You would want to implement a custom URL that tests all of the key features of your website and return the correct status code accordingly.

healthCheckPath setting.

Sitecore Gotcha

Naturally, being Sitecore there was a gotcha. After following the instructions to getting it set up I was seeing that both my instances were unhealthy. I suspected that a redirect could be getting in the way of returning the required 2xx status code. After a bit of digging I found that the issue was the health check requests were coming in on a different hostname and Sitecore was not serving this to the “website”. Instead it was falling under the remit of the “scheduler” site (from the <sites> configuration), and redirecting to the page not found page.

There are many ways you could deal with this but the simplest for me was to add the hostnames that Azure was using, to the “website” configuration in the <sites> section. A quick bit of logging revealed that it uses the application name (i.e. the default address for your Web App but without the “.azurewebsites.net” part). I added these to the “hostName” parameter.

Once you get past the Sitecore gotcha it’s a piece of cake and a crucial addition to any solution. I had assumed that this all just happened out of the box but with the root being hit, but this goes to show you should never assume.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s