Recently when diagnosing an issue with a Sitecore website hosted on Azure Web Apps, I noticed an intermittent issue where we were seeing the application going up and down frequently. i.e. up for a minute, down for two, up, down, up, down.
Investigation led me to see that it was one instance of the Web App that was having issues. With Azure Web Apps it is not easy to see quickly that you have a problem with just one instance, but after a bit of digging I could see that occasionally an instance was exceeding the 90% memory threshold for a prolonged period of time which in turn initiates an application recycle. Every so often the site would not come back up after the restart. The stats showed that the w3wp process is running but at near 0% utilisation.
I initially thought that Azure would automatically detect that the instance was unhealthy and stop sending traffic to it. However after several times of this happening it was clear that this was not the case as we would always see the same symptoms (frequent switching between up and down on our monitoring).
In looking into monitoring health of Web Apps I found that Azure do have a feature to handle this and it has only been around since August last year and is currently in preview. It is called Health Check. AT the time of writing it is still being refined but it currently pings a given URL every 2 minutes and if it doesn’t receive a healthy response after 5 concurrent attempts (i.e. 10 minutes) it will remove the instance from the load balancer. If it still is unhealthy after an hour it will restart the instance. This is exactly what we needed and quickly I had this in place. For a quick win I simply pointed this at a search result page as this tests that pages are working along with Azure Search queries. You would want to implement a custom URL that tests all of the key features of your website and return the correct status code accordingly.
Naturally, being Sitecore there was a gotcha. After following the instructions to getting it set up I was seeing that both my instances were unhealthy. I suspected that a redirect could be getting in the way of returning the required 2xx status code. After a bit of digging I found that the issue was the health check requests were coming in on a different hostname and Sitecore was not serving this to the “website”. Instead it was falling under the remit of the “scheduler” site (from the <sites> configuration), and redirecting to the page not found page.
There are many ways you could deal with this but the simplest for me was to add the hostnames that Azure was using, to the “website” configuration in the <sites> section. A quick bit of logging revealed that it uses the application name (i.e. the default address for your Web App but without the “.azurewebsites.net” part). I added these to the “hostName” parameter.
Once you get past the Sitecore gotcha it’s a piece of cake and a crucial addition to any solution. I had assumed that this all just happened out of the box but with the root being hit, but this goes to show you should never assume.