Monitoring and Alerting - Pragmatic Software Engineering

As a developer I hate the thought of being interrupted to deal with some messy problem on a production system, and doubly so out-of-hours. However, as a member of a team which supports the services it runs, I have to be in a position to support the service I develop. What follows are some of the lessons I’ve learnt about monitoring and alerting from supporting a production system.

Monitor The Things You Care About Directly

There’s a very easy trap to fall into of relying purely on out-of-the-box monitoring to understand whether your application is misbehaving or not. Unfortunately, this leads to proxy metrics and, as a result, noisy alerts.

Example – most web applications will have some logs around access at the HTTP layer, almost for free. Those could be coming from a proxy (NGINX, Apache) or from the application framework itself (e.g. Dropwizard or Spring Boot). Typically these log lines give you a relatively standard set of information (request URI, caller IP, response code, latency, etc.) and you can usually find them in a ‘standard-ish’ format that’s almost trivially easy for log aggregators (Splunk, Elasticsearch) to consume. It’s *very* easy to create an alert based on these logs. Response code is a common choice: people create alerts for any 5xx responses, or 4xx responses, or some ratio over some given time. Sometimes, though, that’s just not a good measure of an application’s behaviour. I have a REST API, for example, which uses 409 to indicate a normal condition which I expect clients to be able to handle. I use 504 to indicate back-pressure scenarios. And so on.

Now, it’s possible to add logic to my alerts to filter out those specific response codes, but it’s the start of a slippery slope towards unmaintainably complex alerts. The real problem is that the HTTP response code is a proxy variable which is being used to understand multiple complex aspects of application health. Do I really care if my application responds with 500? The answer is that it depends. Sometimes I do and sometimes I don’t. What does an individual ‘Internal Server Error’ mean in terms of a given web application? There are a number of different possibilities – it could be a transient, unexplainable condition (e.g. a connection failure to a dependency) bubbling up as odd behaviour in your application, or it could indicate an ongoing problem within the application.

What’s a better solution? Explicitly log each of those situations in your application and target alerting at those log events. So, if the service starts failing because of a connection issue to a dependency, you’ll know:

that that’s what it is
which dependency is at fault
how many requests were affected by *that* issue

So when you get an alert based on the issue you’ll have much more information to go on.
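As a minimal sketch of what that explicit logging might look like, the snippet below wraps a call to a dependency and emits one structured event per affected request. The service name, event name, field names and the `call_dependency` helper are all hypothetical, chosen for illustration rather than taken from any particular framework.

```python
import json
import logging

logger = logging.getLogger("orders-service")  # hypothetical service name


class DependencyUnavailable(Exception):
    """Raised when a call to a named downstream dependency fails."""

    def __init__(self, dependency, cause):
        super().__init__(f"{dependency} unavailable: {cause}")
        self.dependency = dependency
        self.cause = cause


def failure_event(dependency, request_id, cause):
    """Build one structured event per affected request (field names are illustrative)."""
    return {
        "event": "dependency_connection_failure",  # what the problem is
        "dependency": dependency,                   # which dependency is at fault
        "request_id": request_id,                   # counting these gives requests affected
        "cause": type(cause).__name__,
    }


def call_dependency(dependency, request_id, operation):
    """Run `operation`, logging a dedicated event if the connection fails."""
    try:
        return operation()
    except ConnectionError as exc:
        logger.error(json.dumps(failure_event(dependency, request_id, exc)))
        raise DependencyUnavailable(dependency, exc) from exc
```

An alert can then match on the `dependency_connection_failure` event and group by `dependency`, rather than trying to infer the same thing from a response code.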

In short, just because we *can* measure something doesn’t mean that we *should* find a reason to alert on it. It’s much preferable to reason about the failure modes of your application and then find ways to alert on them – possibly driven via metrics or logging points explicitly added to the application for that purpose.

Make Alerts Actionable

Every alert, particularly ones which trigger out-of-hours, should indicate what the problem is and, as a result, what the operator should be doing about it. Alerts which don’t have a clear action associated with them will be responded to differently by different people. Some people will do the minimal amount necessary to allow them to ignore the alert, possibly risking missing a genuine issue. Some will spend hours getting to the root cause of the alert, and then making a change which may be unnecessary on the basis that something must be done.

An interesting exercise can be to go through your alerts and ask what the action is for each one. If the answer is ‘investigate why this is happening’ then you’re probably missing some lower-level alert based on a metric which actually identifies the problem. Where it’s ambiguous whether an alert always identifies a real problem, the alert itself should be considered noisy and be replaced by the lower-level alerts which directly identify real problems.
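That exercise can be partly mechanised. The sketch below assumes each alert definition carries a free-text `action` field, and flags any alert whose action is empty or starts with a vague verb. The `Alert` class, the example alerts and the list of vague phrases are all hypothetical.

```python
from dataclasses import dataclass

# Phrases which suggest the operator is being asked to diagnose, not act.
VAGUE_ACTIONS = ("investigate", "look into", "check why")


@dataclass
class Alert:
    name: str
    condition: str
    action: str  # what the operator should do when this fires


def needs_rework(alert):
    """True if the alert has no action, or only a vague one."""
    action = alert.action.strip().lower()
    return action == "" or action.startswith(VAGUE_ACTIONS)


def audit(alerts):
    """Return the names of alerts which lack a clear action."""
    return [a.name for a in alerts if needs_rework(a)]


alerts = [
    Alert("5xx ratio high", "5xx / total > 1% over 5m",
          "Investigate why this is happening"),
    Alert("inventory-db connection failures", "more than 10 failures in 1m",
          "Fail over to the replica as described in the runbook"),
]
print(audit(alerts))
```

A string match is obviously a crude test, but it is usually enough to produce a list of alerts worth discussing.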

Automate Recovery

For a highly available service, one with availability requirements measured in numbers of ‘9’s (e.g. 99.99%), if manual intervention is needed to recover then the service cannot meet its SLA. Human intervention is untimely and failure-prone – we *can’t* allow it as a normal part of our service lifecycle.
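The arithmetic behind that claim is worth spelling out. The helper below (the function name is mine) converts an availability target into the downtime it allows over a period, assuming a 365-day year.

```python
def downtime_budget_minutes(availability_percent, period_days=365.0):
    """Minutes of downtime allowed per period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}%: {downtime_budget_minutes(target):.1f} minutes/year")
```

At 99.99% the budget is roughly 52.6 minutes for the whole year. A single out-of-hours incident, where someone has to wake up, log in and diagnose before acting, can plausibly use a large share of that.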

So, in monitoring and alerting terms, it’s just not good enough to detect failure and call for help. The system *must* be able to deal with some of the more predictable failure modes – node failure, data centre failure, network failure. There are plenty of tools to help with this – autoscaling groups in AWS, ReplicaSets in Kubernetes, healthchecks, circuit-breakers, statelessness, redundant physical network connections and scaling across multiple geographic locations.
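To illustrate one of those tools, here is a minimal circuit-breaker sketch. It is not taken from any particular library: the class name, thresholds and timeout are assumptions, and a production implementation would also need to handle concurrency.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of piling more load on a struggling dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The point is that the system recovers on its own: once the dependency comes back, the first successful trial call closes the circuit without anyone being paged.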

Monitor Recovery Actions

If you’ve automated recovery, though, it’s important to be aware that some recovery actions are (or were) preventable, so you *should* track them. OOM conditions, null pointer exceptions, resource leaks and more can trigger an automated recovery action because of a problem within the application. These should generally be investigated, but the investigation doesn’t need to sit in the critical path of keeping the application running.
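Tracking can be as simple as tallying recovery actions by cause and reviewing the tally during working hours. The sketch below assumes recovery events arrive as dictionaries with a `cause` field; the cause names are hypothetical and would depend on what your platform reports.

```python
from collections import Counter

# Causes which point at a bug in the application rather than the environment.
# These labels are illustrative, not a standard vocabulary.
PREVENTABLE = {"oom_killed", "unhandled_exception", "resource_leak"}


def preventable_recoveries(events):
    """Count recovery actions whose cause suggests a fixable application problem."""
    return Counter(e["cause"] for e in events if e["cause"] in PREVENTABLE)


events = [
    {"instance": "api-1", "cause": "oom_killed"},
    {"instance": "api-2", "cause": "node_failure"},
    {"instance": "api-1", "cause": "oom_killed"},
    {"instance": "api-3", "cause": "unhandled_exception"},
]
print(preventable_recoveries(events).most_common())
```

A rising count for one cause is a prompt to open a ticket, not to wake anyone up.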

Look For Predictive Alerts

The best form of alert is the one which tells you that failure *will* happen without intervention. These typically can be tuned such that they don’t even need to trigger out-of-hours callouts. For a highly available service it’s important for manual interventions to be requested before failure to avoid breaching availability SLAs.

It’s great to have alerts for when the service *has* failed but it’s almost impossible to intervene manually and maintain a service’s availability SLA. The idea is to avoid having those alerts fire by having predictive alerts which identify situations where manual intervention *will* be required *before* it’s required. Disk space is a simple example – if the alerting waits until the disk is >95% full before triggering then the service risks becoming unavailable. By dialling the alert down to ~70% the support team has time to intervene within normal working hours.
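A fixed 70% threshold is the simplest version of this. A variation, sketched below, is to extrapolate recent growth and alert when the disk is projected to fill within some lead time. This is my own illustration rather than a prescribed method: it assumes linear growth between the first and last sample, and the 72-hour lead time is arbitrary.

```python
def hours_until_full(samples, capacity=100.0):
    """Estimate hours until the disk is full.

    `samples` is a list of (hours_since_start, percent_used), oldest first.
    Returns None if usage is flat or shrinking.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    rate = (u1 - u0) / (t1 - t0)  # percentage points per hour
    if rate <= 0:
        return None
    return (capacity - u1) / rate


def should_alert(samples, lead_time_hours=72.0):
    """Alert if the disk is projected to fill within the lead time."""
    remaining = hours_until_full(samples)
    return remaining is not None and remaining <= lead_time_hours
```

For example, a disk which went from 50% to 74% in a day is projected to fill in 26 hours and would alert, while one creeping from 60% to 62% over the same day has weeks left and would not.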

Of course, any kind of manual intervention should be avoided so it’s also usually best to avoid having conditions which *require* manual intervention.

Use Semantic Monitoring

All metrics are really a proxy for user experience – does the user’s experience of your service meet the SLA which you claim to provide? Anything which we measure from within the application is only, at best, an approximation which can help answer that question, but none of these measurements tells the whole story. Request latency is a great metric but it tells me nothing about the network access *to* the application. The number of available instances can tell me that instances were always running but it can’t tell me whether the application was unavailable due to resource starvation, or deadlock, or some sort of bug or internal problem. CPU, disk and memory usage can tell me that a particular instance has resources available but they don’t tell me what’s happening with the resources in use. The net result is that all alerts driven from internal metrics are prone to both false negatives and false positives.

Semantic monitoring aims to address this issue by moving monitoring out of the application entirely. By driving the web application *as if* you were a customer, making real API requests against the application, you can see how your application *appears* to end users. The closer we can make the test infrastructure model the customer’s geographic spread the closer we can make the monitoring accurately represent the customer experience.
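A minimal sketch of such a probe, using only the Python standard library, might look like the following. The success criteria (a 200 response within one second) are assumptions, and the `fetch` and `clock` parameters exist only so the probe can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def http_get_status(url, timeout=5.0):
    """Fetch `url` and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status code


def probe(url, fetch=http_get_status, max_latency_seconds=1.0, clock=time.monotonic):
    """Make one request as a customer would and report what they would see."""
    started = clock()
    try:
        status = fetch(url)
    except Exception as exc:  # DNS failure, timeout, connection refused, ...
        return {"url": url, "ok": False, "reason": type(exc).__name__}
    latency = clock() - started
    ok = status == 200 and latency <= max_latency_seconds
    return {"url": url, "ok": ok, "status": status, "latency_seconds": latency}
```

Run on a schedule from several locations, with the results fed to the alerting system, this gives a view of the service from outside. A real probe would normally exercise a meaningful API call rather than a bare GET.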

This sort of monitoring is great as a last line of defence for a service – it *definitely* tells you that a service appears to be unavailable. Unfortunately, it tells you almost nothing about what the problem is. Also, *if* this sort of monitoring ever indicates an issue then you most likely have a customer-visible issue and are likely to be breaching an SLA. So, it’s best to consider alerts based on semantic monitoring as big problems. In a well monitored system there should be a suite of lower-level predictive alerts which will prevent the semantic monitoring ever failing. If problems internal to a service ever cause a semantic monitoring alert to fire then there’s already an issue in the monitoring.

Still, semantic monitoring alerts can provide a useful view of problems outside of your control which customers may themselves be experiencing and which they may view as problems within your service. At least the semantic monitoring gives you an opportunity to address / diagnose / resolve the problem before customers start calling.

Don’t Alert On Vanity Metrics

Many of the things which we can measure and graph are not necessarily useful for alerting purposes but are still valuable metrics to collect and put on a dashboard. These kinds of metrics are known as vanity metrics and although they aren’t useful for alerting they can be very useful for understanding the dynamics of a working system, as well as making for impressive dashboards for sales or marketing purposes. Request rates don’t actually tell us much which can be acted on, for example, but they do make for very good graphs to demonstrate an understanding of traffic distribution. Similarly, breakdowns of traffic by client location, client identity, or by different parts of the API being accessed can all provide visibility over stressors which *might* be valuable in investigating issues to identify underlying causes.