Relic Solution: Avoiding Alert Fatigue

Alert Fatigue

A common scenario I see with the customers I work with is that they want to get a handle on their alerts. Typically, they want to get from this:

To this:

Believe it or not, all it took to go from the first image to the second was a bit of renaming and a single configuration change - and for the majority of customers, the situation is no different. The biggest cause of alert fatigue is not buggy applications, but how organisations and teams manage their alerts and incident responses.

When working with customers on their alerting configurations, here are the five main pieces of advice I have found to be almost universally helpful:

1. Alerting should start with the people who are going to be notified

Often when I’m training new teams on our Alerts platform they start off by thinking about the apps and services they care about, and then try to think of things that could go wrong and how they could alert on them. This is, to my mind, putting the cart before the horse. At the end of the day, the fundamental purpose of alerting (for the majority of customers) is to deliver notifications to people. Because of this, it makes a lot of sense to start with those people as your base and build out your alerting configuration from there.

As an example, if you are a DBA, your team may be responsible for multiple databases and the hardware they run on. You may then think that you will need to have multiple policies, but in many cases it makes more sense to just have one. After all, the notifications are all going to end up going to the same team and likely the very same people. You may think that you would want separate notifications letting you know that both MySQL-1 and MySQL-2 have gone down, but if the end result in both cases is that the same team member is going to have to drop what they’re doing and go investigate, then sending the extra notification doesn’t help much. In fact, splitting things up into multiple policies can actually obscure the underlying problem (that DB-server-1 has run out of memory).

Basing policies on teams can also help those teams to be more agile. If your team has a policy that you own covering all the components you are interested in, then you are empowered to make changes to that policy. The same is not true if your alerts live in a massive “Backend Infrastructure Policy” that spans your whole organisation.

2. Set actionable thresholds

Probably the most common question I get from customers regarding Alerts is “what thresholds should I set?”. This can be tricky to get right. If you have preexisting SLAs or SLOs then these often make good starting points - for example if you have an SLA to keep average response time below 1 second then that translates directly into an alert condition you can set. However, many customers either don’t have these, or the SLAs they do have don’t translate directly into alert conditions.

A common mistake I see at this stage is to take the typical behaviour of an entity and apply a threshold that is just outside this range. For example, you may see that a host typically has about 30% CPU usage and decide to set a threshold at 60% CPU usage. However, this is not a great approach, because while a change in CPU usage from 30% to 60% might be notable, it is not necessarily actionable.

In most cases, you can apply a fairly straightforward rule of thumb when setting alert thresholds: “Would I want to be woken up at 3am because of this?”. If something can be safely ignored at 3am, then arguably it can be safely ignored at any other time of the day - and probably will be. Critical thresholds should be reserved for problems that need to be fixed right now. Anything else can be monitored using warning thresholds.
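The “3am rule” above can be sketched in a few lines. This is purely illustrative - the `classify_cpu` function and the 60%/90% values are invented for the example, not taken from any real alert configuration; real thresholds should reflect what is actually actionable for your service.

```python
def classify_cpu(usage_percent, warning=60.0, critical=90.0):
    """Classify a CPU reading against warning/critical thresholds.

    Hypothetical values: critical is reserved for 'wake someone at
    3am' problems, warning for things that are notable but can wait.
    """
    if usage_percent >= critical:
        return "critical"   # needs to be fixed right now
    if usage_percent >= warning:
        return "warning"    # visible in the UI, but no page
    return "ok"

print(classify_cpu(35))   # typical behaviour -> "ok"
print(classify_cpu(65))   # notable, not actionable -> "warning"
print(classify_cpu(95))   # actionable -> "critical"
```

The key design point is the gap between the two tiers: the warning threshold sits just outside normal behaviour, while the critical threshold sits at the level where someone genuinely has to act.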

Speaking of fixing things…

3. Actually fix the problem!

It’s not uncommon in accounts with many users to see alert incidents that have been open for upwards of six months. While this almost certainly means there are a few conditions that break the “actionable thresholds” rule above, there is also an implicit assumption in our Alerts platform that problems will get fixed within a relatively short timeframe. In other words, we assume that the general workflow looks something like this:

receive alert notification > investigate the problem > work on the problem until it’s fixed

Failing to fix the problem (and leaving incidents and violations sitting open) can lead to all kinds of unintended behaviour. For example, if you are using incident rollups (and I strongly believe that you should!) then not only does the initial problem go unresolved, but other future problems can go undetected as long as that initial incident remains open.

4. Use Incident Rollups

Users coming to us from other systems sometimes have the assumption that every violation of an alert condition will send them a notification. In other words, if I have 20 servers, and I set up a condition against all of them, I could get up to 20 notifications. This isn’t the default behaviour in New Relic Alerts - and in fact this is the noisiest option available. We offer three settings for how noisy you want your policy to be, which we call ‘incident preferences’:

  • By policy
  • By condition
  • By condition and entity

The situation with the 20 servers described above is what would happen under the third setting, ‘By condition and entity’. However, the default option for new alert policies is the first, ‘By policy’, and this is the one we believe leads to the least alert fatigue.
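To make the difference concrete, here is a toy model of how the three preferences group simultaneous violations into incidents (and therefore notifications). The grouping keys and names are my own illustration, not New Relic’s implementation:

```python
def count_incidents(violations, preference):
    """Count incidents opened for a batch of simultaneous violations.

    violations: list of (policy, condition, entity) tuples.
    Each distinct group opens one incident -> one notification.
    """
    groups = set()
    for policy, condition, entity in violations:
        if preference == "by_policy":
            groups.add(policy)
        elif preference == "by_condition":
            groups.add((policy, condition))
        else:  # "by_condition_and_entity"
            groups.add((policy, condition, entity))
    return len(groups)

# One condition covering 20 servers, all violating at once:
violations = [("web", "CPU > 90%", f"server-{i}") for i in range(20)]

print(count_incidents(violations, "by_policy"))                # 1
print(count_incidents(violations, "by_condition"))             # 1
print(count_incidents(violations, "by_condition_and_entity"))  # 20
```

Same outage, same 20 violations - but anywhere from one notification to twenty depending on the preference you choose.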

This setting rolls every violation that happens at the same time in the same policy up into a single incident, and sends only a single notification to let you know that the incident has started. This is far less noisy than the other options, and it relies on three core assumptions to work effectively:

  1. Problems rarely happen in isolation. In other words, whenever something breaks, other things are likely to break as a consequence. This is especially true in modern distributed systems where you may have many different services that talk to and rely on a given component. From an alerts perspective this just means that one alert is likely to be followed by several more as the initial problem causes knock-on effects in other systems. Since the cause of all of these alerts is the same, it makes sense to group them together.
  2. Once someone starts looking at the problem they keep working on it until it’s fixed. Since you will have an engineer working on the problem constantly, they can keep track of the additional failures that happen without everyone else on their team being constantly interrupted by notifications (the failures will still pop up in our UI, they just don’t push out email notifications). If things escalate, the engineer working on the problem can reach out and request additional help from their team, but for the most part the rest of their team can go about their work uninterrupted.
  3. Notifications go to a single team. The above process only works in the context of a single team. If two or more teams are sharing a policy it breaks down, particularly if components belonging to Team A, but that aren’t important to Team B, end up breaking. In this case, Team B would essentially be flying blind, as they won’t get any notifications if something they do care about breaks while that incident is open. Again the best solution here is to split these teams up and give them their own separate policies. If there’s some resource they both care about there’s nothing stopping them from duplicating the same condition in both of their policies.
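The rollup behaviour these assumptions support can be sketched as follows. Again this is a toy model of the idea, not New Relic’s actual code - the `Policy` class and its names are invented for the example:

```python
class Policy:
    """Toy 'by policy' rollup: only the violation that opens an
    incident sends a notification; later violations roll up silently
    (they would still be visible in the UI's violation list)."""

    def __init__(self, name):
        self.name = name
        self.open_incident = None   # list of rolled-up violations
        self.notifications = []

    def record_violation(self, violation):
        if self.open_incident is None:
            # First violation opens the incident and pages the team.
            self.open_incident = [violation]
            self.notifications.append(f"OPEN: {violation}")
        else:
            # Knock-on failures attach to the open incident, no page.
            self.open_incident.append(violation)

    def close_incident(self):
        # Until this is called, no new incident (or page) can open -
        # which is why leaving incidents open for months is dangerous.
        self.open_incident = None

policy = Policy("DBA team")
for v in ["MySQL-1 down", "MySQL-2 down", "DB-server-1 low memory"]:
    policy.record_violation(v)

print(len(policy.notifications))   # one page for the whole cascade
print(len(policy.open_incident))   # all three violations rolled up
```

Note how the model also shows the failure mode from tip 3: while that incident sits open, any new violation in the same policy is swallowed silently, so a stale incident masks future problems.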

5. Names matter

Once alerts have been configured and are going to the right people, the next objective is often reducing MTTR. Having a robust naming convention can really help in this regard. Often, the condition and entity names are going to be the first things that an engineer sees when investigating an issue, and as a result this will inform where they start their investigation.

For example, compare these two email subject lines:

  • ‘Ping monitor’ ‘Check failure’
  • ‘Storefront - String validation’ ‘String mismatch on …’

The first could really mean anything - it could indicate that the whole site has gone down, or just a specific page. This might lead an engineer to check the health of their backend apps & servers and then maybe to wonder if there’s a problem with their CDN when everything looks healthy. With the second subject though, it’s immediately apparent what the problem is (a recent deploy took away the string they were checking for). In this case they can simply update the monitor or disable the condition, resolving the problem in a minimal amount of time.

On a related note, less is very often more when it comes to channels for alert notifications. Once an engineer gets a notification we really want to minimise the number of distractions while they are investigating the problem. If you have notifications coming in by email, Slack, and PagerDuty all at the same time, this is only going to confuse things. If possible, pick just one source of notifications (such as PagerDuty) and stick with it.