Alert when multiple synthetic monitors are failing

Hi there.

I have a large suite of synthetic monitors, and I’m trying to create two levels of alert from them: a failure in a single monitor should raise a warning, while failures in multiple monitors at the same time should raise a critical.

The warning is simple enough: put the synthetic checks into an alert and fire the alert on a per-condition basis.

The critical is more difficult. I’ve tried the following NRQL query:

SELECT count(*) FROM SyntheticCheck WHERE result != 'SUCCESS'

The problem is that this only fires when two or more monitors fail at exactly the same time. My monitors run on their own schedules and don’t necessarily run at the same moment, so I can get multiple single-monitor failures without the combined critical alert ever firing.

I think the query above is looking at the current value, which is null whenever a monitor isn’t running. I want my alert to look at the latest result from each monitor rather than the current one, but I’ve been having trouble getting the latest() function to do what I want.
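For context, the sort of query I’ve been experimenting with looks roughly like this (the 30-minute window is just an illustrative value):

SELECT latest(result) FROM SyntheticCheck FACET monitorName SINCE 30 minutes ago

That shows me the latest result per monitor, but I haven’t worked out how to turn it into a numeric threshold that an alert condition can evaluate.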

How does New Relic implement the red/green statuses in the UI for synthetic monitors? Could I use the same technique to create my alert?

If not, how would I achieve what I want?

Hi @rsaunderson,

If I’m understanding you correctly, it sounds like you want to configure your alert policy so that a warning threshold is violated when one monitor fails and a critical threshold is violated when multiple monitors fail.

From the looks of things, it might be easiest to implement this sort of functionality using a multi-location Synthetics alert condition. This feature was added to our product recently and I believe you should be able to achieve your desired behavior by utilizing it.

Let me know if this solves your issue or if you’d like further assistance!

Hi @rsaunderson,

Upon further consideration, I found some resources that might assist you better here. You could use a Sum of query results threshold to define a condition that alerts on multiple failures across multiple monitors within a given time frame. While the example applies to a single monitor, the basic concept is illustrated further in this Relic Solution Post:

Using the query I posted above as an example, if your Synthetic monitor’s frequency was once every 5 minutes, you could set up a threshold like Sum of query results is less than 1 at least once in 5 minutes. This will keep a rolling total of SUCCESS results and only open a violation if there are no successes in the last 5 minutes.
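As a rough sketch, the condition query might look something like this, with 'My Monitor' standing in as a placeholder for one of your monitor names:

SELECT count(*) FROM SyntheticCheck WHERE monitorName = 'My Monitor' AND result = 'SUCCESS'

Paired with a Sum of query results threshold of less than 1 at least once in 5 minutes, a violation only opens when that monitor records no successful checks across the window.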

I would try adjusting your NRQL condition to follow this sort of format in order to fulfill your use case. Let me know if this helps!


Hi Masen, unfortunately the multi-location idea isn’t relevant, because we’re talking about failures in separate monitors rather than one monitor running in separate places.

I’m going to trial the Sum of query results alert condition. I’m also going to change the query from count(*) to uniqueCount(monitorName), which should count distinct failing monitors rather than letting multiple failures from the same monitor count toward the sum of query results.
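Roughly, the adjusted query I have in mind looks like this:

SELECT uniqueCount(monitorName) FROM SyntheticCheck WHERE result != 'SUCCESS'

I’m hoping that, paired with a Sum of query results threshold such as greater than 1, this will only open a critical when more than one distinct monitor has failed within the evaluation window.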

Thanks for your help.

Hey there @rsaunderson - wanted to check in and see if that query worked for you or if we should take another look together.

@hross - it’s better than it was, but ultimately we’re still seeing the same issue. I spoke with Ty Wilson from the UK support team, and he suggested that we need to track the incident rather than the success/failure status of the individual check. Unfortunately, while that data does exist in New Relic’s database, it isn’t publicly exposed, though he says that’s in the pipeline.

In the meantime, we’re working on a system to export synthetic check data so that we can aggregate it.

Ty has recommended that we add a webhook notification channel to the alert policy, which will write to a custom event table in NRDB. We can then use a synthetic monitor as, effectively, a piece of arbitrary serverless code rather than an actual synthetic check, to extract and aggregate the data.
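As a very rough sketch of that idea, if the webhook wrote a custom event for each failure notification (say, a hypothetical SyntheticFailureEvent type with a monitorName attribute), the aggregation step could run something like:

SELECT uniqueCount(monitorName) FROM SyntheticFailureEvent SINCE 30 minutes ago

The event type, attribute name, and time window here are just placeholders for illustration.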

We’re going to work on something similar to this, although not using his exact method.


Thanks for that info! I think Ty is on to something there in tracking incidents. I would caution you that, by default, alert preferences give you one incident per policy. So, if one synthetic monitor fails, it will trigger an incident. If another monitor then fails, it will be rolled up into that existing incident.

That is, until the first incident is closed. Once it is, any subsequent failure will create the next incident. And so that cycle will continue.

I would suggest reviewing your Incident Preferences so that you can get incidents for all failures: