Relic Solution: Alert on X Synthetics Failures in Y Minutes

In the past, the Synthetics alerting mechanism was limited. You would get an alert if your monitor failed from a single location. When there were network disruptions between a location and your application, this could cause “flapping” on the alert, where it is raised and cleared multiple times. Now, through the use of NRQL Alerts, you can create a scenario where an Alert is only sent after a certain number of failures.

Requirement: NRQL Alerts

For this to work you need access to NRQL Alerts, which is available with your Alerts subscription. To get started, go into your Alerts Policy and add a new condition of type NRQL.

Query Settings

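Since we are going to look at Synthetics failures, you want a query that counts SyntheticCheck events whose result is 'FAILED', filtered by monitorName to the monitor you care about.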

You should see a UI pop up that indicates if you have recently had any failures. My monitor just recently had a failure so I can see the blip on the graph right after 01:30 PM.

Threshold Settings

Here is where we can supply the “X” failures and “Y” minutes. This is what you need to set:

  • You must change the first dropdown to say “sum of query results”, so that Alerts keeps adding +1 every time you get a failure. You must also change the threshold dropdown to “above”, since we normally expect to have 0 failures.
  • Lastly, put in your value of “X” failures (3 in my example) and “Y” minutes (15 minutes in my example).
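
To make the “X failures in Y minutes” rule concrete, here is a minimal Python sketch of the idea. It is a simplified model, not how New Relic Alerts evaluates conditions internally. The function name `breaches_threshold` and the sample timestamps are hypothetical, and it assumes “above” means strictly greater than X.

```python
from datetime import datetime, timedelta


def breaches_threshold(failure_times, x_failures, y_minutes):
    """Return True if more than x_failures fall inside any y_minutes window.

    Illustrative only: a simplified model of "sum of query results above X
    for Y minutes", not New Relic's actual evaluation logic.
    """
    window = timedelta(minutes=y_minutes)
    times = sorted(failure_times)
    start = 0
    for end, current in enumerate(times):
        # Drop failures that fall outside the window ending at `current`.
        while current - times[start] >= window:
            start += 1
        if end - start + 1 > x_failures:
            return True
    return False


# Hypothetical failure timestamps for one monitor.
base = datetime(2024, 1, 1, 13, 30)
flapping = [base, base + timedelta(minutes=20)]
outage = [base + timedelta(minutes=m) for m in (0, 3, 6, 9)]

print(breaches_threshold(flapping, x_failures=3, y_minutes=15))  # False
print(breaches_threshold(outage, x_failures=3, y_minutes=15))    # True
```

Two isolated failures 20 minutes apart never add up to more than 3 in one window, so they would not raise an alert, while four failures within 9 minutes would.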

Finally, enter a name for your condition. This name will be used in the Alert Notifications.

Notification Result

In my testing the alert notification looked like this. Note that the name “Multi Location failed 3 times in 15 minutes” was the name I put in for my condition.

When you click on View Incident Details you see a screen like this that shows the SUM of the number of failures:

And if you click on that button that says “Go to SyntheticCheck Overview” it will take you to a query. Note that this does not show the SUM, but shows the TIMESERIES of the individual errors.

If you want the list of the exact error messages, you could change your query to select the error attribute of the failed checks instead of counting them.

Or you could navigate to a dashboard that has details on the recent failures (I added a filter for my specific monitor that is going haywire).

Some things to consider:

  • Consider how frequently your monitor runs and the number of locations it runs from, compared with the “Y” minutes. If your monitor runs every 5 minutes from just 1 location, you can get a maximum of 1 failure per 5 minutes. Do the math quickly when setting this up, and test, test, test!
  • Also remember to change the dropdown to “sum of query results”.
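
The math from the first point can be sketched in a few lines of Python. This is a rough upper bound under stated assumptions: the helper `max_checks_in_window` is hypothetical, it assumes each location runs the monitor once per frequency interval, and it treats the window as starting exactly on a run. Real scheduling is only approximate.

```python
import math


def max_checks_in_window(frequency_minutes, locations, window_minutes):
    """Upper bound on monitor runs (and so on failures) in one window.

    Assumes every location runs once per frequency interval and that the
    window starts on a run; actual timing can drift.
    """
    runs_per_location = math.ceil(window_minutes / frequency_minutes)
    return runs_per_location * locations


# One location every 5 minutes: at most 3 runs in 15 minutes, so a
# threshold of "above 3" could never be breached.
print(max_checks_in_window(5, 1, 15))  # 3

# Five locations every 5 minutes: up to 15 runs in 15 minutes.
print(max_checks_in_window(5, 5, 15))  # 15
```

If the bound is not larger than your “X”, the condition can never fire, so either lower “X”, lengthen “Y”, or add locations.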

What else?
NRQL Alerts are brand new, and we know our customers are brilliant at using our products in interesting ways. What have you discovered you can do with NRQL Alerts and Synthetics?

Fantastic share! Exactly what I was after!

@kahrens Thank you so much for posting this write up! We’ll give it a try and will let you know if it improves our “flapping” synthetic alerts. It would still be nice to have these thresholds right on the Synthetic Alerts. But this is a great option in the meantime.

Thanks!!!

Please note that this does not ensure failures from multiple monitoring locations before alerting. Only that there are multiple failures from any location. It’s not well integrated, but a workaround using uniqueCount can be found on this page: https://discuss.newrelic.com/t/synthetics-pings-failed-from-one-location-but-were-successful-from-others-feature-idea/41535/27
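
To illustrate the difference between “multiple failures” and “failures from multiple locations”, here is a small Python sketch. The function `failure_summary`, the `location` key, and the sample data are all hypothetical; it only shows why a plain failure count cannot tell the two situations apart, while a distinct count of locations can.

```python
def failure_summary(failed_checks):
    """Return (total failures, number of distinct failing locations)."""
    locations = {check["location"] for check in failed_checks}
    return len(failed_checks), len(locations)


# Hypothetical failed checks within one evaluation window.
one_bad_location = [{"location": "Tokyo"}] * 4
widespread = [{"location": name} for name in ("Tokyo", "London", "Dallas", "Sydney")]

print(failure_summary(one_bad_location))  # (4, 1)
print(failure_summary(widespread))        # (4, 4)
```

Both cases produce 4 failures and would breach an “above 3” threshold on the sum, but only the second involves more than one location.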

This is really a long-awaited solution, but to be honest I’m struggling to understand it and my tests have not been successful yet.

My doubts: If the monitor runs every 5 mins, then how could it fail more than 3 times within 15 mins? I would expect that the threshold should be above 2. Or where does my reasoning go wrong?

I tried to set up a monitor and alert condition as explained here, but it never fires. I also tried testing for Success with a reverse condition, but that never fired either.

The monitor is here: https://synthetics.newrelic.com/accounts/1536085/monitors/6c188dc6-09c0-45f6-adcb-ad9d6ec3acb4 and the alert policy is here: https://alerts.newrelic.com/accounts/1536085/policies/380107

What did I miss?

Kind regards
Wolfgang

Hey @wkiNewRelic - I’m hoping you got this figured out by now, sorry this post was missed.

Essentially, if you have a monitor frequency of 5 minutes, and select 5 monitoring locations, then your monitor will run (on average, exact timing can’t be guaranteed) once per minute. Therefore within a 15 minute period you can have up to 15 monitor runs.

As Ken posted in the original solution here, when deciding what condition thresholds work for you, you should really consider the running frequency of the monitor, and the number of locations that monitor is running from.

Hi Ryan, thanks for asking. Yes, I got it running in the meantime; the logic was clear from the beginning. But I had to make one slight change: I had to add one minute to the time period, so I made it 16 instead of 15. This led to the expected result.

Kind regards
Wolfgang

Glad you got it working - thanks for confirming :smiley:

Hello Explorers Hub!

If you’re creating a NRQL alert condition to cover your Synthetics monitors, it may also help to read the post about creating well-behaved NRQL alert conditions.

Happy alerting!

Hi folks! I wanted to share an update that Multi-location Synthetics Failure Alerts are now GA

Check out our docs here:

https://docs.newrelic.com/docs/alerts/new-relic-alerts/defining-conditions/multi-location-synthetics-alert-conditions

Thanks for sharing this! It will be super helpful.

Is there a way to set up this one condition for use across all our synthetic monitors? We’re monitoring 50 pages, and setting up 50 alert queries/conditions seems excessive.

Can we leave out the monitorName in the query, but still have the alert tell us which monitor went down somehow?

Edit: For anyone else wondering this, as a stopgap, you can add a dashboard in Insights using something like the following query:

SELECT monitorName, error FROM SyntheticCheck WHERE result = 'FAILED'

That way you’ll at least have a place to see a list of the recent errors. The alert notification still won’t tell you which monitor caused it, though :frowning:

@rtuan - you can add a facet to the query!

Something like:

SELECT count(result) FROM SyntheticCheck WHERE result = 'FAILED' FACET monitorName

This will trigger a violation for any monitor that breaches the threshold, and new violations will be created for other monitors that fail too.