Synthetics concurrent monitor failure alerting using NRQL Alerting

Background
A common industry practice for Synthetics is to trigger an alert when a monitor fails multiple times in a row across all locations. New Relic Synthetics doesn’t currently provide an out-of-the-box alert condition for this.

Multi-location Failure Alerting
Synthetics does have a canned alert for multi-location failures. This triggers an alert when a monitor, or set of monitors, fails from a specified number of locations. See the Synthetics documentation for more details.

Concurrent Monitor Failure Alerting
Although there is no out-of-the-box alert that triggers when a monitor fails x out of y times, we can accomplish this with NRQL alerting.

NRQL Alerting Option
Follow these steps to set up a NRQL alert condition that supports concurrent monitor failure alerting:

Create a NRQL alert condition
Follow the steps in our docs to create a NRQL alert condition in the same account where the Synthetics monitor runs.

NRQL alert configuration
Configure the NRQL alert to use the following NRQL:

SELECT count(*) FROM SyntheticCheck WHERE monitorName = '[Monitor_name]' AND result = 'FAILED'

Note

  • If you want the condition to evaluate failures for each location separately, add ‘FACET location’ at the end of the NRQL.
  • To track monitor failures across all locations combined, keep using the NRQL above (no facet).
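For anyone scripting their conditions, the two variants above can be sketched as a small helper. This is only an illustration: the function name `build_failure_query` and its `per_location` flag are made up for this example and are not part of any New Relic tooling. The monitor name is inserted as-is, so a name containing a single quote would need escaping first.

```python
def build_failure_query(monitor_name: str, per_location: bool = False) -> str:
    """Return the failed-check count query for a single monitor.

    Hypothetical helper: it only assembles the NRQL string from the post.
    """
    query = (
        "SELECT count(*) FROM SyntheticCheck "
        f"WHERE monitorName = '{monitor_name}' AND result = 'FAILED'"
    )
    if per_location:
        # Evaluate each location on its own instead of all locations combined.
        query += " FACET location"
    return query


if __name__ == "__main__":
    print(build_failure_query("API health check"))
    print(build_failure_query("API health check", per_location=True))
```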

Threshold configurations
Set the “when the” drop-down to “sum of query results is” and “above”. This bases the alert on the total number of failed checks in the time window.

Critical values
Enter the number of check failures that must occur, and the time window in which they must occur, before the condition triggers.

Important considerations
Consider how often the monitor runs and from how many locations. If a monitor runs every minute from 3 locations, this equates to 3 checks per minute. Changing the number of failures and the time window changes your threshold chart. It’s recommended to adjust both until the chart shows clear spikes when the monitor fails.
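The arithmetic here is easy to get wrong, so here is a rough sketch of it. The function names are invented for this example, and it assumes every location runs the check exactly once per period. Because the condition uses “above”, the value to enter is one less than the number of failures you want to alert on.

```python
def checks_in_window(period_minutes: int, locations: int, window_minutes: int) -> int:
    """Total checks expected in the window, assuming one run per location per period."""
    return (window_minutes // period_minutes) * locations


def above_threshold(consecutive_rounds: int, locations: int) -> int:
    """Value for an 'above' threshold so that it trips only when every
    location has failed `consecutive_rounds` times within the window."""
    failures_needed = consecutive_rounds * locations
    return failures_needed - 1


if __name__ == "__main__":
    # The example from the post: every minute from 3 locations.
    print(checks_in_window(period_minutes=1, locations=3, window_minutes=1))
    # Two full rounds of failures across all 3 locations.
    print(above_threshold(consecutive_rounds=2, locations=3))
```

With a 1-minute monitor in 3 locations, two failed rounds in a row means 6 failed checks, so the condition would be “above 5” over a 2-minute window.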

Advanced signal settings
Under “fill data gaps with”, select “custom static value” and set it to 0. This ensures that the alert auto-resolves when failures no longer occur.


I’m trying to do something similar to what you outlined here. I have an API health check synthetic that runs every 5 minutes. If it fails once, we have it post to a Slack channel (that works fine, since we don’t need the NRQL query there). The second alert is the one I’m having issues with: if the monitor fails twice in a row, we want our on-call to be paged.

I set it up so that the sum of query results must be above 1 in a 10 minute timeframe. In the advanced options, I set the aggregation window to 5 minutes (since the synthetic is on a 5 minute cycle) and the offset evaluation window to 1. We then do the same thing you did here and fill data gaps with “Custom static value” 0. From the chart, the alert fires when I expect, but it never auto-resolves, even though the chart makes it look like it should. Is this an issue with how I configured it, or could it be a bug in New Relic?

This is a result of the recent change to streaming alerts for NRQL conditions. This can be addressed by configuring a loss of signal threshold on your alert condition to close all current open violations if you don’t see a result that matches your query within a certain time period. Typically I recommend setting this to the same value as your “at least once in” setting, which would be 10 minutes based on your screenshot.
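To make the timing concrete, here is a toy model of the setup described above: a 5-minute monitor, a 10-minute window, and a threshold of “above 1”. It is not how New Relic evaluates conditions internally. Both functions are hypothetical and only illustrate when the failure count drops below the threshold and when a 10-minute loss of signal setting would consider the signal gone.

```python
def failures_in_window(failure_minutes, now, window_minutes):
    """Count failed checks in the window (now - window_minutes, now]."""
    return sum(1 for t in failure_minutes if now - window_minutes < t <= now)


def signal_lost(failure_minutes, now, expiration_minutes):
    """True when no failed check has matched the query for the whole
    expiration period (toy model of a loss of signal threshold)."""
    seen = [t for t in failure_minutes if t <= now]
    return not seen or now - max(seen) >= expiration_minutes


if __name__ == "__main__":
    failures = [0, 5]  # the monitor failed at minute 0 and again at minute 5
    for now in (5, 10, 15):
        count = failures_in_window(failures, now, window_minutes=10)
        lost = signal_lost(failures, now, expiration_minutes=10)
        print(f"minute {now}: failures={count}, fires={count > 1}, signal_lost={lost}")
```

In this model the count falls back below the threshold at minute 10, and the loss of signal period expires at minute 15, which is when the open violation would be closed.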