Creating one alert condition for multiple Synthetics

Hello!

I would like to create an alert condition that meets the following criteria:

  • Opens one issue per incident
  • Can open an issue for multiple monitors
  • Will automatically close an issue when a single monitor resolves itself

Unfortunately, I’m having some trouble. The primary thing I’m trying to avoid is having to create an alert condition per synthetic monitor, but I’m not seeing how that is possible.

Here’s what I have:

The permalink: https://onenr.io/0yw4pOqPLw3

The NRQL query:

SELECT count(result) FROM SyntheticCheck WHERE result = 'FAILED' and tags.environment = 'qa' FACET monitorName

The rest of the settings:

  • Condition threshold: critical above or equals 1 at least once in 10 minutes
  • Window duration: 1 minute
  • Streaming method: Event flow
  • Delay: 0 minutes
  • Gap-filling strategy: custom static value of 0

The behavior I’m seeing is that a critical issue is opened when a single monitor fails, but that issue stays open while other issues are open, even after the synthetic that opened the original issue has stopped failing.

Is there something I’m missing, or do I need to go the route of having a single alert condition per synthetic monitor?

Thanks!

Hi @Andrew.Hill

Since you are using a NRQL condition, you can target multiple monitors at the same time (up to 5000 discrete Synthetic monitors at condition creation!). Each will be tracked separately from all your other Synthetic monitors, so each can open and close incidents on its own.

You mentioned that you’re using

SELECT count(result) FROM SyntheticCheck WHERE result = 'FAILED' and tags.environment = 'qa' FACET monitorName

This query, when used in an alert condition, will return a result of 1 if a monitor fails, and a result of NULL if a monitor does not fail. Why no 0 value? Because of the query order of operations, which you can read about at this link.
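
To make the order-of-operations point concrete, here is a minimal Python sketch (not New Relic code) of one aggregation window. The monitor names and check data are made up for illustration. The first function drops non-failed rows before counting, which is roughly what the WHERE clause does, so a healthy monitor produces no value at all. The second counts failures inside the aggregation, so every monitor keeps a value, including 0. My understanding is that NRQL's filter() function gives the second shape, but treat that as an assumption and check it against the documentation before relying on it.

```python
from collections import defaultdict

# Hypothetical check results for one aggregation window: (monitorName, result).
checks = [
    ("login-page", "FAILED"),
    ("checkout", "SUCCESS"),
    ("checkout", "SUCCESS"),
]

def count_after_where(rows):
    """Mimic WHERE result = 'FAILED' followed by count(): rows are dropped first."""
    counts = defaultdict(int)
    for monitor, result in rows:
        if result == "FAILED":  # the filter runs before aggregation
            counts[monitor] += 1
    return dict(counts)

def count_with_inner_filter(rows):
    """Mimic counting failures inside the aggregation: every facet keeps a value."""
    counts = {}
    for monitor, result in rows:
        counts.setdefault(monitor, 0)  # the facet exists even with no failures
        if result == "FAILED":
            counts[monitor] += 1
    return counts

where_first = count_after_where(checks)
inner_filter = count_with_inner_filter(checks)
print(where_first.get("checkout"))   # None: nothing to evaluate for this monitor
print(inner_filter.get("checkout"))  # 0: a value the threshold can be compared to
```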

Since your query is not returning 0 values when a monitor stops failing, the incident can’t close on its own (NULL can’t be evaluated numerically to determine if an incident should be closed). In order to get incidents to close, you will need to use a Loss of Signal setting to “close all open violations.” Looking at your condition, I see you have this already, but I’m not sure when you added it – it should work going forward to close your incidents in a timely manner.
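
As a rough illustration of how a Loss of Signal expiration closes incidents, here is a small Python sketch. It is a simplified model, not the actual alerting engine: the function name, the monitor names, and the timestamps are all hypothetical. An incident is closed once its monitor has sent no data points for at least the expiration period.

```python
def close_on_signal_loss(open_incidents, last_seen, now, expiration_seconds):
    """Return the incidents still open after applying a loss-of-signal expiry.

    open_incidents: set of monitor names with an open incident
    last_seen: dict mapping monitor name -> time (seconds) of its last data point
    """
    still_open = set()
    for monitor in open_incidents:
        silent_for = now - last_seen[monitor]
        if silent_for < expiration_seconds:
            still_open.add(monitor)  # signal is recent, so keep the incident open
    return still_open

# Hypothetical state: both monitors failed earlier; only "checkout" is still failing.
open_incidents = {"login-page", "checkout"}
last_seen = {"login-page": 0, "checkout": 840}
remaining = close_on_signal_loss(
    open_incidents, last_seen, now=900, expiration_seconds=600
)
print(remaining)  # {'checkout'}
```

With a 600-second (10-minute) expiration, the monitor that stopped producing FAILED results is closed on its own, while the one that is still failing stays open.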

One other piece of advice I would give is to use Event Timer as a streaming aggregation method. This method works better when you have intermittent and inconsistent data points, like Synthetic failures. If you’d like to learn more about how the streaming aggregation methods work, take a look at this document.
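
The difference between the two methods can be sketched in Python. This is a simplified model based on my reading of how the methods are described, not New Relic's implementation; the function names, the 60-second timer, and the sample data are assumptions. Event flow only emits a window once a later data point arrives, while Event Timer emits it when a per-window timer runs out.

```python
WINDOW = 60  # window duration in seconds, matching the 1-minute setting above

def window_start(ts):
    return ts - ts % WINDOW

def event_flow_emit_times(points, delay=0):
    """points: list of (arrival_time, event_timestamp), sorted by arrival_time.

    A window is emitted when a later point arrives whose timestamp is past
    the window's end plus the delay. Windows never passed stay pending (None).
    """
    emitted = {}
    for _, ts in points:
        emitted.setdefault(window_start(ts), None)
    for arrival, ts in points:
        for start in emitted:
            if emitted[start] is None and ts >= start + WINDOW + delay:
                emitted[start] = arrival
    return emitted

def event_timer_emit_times(points, timer=60):
    """Each arrival restarts its window's timer; the window is emitted on expiry.

    Simplification: arrivals after a window's timer has expired are not modeled.
    """
    last_arrival = {}
    for arrival, ts in points:
        last_arrival[window_start(ts)] = arrival
    return {start: arrival + timer for start, arrival in last_arrival.items()}

# Hypothetical sparse failures: one at t=5s, the next not until t=1805s.
points = [(5, 5), (1805, 1805)]
flow_times = event_flow_emit_times(points)
timer_times = event_timer_emit_times(points)
print(flow_times)   # {0: 1805, 1800: None}
print(timer_times)  # {0: 65, 1800: 1865}
```

In this model the first failure is not evaluated under Event flow until the next data point shows up half an hour later, whereas Event Timer evaluates it about a minute after it arrives.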

I encourage you to test these new settings out. If you have Event Timer set and a Loss of Signal configured to close your incident 10 minutes after the signal is lost, but you’re still getting incidents that won’t close, let us know and we can get a support engineer involved right from this thread.

Thank you for the quick and detailed response!

I did indeed add the Loss of Signal setting after I posted this question, but I had stepped away for the day so I didn’t have a chance to validate it was doing what we expected. It does appear to be doing what we want now.

I’ll take a look at using the Event Timer streaming aggregation method as well. Thanks for the reference documentation!
