Raise an alert when more than 1 location of a monitor fails for over 15 minutes. Please advise.
To help with this, I’ll need to see the query. Could you please post a link to the alert condition you’re referring to? Don’t worry, only people who have access to your account will be able to use the link.
Sure, this is the query: https://one.newrelic.com/launcher/nrai.launcher?pane=eyJuZXJkbGV0SWQiOiJjb25kaXRpb24tYnVpbGRlci11aS5jb25kaXRpb24tZWRpdCIsIm5hdiI6IlBvbGljaWVzIiwicG9saWN5SWQiOjEwMzc5ODYsImNvbmRpdGlvbklkIjoxNTk0NzM4OH0=&sidebars=eyJuZXJkbGV0SWQiOiJucmFpLm5hdmlnYXRpb24tYmFyIiwibmF2IjoiUG9saWNpZXMifQ==&platform[accountId]=1985766 Thanks a lot!
Thank you for sending that over! See my suggestions below, but also scroll to the bottom of this post if you’d like to try an entirely different (and probably much simpler) solution.
I notice in that alert condition you have Evaluation Offset set to only 1 minute. Synthetics checks can take as long as 3 minutes to fail (there’s a hard-coded 3-minute timeout) and usually take longer than 1 minute to run, especially if they’re failing. A 1-minute Evaluation Offset is one of the reasons I think this isn’t working correctly.
I checked our backend to see whether any events were being left out of evaluation on this condition, and over the past 2 weeks many events were missed because they arrived too late. This article goes into some detail about what latent data is and how it can affect NRQL alert conditions. I would recommend setting Evaluation Offset to 5 minutes.
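To make the effect of late data concrete, here is a minimal Python sketch of a simplified model. It assumes a window is evaluated once, at its end time plus the Evaluation Offset, and that any event arriving after that moment is skipped. The function name and parameters are hypothetical and this is not how the alerts pipeline is actually implemented.

```python
def is_evaluated(event_minute, ingest_delay_min, offset_min, window_min=1):
    """Return True if an event arrives before its window is evaluated.

    Simplified model (an assumption, not the real pipeline):
    - the event belongs to the aggregation window containing its timestamp
    - that window is evaluated at window end + Evaluation Offset
    - the event arrives at timestamp + ingest delay
    """
    window_end = (event_minute // window_min + 1) * window_min
    evaluated_at = window_end + offset_min
    arrival = event_minute + ingest_delay_min
    return arrival <= evaluated_at


# A check that starts at minute 0 and hits the 3-minute timeout
# cannot be reported before minute 3.
print(is_evaluated(0, ingest_delay_min=3, offset_min=1))  # False
print(is_evaluated(0, ingest_delay_min=3, offset_min=5))  # True
```

Under this model, a 1-minute offset drops every timed-out check, while a 5-minute offset leaves room for the timeout plus some ingest delay.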
Synthetic check frequency
Another problem is that you’re evaluating once per minute, but your threshold is set to "Query returns a value above 1 for at least 15 minutes". The Synthetic monitor you’re targeting in this case only runs a check once every 5 minutes (cycling between 3 different locations so that 3 checks are run every 15 minutes). This means that there will never be 15 consecutive minutes where the returned value will remain above 1. In fact, since the query is running once per minute but the Synthetic checks happen only once every 5 minutes or so, the query can’t ever return a value above 1 in a single minute.
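A small Python sketch of the arithmetic above, assuming one check every 5 minutes and that every check fails (the worst case). The helper names are made up for illustration.

```python
def per_minute_failures(total_minutes, check_interval=5):
    """Failure count in each 1-minute window when every check fails."""
    return [1 if m % check_interval == 0 else 0 for m in range(total_minutes)]


def longest_streak_above(values, threshold):
    """Length of the longest run of consecutive values above threshold."""
    best = run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        best = max(best, run)
    return best


counts = per_minute_failures(60)
print(max(counts))                      # 1
print(longest_streak_above(counts, 1))  # 0
```

Even with every check failing for an hour, no single minute exceeds 1, so a threshold of "above 1 for at least 15 minutes" can never be met with a 1-minute window.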
There are several solutions here. You could raise your Aggregation Window value to 15 minutes, so that all 3 checks are being considered every time the system evaluates the alert condition. If you want to know whether 2 or more locations failed over the course of 15 minutes, this is probably your best bet. Alternatively, what might work is using "Sum of query results" in your threshold, which will keep a rolling total of query results over a period of time (say, 15 minutes).
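Here is a rough Python sketch of why a 15-minute total behaves differently from 1-minute windows. It assumes three locations (hypothetically named A, B and C) checked in rotation every 5 minutes, and a plain rolling sum over the last 15 one-minute values. The real threshold logic may differ in detail.

```python
LOCATIONS = ["A", "B", "C"]  # hypothetical location names


def failures_per_minute(total_minutes, failing, check_interval=5):
    """Per-minute failure counts when locations are checked in rotation."""
    counts = []
    for m in range(total_minutes):
        if m % check_interval == 0:
            loc = LOCATIONS[(m // check_interval) % len(LOCATIONS)]
            counts.append(1 if loc in failing else 0)
        else:
            counts.append(0)
    return counts


def rolling_sum(values, window):
    """Sum of the last `window` values at each position."""
    return [sum(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


two_failing = rolling_sum(failures_per_minute(60, {"A", "B"}), 15)
one_failing = rolling_sum(failures_per_minute(60, {"A"}), 15)

# Skip the first 14 minutes, where the window is not yet full.
print(all(v > 1 for v in two_failing[14:]))  # True
print(max(one_failing))                      # 1
```

With two locations failing, every full 15-minute total is above 1. With only one location failing, it never is.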
Getting violations to close
Finally, when things get back to normal and you’re having successes again, your violation will not close on its own. The reason for this is detailed in this article. The solution is to configure Loss of Signal (LoS) in this condition, so that after a certain amount of NULL values (say 15 minutes), open violations will close.
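The following Python sketch shows the idea in a deliberately simplified form. It assumes that a minute with no matching data is represented as None, that a NULL value neither opens nor closes a violation by itself, and that LoS closes the violation after a given number of consecutive NULL minutes. The function and its parameters are hypothetical.

```python
def violation_open_after(values, threshold, los_minutes=None):
    """Return True if a violation is still open after processing values.

    values: one entry per minute, None meaning no data (NULL).
    los_minutes: consecutive NULL minutes after which Loss of Signal
                 closes the violation; None means LoS is not configured.
    """
    is_open = False
    nulls = 0
    for v in values:
        if v is None:
            nulls += 1
            if los_minutes is not None and nulls >= los_minutes:
                is_open = False  # LoS expiration closes the violation
        else:
            nulls = 0
            is_open = v > threshold  # simplified: one value opens or closes
    return is_open


# One breaching value, then 20 minutes with no failures reported at all.
timeline = [2] + [None] * 20
print(violation_open_after(timeline, threshold=1))                  # True
print(violation_open_after(timeline, threshold=1, los_minutes=15))  # False
```

Without LoS the violation stays open indefinitely, because no value ever arrives to clear it.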
- Set Evaluation Offset higher. Synthetics failures can take as long as 5 minutes to be ingested, depending on whether they time out and if there are any transient network issues. Setting this higher will result in a more reliable alert condition. I recommend 5 minutes, although 4 may work.
- Use a larger Aggregation Window or a "Sum of query results" threshold, since your Synthetics monitor is not running every minute (note that if you raise Aggregation Window, you can use a smaller value for Evaluation Offset, but I would not set it any lower than 2 so that you are sure to catch all of the results every time an evaluation happens).
- Set up a LoS so that any open violations can close on their own once the problem is addressed.
An alternative (and probably easier) method
It sounds like you want to raise a violation when your Synthetic monitor fails from more than one location. We have an out-of-the-box condition type to handle exactly that use-case! Take a look at Multi-location Synthetic monitoring alert conditions. As an added benefit, they require MUCH less micromanaging than NRQL alert conditions, so it should be much easier to set one up and have it just work, right from the start.
I hope my suggestions help. Let us know how it goes!
First of all, thank you so much for the elaborate suggestion. I did look into the Multi-location Synthetic monitoring alert condition. I want an alert to be raised if 2 or more than 2 locations fail for over 15 minutes. It doesn’t allow me to specify the X minutes as a part of the alert condition.
Hopefully you were able to get your alert condition opening violations properly.
I want an alert to be raised if 2 or more than 2 locations fail for over 15 minutes.
Each location only gets pinged once every 15 minutes (the way the Synthetics monitor is now configured), so one single failure from two separate locations should be enough to set it off – this is the behavior you get with Multi-Location Synthetics conditions, only with much less micromanagement.