Relic Solution: Creating Well-Behaved Faceted NRQL Alert Conditions

Hello there,

I attempted to follow the process outlined here to recreate this sort of test and I’m not seeing consistent results. My scenario:

However, what I’m seeing is that the results sometimes jump to 2 and fall to 0, resulting in false positives. This is not ideal. Did I misunderstand how this is supposed to work?

Sorry, that was the wrong account to post the question with. This is my correct account, but the same questions.

Hi @adrian.stratienco

I took a look at the monitor you linked as well as the adri6675 - Synthetic Test alert condition. I think I see what is happening here.

  1. Synthetics monitors don’t run at a precise cadence. So, if you’re using a single-check-every-5-minutes monitor, sometimes the monitor will run a little before or a little after the 5-minute mark.
  2. Synthetics monitor checks are not instant – depending on many factors (“internet fuzziness”), they can take as long as 3 minutes to run. In fact, a failure can take a full 3 minutes to time out in extreme cases.

To take these delays into account, I would recommend using a 'sum of query results' threshold over 8 minutes. That way the sum only drops to 0 if a legitimate failure occurs.
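
For reference, here is one way such a condition could look (a minimal sketch, not your exact condition: the monitor name is a placeholder, and the attribute names assume the standard SyntheticCheck event):

SELECT count(*) FROM SyntheticCheck WHERE monitorName = 'Your Monitor Name' AND result = 'SUCCESS' FACET location

with the threshold set to sum of query results over 8 minutes, opening a violation when the sum drops below 1.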

I hope this helps!


Thanks, that does seem to help.

Just so I understand clearly how to use this technique, I have a few more questions.

  1. I noticed that there is a delay between when the synthetic monitor reports a failure and the alert policy triggers an event. Is this because the evaluation period needs to finish before the results can be evaluated?

  2. The NRQL alert graph often seems to indicate a success count of 2, even with a single check location running every 5 minutes and an 8-minute evaluation period. With the success check of > 0, is it possible that a success and a failure could overlap and essentially "miss" the failure?

Hi @adrian.stratienco

I noticed that there is a delay between when the synthetic monitor reports a failure and the alert policy triggers an event. Is this because the evaluation period needs to finish before the results can be evaluated?

That is exactly right. The evaluation system only looks at a single minute at a time, and it waits for that minute to roll around. Once it gets there and sees the threshold breached, it opens a violation. This results in a slight delay, but also in a much more reliable alert condition, since you can be sure the data (whether passing or failing) has arrived and been slotted into the proper timestamp.

The NRQL alert graph often seems to indicate a success count of 2, even with a single check location running every 5 minutes and an 8-minute evaluation period. With the success check of > 0, is it possible that a success and a failure could overlap and essentially "miss" the failure?

Although that sounds like a possible edge case (imagine the first check starting a tiny bit late and running long but succeeding, then the second check starting early, taking almost no time, but failing), the first success will drop off after 8 minutes, leaving the count at 0 for at least 1-2 minutes before another check runs. Since a single minute with a count of 0 will cause a violation, 8 minutes should be the sweet spot: you avoid false positives but still catch a failure when it happens.


NRQL alert conditions are awesome because they are so versatile, but in this particular case it may be less hassle to simply set up a Synthetics alert condition for this one monitor (it looks like there’s only a single facet on the NRQL alert condition). However, I don’t know enough about your use case to make that call. Either way, these changes should result in an alert condition that properly opens a violation when it needs to and remains silent the rest of the time.

Hello,
I wanted to use this method to create a monitor which checks several URLs, and then have one alert condition to check for failures, where the facet would be the failing URL. I tried to check in a negative way, as proposed (no failure found; result is < 1), and in a positive way (failure found; result is > 0). But if I trigger an error by putting a wrong URL in the script, neither of the conditions fires an alert. However, if I check the NRQL in the data explorer, the queries deliver results as expected.

The monitor is: https://synthetics.newrelic.com/accounts/1536085/monitors/5ae848b9-5c45-46f9-9dc5-c994f232f1a2 and the alert conditions are in this policy https://alerts.newrelic.com/accounts/1536085/policies/493852

NRQL queries used to check in Insights:

SELECT count(*) FROM SyntheticRequest where monitorName = '1_MonitorTest' and hierarchicalURL NOT LIKE '%insights%' and (responseCode > 400 or responseCode < 0) facet hierarchicalURL since 4 minutes ago

and, respectively:

SELECT count(*) FROM SyntheticRequest where monitorName = '1_MonitorTest' and hierarchicalURL NOT LIKE '%insights%' and (responseCode >= 200 and responseCode < 400) facet hierarchicalURL since 4 minutes ago

If you look into this, you may trigger a failure by modifying one of the URLs in the script.

Would really appreciate your help, as it would be a great benefit if we could get this working.

Hi @wkiNewRelic

I think the problem here is the 'for at least 5 minutes' part of the threshold. Since there is usually no data coming from any facet (at least, there was none when I looked at the conditions), the conditions are seeing a lot of null values. In order to open a violation, 5 consecutive minutes of non-null data need to breach the threshold.

If you are just testing these conditions for now, try setting the threshold to 'at least once in 5 minutes' and test again. Alternatively, you can configure the monitor you’re using for testing so that it generates at least 5 consecutive minutes of threshold-breaching data (if it doesn’t send a data point during even a single minute, the 5-minute timer restarts).
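
To see whether each facet is actually producing a data point every minute (and where the null gaps are), a query roughly like this might help; it is only a sketch that reuses the monitor name and attributes from your own queries above:

SELECT count(*) FROM SyntheticRequest WHERE monitorName = '1_MonitorTest' FACET hierarchicalURL TIMESERIES 1 minute SINCE 30 minutes ago

Any empty minutes in that chart are the gaps that will reset the 5-minute timer.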

I hope this helps – keep us posted with your progress!


Hi @Fidelicatessen,

Thanks for your reply. There was a basic flaw in both conditions: the time threshold in the condition of course depends on the frequency of the monitor runs. I have now aligned this; the monitor runs every 10 minutes and the threshold in the conditions is 11 minutes. I also changed the conditions to 'at least once'.

But the behaviour is still not fully clear to me. Both conditions fire now, although not completely as I expected.

The condition CheckFailureNegative, which fires when there is a count < 1 for a successful run, fires very fast. But it has one flaw: it actually reports the URLs which work, and not the one which fails. It also has the drawback that if monitor runs are irregular and the gap is more than 11 minutes, it fires as well.

The condition CheckFailurePositive, which fires when there is a count > 0 for an unsuccessful monitor run, fires with a serious delay, which might be due to the 11-minute threshold. But, as you explained, it does not close properly, even when all tests are successful again. I suspect that it keeps checking for the failing URL, which of course never comes up again, as the means of inducing the failure was to corrupt one of the URLs in the script.

So the CheckFailureNegative method is preferable. But it would mislead the users by not showing the actual failing URL in the alert email.

I documented the tests in the attached file; it also contains links to the alerts etc. Maybe you can have a look at this and help to get the alert notification correct.

Thank you.

TestAlertConditions.pdf (27.9 KB)

Hi @wkiNewRelic

This entire process is made simpler by only using a single URL in every monitor. However, I’ve been digging into your questions here and will try to address them.

It actually reports the URLs which work, and not the one which fails

I’m unsure why this is happening, but you may want to open a support ticket if it continues (or let me know if you’d like to go that route and I’ll open one for you from this thread, just to deal with this specific problem).

it also has the drawback that if monitor runs are irregular and the gap is more than 11 minutes, it fires as well

The alerts evaluation system really depends on data coming in at least once per minute. We can get around that somewhat by using 'sum of query results' thresholds, but when data comes in late, this method falls down; the most robust solution is to ensure that data is coming in every minute.

Ultimately, from what I can tell, because of the way you’re faceting (on URL addresses), your testing introduces brand-new facets which immediately fail. The result is a facet that was never tracked before (since it did not exist until you changed the URL) and that immediately starts reporting NULL values where the conditions expect data. What may work better for testing is to simulate a bad HTTP response code in your script for a facet (URL) that is already being tracked and is already producing positive, non-NULL values. Of course, this only applies to the CheckFailurePositive alert condition, since it measures failures.
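
For example, to see which facets (URLs) are already being tracked and what they currently report, a rough query like the following (again reusing the monitor name and attributes from your own queries, so treat it as a sketch) splits successes and failures per URL:

SELECT filter(count(*), WHERE responseCode >= 200 AND responseCode < 400) AS 'successes', filter(count(*), WHERE responseCode > 400 OR responseCode < 0) AS 'failures' FROM SyntheticRequest WHERE monitorName = '1_MonitorTest' FACET hierarchicalURL SINCE 30 minutes ago

A URL that already shows successes there is a good candidate for simulating a bad response code.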

I’m also looping in one of my colleagues to this thread, who is a Synthetics expert, in case you have further questions on this.

Hi @Fidelicatessen,

Thank you for your explanations. There was definitely a flaw in my tests, as I changed a URL to simulate a failure. This introduces a URL which only has failures but never a successful run. So I changed the failure simulation to overwrite the statusCode for one specific URL. But I did this only in the custom event which is created in the script, as this is the event type we want to use for analysis anyway. The alert conditions have been modified accordingly (and could be simplified, as I did not have to filter out the posts to the Insights API).

The result is still somewhat astonishing to me.

  1. Condition CheckFailureNegative (i.e. check that only success statuses are retrieved): this condition alerts properly on the failing URL, but then seems to detect failures for all other URLs in the script as well. After a while it closes correctly, although the notification includes the wrong URL. This mechanism would be OK, but I would like to understand why all URLs in the script are considered failing and how I could avoid this. (Links are in the attached document.)

  2. Condition CheckFailurePositive (i.e. check for a failing request to a URL): this condition correctly reports the failure, but it never closes. If I run the query

SELECT count(*) FROM MonitorUrl where StatusCode > 400 facet MonitoredUrl TIMESERIES 1 minute SINCE '2019-08-05 16:00:00' UNTIL '2019-08-05 17:45:00' with TIMEZONE 'Europe/Berlin'

for the time period shown in the incident https://alerts.newrelic.com/accounts/1536085/incidents/78786429/violations, I get different results. The incident stops at about 16:30, while the query returns 0 failures for the time after this (check the JSON result to see this). My understanding has been that the query of an NRQL condition is executed every minute. Therefore I would expect that the condition eventually gets the result of 0 failures as well. Is that assumption not correct?

  1. Why we have more than one URL in the script: our user departments have to check the availability of up to 70 URLs. Until now they did this with separate ping monitors, but we ended up with several hundred monitors and the same number of alert conditions. This got out of control, and we now want to provide the users with one scripted monitor per domain and one accompanying alert condition. This high number of URLs in one script is also why I would like to understand why we get false positives with the CheckFailureNegative condition. It could mean that we run into so many false failures, which open and may take a while to close, that the whole method collapses.

If you think we could examine this better in a support ticket, you may create one.

Kind regards
Wolfgang

TestAlertConditions2.pdf (19.8 KB)

The method of testing for successful monitor runs has another drawback. If for some reason the time between monitor runs increases, as it did here:

https://synthetics.newrelic.com/accounts/1536085/monitors/5ae848b9-5c45-46f9-9dc5-c994f232f1a2/results?tw[start]=1565154098.567&tw[end]=1565164898.567
where the distance between two runs was 20 minutes instead of the scheduled 10 minutes, then an alert condition checking for the existence of successes fires a false alert, because it does not find any.

I think it would be an interesting topic for New Relic to work on a solution proposal for how multiple URLs can be monitored without creating a plethora of ping monitors and alert conditions, which are hard to maintain.

Thanks for all this detailed sharing @wkiNewRelic - Is this something that would make sense as an Alerts feature idea?

https://discuss.newrelic.com/c/feature-ideas/feature-ideas-alerts

Hi @hross,

Thank you for proposing this. In the meantime I was able to clarify this topic together with our technical support, @cshanks. During one of my test runs there was an outage at New Relic which corrupted the test results. I can now confirm that a faceted NRQL alert condition which checks that all facets have successful tests works as designed. But you have to consider the time thresholds, as the synthetic monitors do not run at exact intervals.

It would be worth a lessons-learned or solution article containing the whole method (scripted monitor plus NRQL alert condition). In a situation where you have to check many URLs and would otherwise end up with lots of monitors and conditions, this can reduce maintenance overhead quite a bit. In our case, 4 or 5 monitors and conditions will replace about 400 monitors and conditions (which do not even cover all URLs at the moment, because our users refused to create more and more monitors).

Kind regards
Wolfgang


Hey @wkiNewRelic

Thanks for the feedback - we spent some time looking at this and considering the solution article you suggested. I wanted to get you an update on those conversations.

Since the way you are utilising Synthetics monitors (to hit multiple URLs with a single monitor) is not something we explicitly support with Synthetics or Alerts, this is not something we’d like to promote by writing an article explaining how to do so.

Additionally, as it’s not something Synthetics or Alerts is specifically designed to handle, the resulting behaviour is untested and not fully understood.

Instead, we have filed a feature idea with the Synthetics team to allow for better handling of your use case: utilising a single monitor for multiple URLs.


Any solution which makes it easier for users to manage a large number of monitors that belong together is welcome.

By the way, of course we use labels. Assigning the same label value to all monitors belonging to one application resulted in 79 monitors. That still means you have to work through 5 pages of monitors.


Thanks @wkiNewRelic - All of this feedback is great stuff to pass on to the Synthetics Dev team. :smiley:

Hi @Fidelicatessen,

Interesting article.
The proposed solution is based on a 'sum of query results' threshold and a check that returns 0/1.

I have the same issue with some Nagios OHI scripts that aren’t running every minute: alerts don’t auto-close because of NULL values.

The extra complexity I encounter: I want to check on serviceCheck.status values of 1 (Warning), 2 (Critical), and 3 (Unknown), with a critical if the value is >= 2 and a warning if it is <= 1. Is there a way to make this work?

  • sum of query results based on serviceCheck.status -> values between 0 and 6
  • count(serviceCheck.status) -> can’t tell the difference between warning and critical. The only workaround seems to be creating 2 different NRQL alert conditions.

Is there a way to make this work with 1 alert condition?

Hi @stijndd

So long as you’re not using faceted NRQL queries to drive your alert conditions, you won’t run into the issue with null values showing up in the absence of data.

The thresholds you’re describing, though, sound like the alert condition would be opening violations, warning or critical, no matter what values the query returns. You might be able to make this work with only one condition, but I would instead recommend setting up one threshold (the critical one) and using the lack of violations to let you know that the serviceCheck.status values are in the other range (<= 1, i.e. not violating the critical threshold).
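
As a rough sketch of that recommendation (the NagiosServiceCheckSample event type and the serviceCheck.name attribute are assumptions on my part; substitute whatever your integration actually reports), the condition query could look like:

SELECT latest(serviceCheck.status) FROM NagiosServiceCheckSample WHERE serviceCheck.name = 'my_service_check'

with a single critical threshold along the lines of "at least once >= 2 in 5 minutes". Values at or below 1 then simply never open a violation.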

I hope this helps!

At last review, this feature was not accepted onto the roadmap. This request has been closed.