Sum of Query Results - Clarification

We have static NRQL-based alert conditions set up for error rate (the percentage of 4xx and 5xx response codes). The condition looks something like this:

4xx response code % > 3 % for at least 5 minutes.

A similar condition was set for 5xx response codes. The condition was set to evaluate as single_value instead of sum. However, we observed a few instances where the number of 5xx response codes was significant, but we did not receive any alerts. Based on our assessment, it seems that during the period when the 5xx error rate was high, there were probably single-minute windows in between where the error rate was below 3%. Since there weren't 5 consecutive minutes where the error rate was above 3%, the alert incident did not fire.

However, the overall error rate, if calculated over a single 5-minute window instead of five 1-minute windows, was well above 3%. So we tried evaluating sum, and it seems that changing the condition to sum literally adds up the results of the five 1-minute windows instead of evaluating over a single 5-minute window, which results in a lot of unnecessary alert incidents. Is this behavior expected? Is there a way to evaluate over a single 5-minute window instead of five 1-minute windows? What would be the best alternative?
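To make the difference concrete, here is a minimal Python sketch with made-up traffic numbers (the request and error counts are hypothetical, not taken from our services). It compares the per-minute check, which needs every 1-minute window above the threshold, with the error rate pooled over the whole 5 minutes.

```python
# Hypothetical (total requests, 5xx responses) for five consecutive minutes.
minutes = [(1000, 80), (1000, 20), (1000, 90), (1000, 25), (1000, 70)]
THRESHOLD_PCT = 3.0

# Error rate of each 1-minute window, as a percentage.
per_minute = [100 * errors / total for total, errors in minutes]

# "For at least 5 minutes" needs every one of the five windows above threshold.
every_minute_breaches = all(rate > THRESHOLD_PCT for rate in per_minute)

# Error rate pooled over the whole 5 minutes.
total_requests = sum(total for total, _ in minutes)
total_errors = sum(errors for _, errors in minutes)
pooled = 100 * total_errors / total_requests

print(per_minute)             # rates per 1-minute window
print(every_minute_breaches)  # does the per-minute condition hold?
print(pooled)                 # rate over the single 5-minute window
```

With these numbers, two of the five minutes fall below 3%, so the per-minute condition never holds even though the pooled rate is well above the threshold.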

Here is the sample NRQL that we are leveraging:
SELECT percentage(count(*), WHERE httpResponseCode LIKE '4%') FROM Transaction WHERE appName like '<appName>'

I'd appreciate any help with this. We have been missing important alerts for some of our services that went down recently.

You could also use Sum of query results, and just increase the threshold. If the error rate is 3% for 5 minutes, the sum would be 15, so make that your threshold.
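As a rough illustration of that arithmetic, the Python sketch below scales the per-window threshold by the number of windows. The helper name and the two sample series are hypothetical. It also shows one caveat worth keeping in mind: a summed threshold cannot tell a sustained error rate apart from a single large spike.

```python
def sum_threshold(per_window_pct, windows):
    """Scale a per-window percentage threshold to a sum-of-results threshold."""
    return per_window_pct * windows

threshold = sum_threshold(3, 5)  # 3% per minute over 5 one-minute windows

# Hypothetical per-minute error percentages.
steady = [3.5, 3.5, 3.5, 3.5, 3.5]  # sustained elevated error rate
spike = [16.0, 0.0, 0.0, 0.0, 0.0]  # one bad minute, then clean

steady_breaches = sum(steady) > threshold
spike_breaches = sum(spike) > threshold

print(threshold, steady_breaches, spike_breaches)
```

Both series cross the summed threshold, so this approach may still open incidents for short spikes.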

Hi @siddhant.agarwal

I just wanted to point out that you can now set your aggregation window to values other than 1 minute. Take a look at the Advanced Signal Settings section in this documentation.

If you need a 5-minute aggregation window, you can configure your NRQL alert condition that way! Just keep in mind that best practices are to set your threshold window to a multiple of your aggregation window (so if you use a 5-minute aggregation window, your threshold window should be a multiple of 5).

So, if I understand correctly, if I want the evaluation done over a single 5-minute window with the threshold duration set to 5 minutes, then this is how the condition should look:

  • So, New Relic will run the NRQL query once after collecting data over a 5-minute window, wait for an additional offset of 3 minutes to allow late data to come in, and then alert based on the threshold set?

  • What does the 15 minutes shown below the Evaluation offset in the image signify?

Hi @siddhant.agarwal

You have this set up correctly.

So, New Relic will run the NRQL query once after collecting data over a 5-minute window, wait for an additional offset of 3 minutes to allow late data to come in, and then alert based on the threshold set?

What does the 15 minutes shown below the Evaluation offset in the image signify?

That is mostly correct, but the offset will be 3 Aggregation Windows, not 3 minutes. It’s natural to think of evaluation offset in terms of minutes, since New Relic never let you set Aggregation Windows before (they were hard-coded to always be 1 minute). Now, however, Evaluation Offset is set in terms of Aggregation Windows. If your Aggregation Window is set to 1 minute, then an Evaluation Offset of 3 would result in a 3-minute offset. However, your Aggregation Window is set to 5 minutes, so that Evaluation Offset would actually be 15 minutes (3 x 5).

This means that the query that is actually being evaluated will have a SINCE 15 minutes ago UNTIL 10 minutes ago on the end of it.
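The offset arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not a New Relic API; the function name and parameters are hypothetical, and the formula is inferred from the 5-minute example in this reply.

```python
def evaluated_range(aggregation_window_min, evaluation_offset_windows):
    """Build the time clause implied by an offset counted in aggregation windows."""
    # The offset is counted in windows, so convert it to minutes first.
    since = aggregation_window_min * evaluation_offset_windows
    # The evaluated slice is one aggregation window long.
    until = since - aggregation_window_min
    return f"SINCE {since} minutes ago UNTIL {until} minutes ago"

five_minute = evaluated_range(5, 3)
one_minute = evaluated_range(1, 3)

print(five_minute)
print(one_minute)
```

With a 5-minute window and an offset of 3, this gives the SINCE 15 minutes ago UNTIL 10 minutes ago clause mentioned above. With the old fixed 1-minute window, the same offset gives a 3-minute delay.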

So, this sounds good: SINCE 15 minutes ago UNTIL 10 minutes ago. Just to confirm once again: with the above setting, the NRQL will execute only once, right, since the aggregation window is 5 minutes? So that means the query that effectively runs is:

SELECT … FROM Transaction SINCE 15 minutes ago UNTIL 10 minutes ago

and this executes only once since the threshold duration is also set to 5 minutes, correct? Had the aggregation window been set to 1 minute, it would have executed the NRQL 5 times, summed up the results, and then evaluated the sum against the threshold?

this executes only once since the threshold duration is also set to 5 minutes, correct?

The aggregation window being set to 5 minutes means the query is running once every 5 minutes. The aggregation window setting determines how often the query will run, thus aggregating data into a single point that can be evaluated against the threshold.

Had the aggregation window been set to 1 minute, it would have executed the NRQL 5 times, summed up the results, and then evaluated the sum against the threshold?

So long as the threshold is set to Sum of query results..., then yes, this is exactly correct. Actually, every time it ran the query it would have calculated a sum of the previous 5 minutes at that moment, using a sliding 5-minute window, and evaluated that sum against the threshold.
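Here is a minimal Python sketch of that sliding-sum behavior, assuming 1-minute aggregation windows, a 5-minute threshold duration, and a sum threshold of 15. The per-minute values are invented for illustration, and this is a simplified model rather than the actual evaluation engine.

```python
from collections import deque

def sliding_sums(values, window=5):
    """Return the sum of the most recent `window` values after each new value."""
    recent = deque(maxlen=window)  # oldest value drops out automatically
    sums = []
    for value in values:
        recent.append(value)
        if len(recent) == window:  # only evaluate once the window is full
            sums.append(sum(recent))
    return sums

# Hypothetical per-minute error percentages from 1-minute aggregation windows.
per_minute_rates = [4.0, 2.0, 5.0, 1.0, 6.0, 3.0, 0.0]

sums = sliding_sums(per_minute_rates)
breaches = [s > 15 for s in sums]

print(sums)
print(breaches)
```

Each new minute produces a fresh 5-minute sum, so the condition is re-evaluated every minute rather than once per 5 minutes.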