We have static NRQL-based alert conditions set up for error rate (the percentage of 4xx and 5xx response codes). The condition looks something like this:
4xx response code % > 3% for at least 5 minutes.
A similar condition was set for 5xx response codes as well. The condition was set to evaluate as single_value instead of sum. However, we observed a few instances where the number of 5xx response codes was significant, but we did not receive any alerts. Based on our assessment, it seems that while the 5xx error rate was high, there were probably individual one-minute windows in between where the error rate was below 3%. Since there weren't 5 consecutive minutes where the error rate was above 3%, the alert incident did not fire.
However, the overall error rate, if calculated over a single 5-minute window instead of five one-minute windows, was well above 3%. So we evaluated the sum option, and it seems that changing the condition to sum literally adds up the results of the five one-minute windows instead of evaluating over a single 5-minute window, which results in a lot of unnecessary alert incidents. Is this behavior expected? Is there a way to evaluate over a single 5-minute window instead of five one-minute windows? What would be the best alternative?
Here is the sample NRQL that we are leveraging:
SELECT percentage(count(*), WHERE httpResponseCode LIKE '4%') FROM Transaction WHERE appName LIKE '<appName>'
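
To make the difference between the three evaluation styles described above concrete, here is a small Python sketch. It only models our reading of the behavior and is not New Relic's actual implementation. The per-minute request and error counts are made up for illustration, and the helper names (all_windows_breach, sum_of_rates, pooled_rate) are hypothetical.

```python
# Illustrative model only: made-up per-minute counts, not real data,
# and not New Relic's actual evaluation logic.

THRESHOLD_PCT = 3.0

# Each tuple is (total_requests, error_requests) for one one-minute window.
# The third minute has little traffic and a low error rate (2%).
incident_windows = [(1000, 80), (1000, 90), (50, 1), (1000, 85), (1000, 70)]

# Healthy traffic: a steady 1% error rate in every minute.
healthy_windows = [(1000, 10)] * 5


def per_minute_rates(windows):
    """Error percentage of each one-minute window, computed separately."""
    return [100.0 * errors / total if total else 0.0 for total, errors in windows]


def all_windows_breach(windows, threshold=THRESHOLD_PCT):
    """Models 'single value > threshold for at least 5 minutes':
    every one-minute window must individually exceed the threshold."""
    return all(rate > threshold for rate in per_minute_rates(windows))


def sum_of_rates(windows):
    """Models what 'sum' appears to do: add up the five per-minute percentages."""
    return sum(per_minute_rates(windows))


def pooled_rate(windows):
    """Error percentage over one combined 5-minute window."""
    total = sum(t for t, _ in windows)
    errors = sum(e for _, e in windows)
    return 100.0 * errors / total if total else 0.0


if __name__ == "__main__":
    for name, windows in [("incident", incident_windows), ("healthy", healthy_windows)]:
        print(name, "per-minute rates:", per_minute_rates(windows))
        print(name, "all five windows breach:", all_windows_breach(windows))
        print(name, "sum of rates:", sum_of_rates(windows))
        print(name, "pooled 5-minute rate:", pooled_rate(windows))
```

With these made-up numbers, the incident case does not breach in all five windows even though its pooled 5-minute rate is well above 3%, while the healthy case has a pooled rate of 1% but a summed rate above 3%. That matches both the missed alerts and the unnecessary ones we saw.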