Observed issue
I’ve created an alert on the running count of an ECS service, and it generates alerts unexpectedly.
The alert NRQL definition is:
SELECT max(provider.runningCount) as 'running API count'
from ComputeSample
where displayName = 'ecs-service-name'
The evaluation offset is 5 minutes, and the condition is set to alert if the result stays at 0 for more than 10 minutes.
The alert fires seemingly at random, depending on which time window it queries in New Relic; sometimes it works and sometimes it doesn't.
Expected behaviour
The alert should only fire when the actual count drops to zero.
Actual behaviour
The actual query used is:
SELECT max(provider.runningCount) as 'running API count'
from ComputeSample where displayName = 'ecs-service-name'
SINCE 5 mins ago
UNTIL 4 mins ago
The 5 minutes in SINCE comes from the evaluation offset, and the 4 minutes in UNTIL is the evaluation offset minus 1, as explained in the alerting documentation:
Every minute, we evaluate the NRQL query in one-minute time windows
The issue is that New Relic polls AWS for values every 5 minutes, so only one of every five one-minute windows contains a value for the query to return. Because the evaluation window is fixed at one minute, it may or may not include a value for provider.runningCount.
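To illustrate what I think is happening, here is a rough Python sketch of the timing. It is only a simulation of my understanding, not anything taken from New Relic: the function name windows_with_data and the poll_phase parameter are made up for illustration, and I am assuming samples land exactly on a 5-minute grid.

```python
# Rough simulation of the timing problem described above.
# Assumption: one sample arrives every 5 minutes, and each alert
# evaluation looks at a single one-minute window.
POLL_INTERVAL_MIN = 5  # polling interval from the description above


def windows_with_data(total_minutes, poll_phase=0):
    """For each one-minute window, report whether a polled sample falls in it.

    poll_phase is a hypothetical offset (in minutes) of the first sample.
    """
    sample_minutes = set(range(poll_phase, total_minutes, POLL_INTERVAL_MIN))
    return [minute in sample_minutes for minute in range(total_minutes)]


if __name__ == "__main__":
    flags = windows_with_data(15)
    for minute, has_data in enumerate(flags):
        status = "has a value" if has_data else "no data"
        print(f"window starting at minute {minute}: {status}")
    print(f"{sum(flags)} of {len(flags)} windows contain a sample")
```

Under these assumptions only 3 of the 15 one-minute windows contain a sample, which matches what I see: most evaluations have nothing to aggregate.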
I don’t know how to work around this limitation. Any ideas?