AWS ECS alert on runningCount fires incorrectly due to polling window

Observed issue

I’ve created an alert on the running count of an ECS service, and it generates alerts unexpectedly.

The alert NRQL definition is:

SELECT max(provider.runningCount) AS 'running API count'
FROM ComputeSample
WHERE displayName = 'ecs-service-name'

The evaluation offset is 5 minutes, and the alert condition is set to fire if the result is at 0 for more than 10 minutes.

The alert fires seemingly at random depending on the window in which it queries New Relic; sometimes it works, sometimes it doesn't.

Expected behaviour

The alert should only fire when the actual count drops to zero.

Actual behaviour

The actual query used is:

SELECT max(provider.runningCount) AS 'running API count'
FROM ComputeSample
WHERE displayName = 'ecs-service-name'
SINCE 5 mins ago
UNTIL 4 mins ago

The 5-minute value comes from the evaluation offset; 4 minutes is the offset minus 1, as explained in the alerting documentation:

Every minute, we evaluate the NRQL query in one-minute time windows

The issue is that New Relic polls AWS for values every 5 minutes, so only 1 out of every 5 one-minute buckets contains a value for the query to return. Because the time window is fixed at one minute, it may or may not include a value for provider.runningCount.
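One way to see this (a sketch, assuming the same service name) is to chart the metric in Insights with a 1-minute TIMESERIES; only roughly one bucket in five will contain a data point, with gaps in between:

SELECT max(provider.runningCount) AS 'running API count'
FROM ComputeSample
WHERE displayName = 'ecs-service-name'
SINCE 30 minutes ago TIMESERIES 1 minute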

I don’t know how to work around this limitation. Any ideas?

Hello @james.telfer,

Would you please send us a link to the alert condition in question so that we can take a closer look at it? (Only New Relic employees will be able to access this link.)

@zahrasiddiqa thanks for your reply! This is the alert condition in question.



Hey @james.telfer, Thank you for sending that on.

It looks like Incident 86899579 was opened for this alert condition.

Having queried for it in Insights, it looks like the result of the query was below 1 for 10 minutes, as you can see here. The alert cannot close automatically because the inverse of the threshold did not occur: the result of the query would have to stay at or above 1 continuously for 10 minutes. Since that did not happen, the alert is left open.

Further, general practice is to set the evaluation offset to at least 15 minutes on NRQL alert conditions that query cloud integration data.

Hope that helps :slight_smile:


Thanks @zahrasiddiqa for your response. I’m afraid I don’t understand.

It's not possible for the query to report a value of '1' continuously for 10 minutes, because Alerts queries in 1-minute windows and the AWS polling happens every 5 minutes. So for 4 out of those 5 minutes we're going to get a 0.

The alert initially triggered on nothing: the service did not go down. It just queried in the wrong timeframe, then never reset.

Setting the evaluation offset from 5 to 15 minutes changes the query from SINCE 5 mins ago UNTIL 4 mins ago to SINCE 15 mins ago UNTIL 14 mins ago. I'm not sure why this will make a difference, since the 1-minute window is still not guaranteed to overlap with the event published by the New Relic poll of AWS.
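For comparison, a window at least as wide as the 5-minute polling interval would always contain at least one sample. This is just an Insights query to illustrate the point, not how the alert condition actually evaluates:

SELECT max(provider.runningCount) AS 'running API count'
FROM ComputeSample
WHERE displayName = 'ecs-service-name'
SINCE 10 minutes ago UNTIL 5 minutes ago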

Hey @james.telfer,

Apologies for the confusion. Let me clarify: the evaluation offset should be increased to account for latency and to make sure we are not evaluating incomplete data.

By raising the Evaluation Offset setting, you will give the alert condition a chance to wait for the data to show up before trying to evaluate it.

One of our support engineers created a post in our Community explaining how data latency works with NRQL alerting, and how you can best utilise the evaluation offset to successfully receive alert violations in a timely manner:

Relic Solution: Better Latent Than Never -- How Data Latency Affects NRQL Alert Conditions

You could also use Infrastructure alerts with Cloud Integration data to account for the latency.

Hope that helps :slight_smile:
