Clarifying alert thresholds

We have set up NRQL alerts against GKE containers to monitor pod restarts. If a pod restarts more than 2 times in a 10-minute period, it should trigger the alert, but this is not happening. This is the NRQL being used:

SELECT max(restartCount) - min(restartCount) AS 'Restarts' FROM K8sContainerSample WHERE clusterName = 'cluster-name' AND containerName = 'container-name' AND namespace = 'our-name-space' FACET podName

The threshold is defined as Static: “When the query returns a value above 2 at least once in 10 minutes”

I have a pod that is in CrashLoopBackOff, so I am observing it restart. From my understanding of the alert, this should be the equivalent query to run in Insights:

SELECT max(restartCount) - min(restartCount) AS 'Restarts' FROM K8sContainerSample WHERE clusterName = 'clusterops-development-gke' AND containerName = 'geronimo-schedule-handler' AND namespace = 'content-engineering-development-east' FACET podName SINCE 10 minutes ago

If that is the case, when I first start this pod up the query returns 0 for 5 minutes. During that period the pod restarts 6 times. GKE adds an exponential delay after each restart, up to a 5-minute delay. Given that, there are 6 restarts from 0 to 5 minutes and 1 more restart from 5 to 10 minutes, for a total of 7 restarts at the end of 10 minutes. New Relic Insights reports 0 for the first 5 minutes and then reports 6. After the next 5 minutes it reports 1. That makes sense to me, given that after the second 5 minutes the min value is 6 and the max value is 7.

I would expect the alert to trigger, though, given that in the first 10 minutes the min should be 0 and the max 6 or 7. Why is this not the case?

The only explanation I can come up with is that the 10-minute threshold is not “since 10 minutes ago until now” but more like between 11:00 a.m. and 11:10 a.m., and then 11:10 a.m. to 11:20 a.m. That is, the alert does not check every minute looking back 10 minutes, but only checks every 10 minutes, looking back over that 10-minute period. If that is the case, then I can see how, if reporting happens at 5-minute intervals, the check could happen when min was 6 and max was 7 for the 10-minute period.

Still, I would think that in the first 10-minute interval a min value of 0 should have been reported, unless it completely discarded the first 5 minutes.

Hi @dwashko

The disconnect here may be that the alerts evaluation system only looks at a single minute at a time. It does not look back 10 minutes. In fact, if you have the evaluation offset at its default setting, you have an implicit SINCE 3 minutes ago UNTIL 2 minutes ago.

Your threshold, “When the query returns a value above 2 at least once in 10 minutes”, is checked every minute for a value above 2. Every minute, it takes max(restartCount) and subtracts min(restartCount) for that minute only. In most cases this results in 0, since the max and min values within a single minute are the same.
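To make the difference concrete, here is a rough Python sketch of the two ways of reading the condition. It is only a simplified model of the behavior described above, not how the alerts pipeline is actually implemented, and the sample data, the 30-second spacing, and the function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical samples: (seconds since pod start, cumulative restartCount).
SAMPLES = [
    (0, 0), (30, 0),      # minute 0
    (60, 1), (90, 1),     # minute 1
    (120, 2), (150, 2),   # minute 2
    (180, 2), (210, 3),   # minute 3: a restart lands mid-minute
    (240, 3), (270, 3),   # minute 4
]


def per_minute_deltas(samples):
    """max(restartCount) - min(restartCount), computed per one-minute bucket."""
    buckets = defaultdict(list)
    for seconds, restart_count in samples:
        buckets[seconds // 60].append(restart_count)
    return {minute: max(values) - min(values)
            for minute, values in sorted(buckets.items())}


def whole_window_delta(samples):
    """max(restartCount) - min(restartCount) over the entire sample range."""
    counts = [restart_count for _, restart_count in samples]
    return max(counts) - min(counts)


if __name__ == "__main__":
    print(per_minute_deltas(SAMPLES))
    print(whole_window_delta(SAMPLES))
```

With this made-up data, no single minute ever shows a difference above 2, even though the difference across the whole range is 3. That is the gap between what the Insights query with a multi-minute SINCE clause shows and what a minute-by-minute evaluation sees.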

I hope this helps to better understand the behavior of this alert condition, but I may be reading your question wrong, and I may be getting some details wrong about the behavior of the condition – if you think this is the case, please include a link to the alert condition so that I can investigate further.


To be clear on something: if the alert is changed to use “Sum of Query Results” as opposed to “Query Returns a Value Of”, would that look at the values returned in that 10-minute period and add them together? So in the case here, with a 10-minute period, it would look at each of the evaluation periods within those 10 minutes, take their values, and sum them. For example, if two periods each returned a value of 2, this would trip an alert whose threshold is 3 or more in a 10-minute period. Is that correct?

Hi @dwashko

“Sum of query results are above 2 at least once in 10 minutes” would still look at each minute discretely, but it would keep a rolling 10-minute total of the values from each discrete minute. So if, in adding up the results, the rolling sum ever went as high as 2.0001, a violation would be opened.
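As a rough illustration of that rolling total, here is a small Python sketch. It is a simplified model under assumed inputs, not the actual alerting implementation: the per-minute values are invented, and `rolling_sum_violations` is a hypothetical helper name.

```python
from collections import deque


def rolling_sum_violations(per_minute_values, window=10, threshold=2):
    """Return the minute indexes at which the rolling sum exceeds the threshold.

    Each entry of per_minute_values stands for the query result of one
    discrete minute; only the most recent `window` minutes are summed.
    """
    recent = deque(maxlen=window)  # oldest minute drops off automatically
    violating_minutes = []
    for minute, value in enumerate(per_minute_values):
        recent.append(value)
        if sum(recent) > threshold:
            violating_minutes.append(minute)
    return violating_minutes


# Hypothetical per-minute results: three single restarts spread over 6 minutes.
VALUES = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    print(rolling_sum_violations(VALUES))
```

In this invented series no individual minute is above 2, so a “query returns a value” threshold would never be crossed, while the rolling 10-minute sum goes above 2 once the third restart arrives and drops back once the first restart ages out of the window.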

I hope this makes sense, but if I can clarify further let me know.


Thanks, your information cleared up the misunderstandings I had.


What is the best way to set up an alert for the use case described above (pod restarts), where we want to alert when the change in a value within a time window exceeds a specific threshold?
