We have set up NRQL alerts against GKE containers to monitor pod restarts. If a pod restarts more than 2 times in a 10-minute period, it should trigger the alert, but this is not happening. This is the NRQL being used:
SELECT max(restartCount)-min(restartCount) as 'Restarts' from K8sContainerSample WHERE clusterName = 'cluster-name' AND containerName = 'container-name' and namespace = 'our-name-space' facet podName
The threshold is static, defined as: “When the query returns a value above 2 at least once in 10 minutes”.
I have a pod that is in CrashLoopBackOff, so I can observe it restarting. From my understanding of the alert, this should be the equivalent query to run in Insights:
SELECT max(restartCount)-min(restartCount) as 'Restarts' from K8sContainerSample WHERE clusterName = 'clusterops-development-gke' AND containerName = 'geronimo-schedule-handler' and namespace = 'content-engineering-development-east' facet podName since 10 minutes ago
If that is the case, when I first start this pod the query returns 0 for 5 minutes. During that period the pod restarts 6 times. GKE adds an exponential back-off delay after each restart, up to a 5-minute delay. Given that, there are 6 restarts from 0 to 5 minutes and 1 more from 5 to 10 minutes, for a total of 7 restarts at the end of 10 minutes. New Relic Insights reports 0 for the first 5 minutes and then reports 6. After the next 5 minutes it reports 1. That makes sense to me, given that after the second 5 minutes the min value is 6 and the max value is 7.
I would expect the alert to trigger, though, given that in the first 10 minutes the min should be 0 and the max 6 or 7. Why is this not the case?
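To make my expectation concrete, here is a minimal Python sketch of the arithmetic. The sample times (minutes 0, 5, and 10 after pod start) are an assumption based on the 5-minute steps I see in Insights; the restartCount values 0, 6, and 7 are the ones described above. `restarts_in_lookback` is a hypothetical helper for illustration, not anything from New Relic.

```python
# Assumed samples: (minutes since pod start, restartCount).
SAMPLES = [(0, 0), (5, 6), (10, 7)]
THRESHOLD = 2


def restarts_in_lookback(samples, now, window=10):
    """Return max(restartCount) - min(restartCount) over (now - window, now]."""
    values = [count for t, count in samples if now - window < t <= now]
    if not values:
        return None  # no samples fall inside the lookback window
    return max(values) - min(values)


# Evaluate a "since 10 minutes ago" lookback at a few points in time.
for now in (5, 9, 10):
    delta = restarts_in_lookback(SAMPLES, now)
    fires = delta is not None and delta > THRESHOLD
    print(f"minute {now}: delta={delta}, above threshold={fires}")
```

Under these assumptions, a lookback evaluated anywhere between minute 5 and minute 9 sees both the 0 and the 6, so the delta is 6 and I would expect the condition to be violated. At minute 10 the 0 sample has aged out and the delta drops to 1, which matches what I see in Insights.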
The only explanation I can come up with is that the 10-minute threshold is not “since 10 minutes ago until now” but more like 11:00 a.m. to 11:10 a.m., and then 11:10 a.m. to 11:20 a.m. In other words, the alert does not check every minute looking back 10 minutes, but only checks every 10 minutes, looking back over that 10-minute period. If that is the case, then I can see how, if reporting happens on 5-minute intervals, the check could happen when the min was 6 and the max was 7 for the 10-minute period.
Still, I would think that in the first 10-minute interval a min value of 0 should have been reported, unless it completely discarded the first 5 minutes.
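Here is a small Python sketch of the fixed-window behaviour I am guessing at. It only simulates my hypothesis; I do not know that New Relic actually evaluates alerts this way. The sample times (minutes 0, 5, and 10 after pod start) and the 3-minute offset between pod start and the window boundary are assumptions, and `restarts_per_fixed_window` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed samples: (minutes since pod start, restartCount).
SAMPLES = [(0, 0), (5, 6), (10, 7)]


def restarts_per_fixed_window(samples, window=10, offset=0):
    """Group samples into back-to-back windows and return max - min for each.

    Windows are [offset + k*window, offset + (k+1)*window); the result maps
    each window's start minute to its delta.
    """
    buckets = defaultdict(list)
    for t, count in samples:
        start = (t - offset) // window * window + offset
        buckets[start].append(count)
    return {start: max(v) - min(v) for start, v in sorted(buckets.items())}


# Window boundary coincides with pod start: the first window sees 0 and 6.
print(restarts_per_fixed_window(SAMPLES, offset=0))

# Window boundary falls 3 minutes after pod start: the 0 sample lands in the
# previous window, so the next window only sees 6 and 7.
print(restarts_per_fixed_window(SAMPLES, offset=3))
```

With the boundary aligned to pod start, the first window has a delta of 6 and should violate the threshold. With the 3-minute offset, the largest delta in any window is 1, which would never cross a threshold of 2. That would explain the missing alert without the first 5 minutes being discarded, but only if the evaluation really uses fixed windows like this.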