Create alert for long-running Kubernetes cron jobs

Hi, I’m trying to create an alert for any cronjobs in our Kubernetes cluster that run for longer than a defined period of time, e.g. 2 hours.
The logic seems simple: look for any cronjob pods that still have a status of ‘Running’ within the last 10 minutes and were created more than 2 hours before the current time.
But I’m struggling to get this implemented in New Relic.
See my current query below:
SELECT latest(status) AS 'Status', latest(createdAt) AS 'Created' FROM K8sPodSample WHERE namespace = 'production-jobs' AND status = 'Running' FACET clusterName, namespace, podName, createdAt SINCE 10 minutes ago UNTIL 1 minute ago LIMIT 100

See attached screenshot of query showing 2 pods that were created on 10th August 2021 that I would like to trigger an alert for.

Any help would be appreciated.


Hi @brett.rutkowski

Welcome to our Explorers Hub!!

That was a really interesting question! I enjoyed investigating it!

Please remember: all queries below are examples based on your question and need to be tested and modified to fit your needs.

So, a couple of changes are necessary:

  • NRQL-based alert conditions don’t allow the SINCE and UNTIL clauses.

    • The evaluation period is defined by the condition itself instead.
  • You may not want all of those attributes in the FACET clause.

    • In my example I used only clusterName and podName, but feel free to include any others you need.

My findings:

  • How to calculate the time:
    • The K8sPodSample event doesn’t have a field with the “lifetime” of the pod, but we can get an approximate estimate using the timestamp of the event.
  • The query I used was:
SELECT clusterName, podName, createdAt, (timestamp/1000 - createdAt) AS 'lifetime' FROM K8sPodSample WHERE namespace = 'production-jobs' AND status = 'Running'

The timestamp seems to be recorded in milliseconds while createdAt is in seconds, so I divided the timestamp by 1000. Example results [here]
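
To make the unit handling concrete, here is a minimal Python sketch of the same arithmetic the query performs. The function name and the sample values are hypothetical and only illustrate the conversion; it assumes, as observed above, that timestamp is in milliseconds and createdAt is in seconds.

```python
def pod_lifetime_seconds(timestamp_ms: int, created_at_s: int) -> float:
    """Approximate pod lifetime in seconds.

    Mirrors the NRQL expression (timestamp/1000 - createdAt), assuming
    timestamp is epoch milliseconds and createdAt is epoch seconds.
    """
    return timestamp_ms / 1000 - created_at_s


# Hypothetical sample: pod created at some epoch second, sampled 7500 s later.
created_at_s = 1_628_600_000
timestamp_ms = (created_at_s + 7_500) * 1000

lifetime = pod_lifetime_seconds(timestamp_ms, created_at_s)
print(lifetime)
```

Forgetting the division would compare milliseconds against seconds and give a meaningless "lifetime" roughly a thousand times too large.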


The query to get the latest lifetime and to be used in the condition would be:

SELECT latest(timestamp/1000 - createdAt) FROM K8sPodSample WHERE namespace = 'production-jobs' AND status = 'Running' FACET clusterName, podName

Results [here]


So now you can just create an NRQL-based condition where the critical threshold is the time limit, in seconds, that the lifetime must not exceed.

More information about the NRQL-based conditions [here]


thanks

Rodrigo

Hello! I’m bumping this conversation since I have the exact same use case. I’ve implemented your suggestions for calculating lifetime, but I’m seeing some discrepancies between the calculated lifetime in New Relic and what I can query with kubectl. Do you have any idea what degree of accuracy this should have?

Currently the jobs we’re running are rather short - between 10 and 15 seconds, according to kubectl - but we expect them to get longer over time and want to keep an eye on it. The calculated lifetime in New Relic has been consistently lower - 2 to 10 seconds. There doesn’t seem to be any sort of correlation between actual length and reported length, at least at this scale.

Any ideas to address this? Or is this just going to have a limited accuracy at this scale?
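
One possible contributor to this gap, offered as a sketch and not as a confirmed answer from this thread: the lifetime is computed from the timestamp of the latest sample in which the pod was still 'Running', so it can only be as long as the time from creation to that last sample, never the time to actual completion. The Python below simulates that effect. The function name, the sample interval, and all values are hypothetical assumptions; check your own integration's sampling interval.

```python
from typing import Optional


def observed_lifetime_s(created_at_s: int, finished_at_s: int,
                        sample_interval_s: int,
                        first_sample_at_s: int) -> Optional[int]:
    """Lifetime as seen from periodic samples (all values hypothetical).

    Only samples taken while the pod exists (created <= t < finished)
    count. Returns None if the pod was never sampled while running.
    """
    last_sample = None
    t = first_sample_at_s
    while t < finished_at_s:
        if t >= created_at_s:
            last_sample = t
        t += sample_interval_s
    if last_sample is None:
        return None
    return last_sample - created_at_s


# A 12-second pod, sampled every 15 seconds (assumed interval),
# with the sampler starting at different offsets.
for first_sample in (95, 90, 98):
    print(first_sample, observed_lifetime_s(100, 112, 15, first_sample))
```

Under these assumptions the reported value is always below the real duration, and how far below depends only on where the sample happens to fall, which would look uncorrelated with the actual length for jobs shorter than the sampling interval. For a 2-hour threshold the same error would be negligible.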

Hey, just posting the actual solution we ended up going with; it has been working fine for us.

`SELECT count(*) from K8sPodSample facet podName WHERE namespace = 'production-jobs' and status = 'Running' and timestamp-(createdAt*1000) > 7200000`
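
For anyone comparing this with the earlier seconds-based expression, the following Python sketch shows that the two forms apply the same 2-hour limit, just in different units. The function names and sample values are hypothetical; it assumes timestamp is in milliseconds and createdAt is in seconds, as discussed earlier in the thread.

```python
TWO_HOURS_MS = 2 * 60 * 60 * 1000  # 7200000, the literal used in the query
TWO_HOURS_S = 2 * 60 * 60          # 7200


def exceeds_limit_ms(timestamp_ms: int, created_at_s: int) -> bool:
    """Millisecond form: timestamp - (createdAt * 1000) > 7200000."""
    return timestamp_ms - created_at_s * 1000 > TWO_HOURS_MS


def exceeds_limit_s(timestamp_ms: int, created_at_s: int) -> bool:
    """Second form: timestamp/1000 - createdAt > 7200."""
    return timestamp_ms / 1000 - created_at_s > TWO_HOURS_S


created_at_s = 1_628_600_000  # hypothetical creation time (epoch seconds)
for age_s in (600, 7_200, 7_201, 86_400):
    ts_ms = (created_at_s + age_s) * 1000
    print(age_s, exceeds_limit_ms(ts_ms, created_at_s),
          exceeds_limit_s(ts_ms, created_at_s))
```

The millisecond form keeps everything in integers, and placing the comparison in the WHERE clause means the count is non-zero only for pods already over the limit.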