The first thing to mention here is that monitoring a job that runs once a day is a problematic use case for Alerts. We look at data every minute, and the best signals to monitor are the ones that report data every minute. If the job only has one data point every 24 hours, then it can be challenging to write an Alerts condition to monitor it. It’s not impossible, but I would be remiss if I didn’t mention this from the very beginning.
The condition evaluates the query every minute, so if you care about every time the value falls below the threshold, you can set a much shorter duration, even as low as one minute. Think of the duration as a sliding window that looks at all of the data within that window as it moves through time. If you’ve set the condition to violate every time a data point dips below the threshold, then you only need to look at one minute at a time.
This duration also affects how long it takes a violation to close. You can think of the closing criteria as the logical negation of the opening criteria. If you open when you see any data point that violates within the duration, then to close you must see no data points that violate within the duration. A key fact here is that if we get no data points at all, we won’t know what’s happening.
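To make the open/close behavior concrete, here is a minimal Python sketch of the sliding window described above. It is only a simulation, not how Alerts is implemented: the function name `simulate_violation` is hypothetical, `None` stands for a minute with no data point, and the rule that an all-empty window leaves the state unchanged is my simplifying assumption.

```python
from collections import deque


def simulate_violation(values, threshold, duration):
    """Return the open/closed state after each minute.

    values:    one entry per minute; None means no data point arrived.
    threshold: a data point below this value counts as violating.
    duration:  window length in minutes.
    """
    window = deque(maxlen=duration)  # keeps only the last `duration` minutes
    is_open = False
    states = []
    for value in values:
        window.append(value)
        present = [v for v in window if v is not None]
        if any(v < threshold for v in present):
            # Opening criteria: any violating data point in the window.
            is_open = True
        elif present:
            # Closing criteria: data exists and none of it violates.
            is_open = False
        # Assumption: with no data at all in the window, keep the prior state.
        states.append(is_open)
    return states


if __name__ == "__main__":
    per_minute = [5, 1, None, None, 5, 5]
    print(simulate_violation(per_minute, threshold=3, duration=2))
```

With a two-minute duration, the dip at minute 1 opens a violation, and the two silent minutes that follow cannot close it because there is nothing to evaluate. It only closes once a healthy data point arrives.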
Which brings me to another point that I think we need to discuss. It is going to work best if you report the count of failures, rather than reporting the number of successes. The most robust way to use New Relic NRQL Alerts to notify you of failures would be to have a service that is reporting at least once a minute on the status of your cron job. It could be as simple as a bash script that greps the cron logs and reports a count of the failures it sees. This lets you configure your threshold to look for the presence of something rather than the absence of it. This will give you a stronger reassurance that you’re getting notified for the specific failure scenario you’re testing for, and not for some transient networking outage or other issue that might cause the number of successes to dip below your expected value.
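As a rough illustration of that kind of reporter, here is a Python sketch that counts failure lines for one job and builds a small event to send each minute. Everything specific in it is an assumption for the example: the failure pattern, the job name, and the `CronJobStatus` event type and `failureCount` attribute are hypothetical names, and the step that actually sends the event to New Relic is left out.

```python
import json
import re
import time

# Assumption: failure lines contain one of these words. Adjust to your logs.
FAILURE_PATTERN = re.compile(r"\b(error|failed|failure)\b", re.IGNORECASE)


def count_failures(log_lines, job_name):
    """Count log lines that mention the job and look like a failure."""
    return sum(
        1
        for line in log_lines
        if job_name in line and FAILURE_PATTERN.search(line)
    )


def build_event(job_name, failures, now=None):
    """Build a custom event payload (attribute names are hypothetical)."""
    return {
        "eventType": "CronJobStatus",
        "jobName": job_name,
        "failureCount": failures,
        "timestamp": int(now if now is not None else time.time()),
    }


if __name__ == "__main__":
    sample = [
        "Jan  1 00:00:01 host CRON[1]: (root) CMD (nightly_backup)",
        "Jan  1 00:00:05 host nightly_backup: ERROR disk full",
    ]
    failures = count_failures(sample, "nightly_backup")
    # Report every minute, even when the count is 0, so there is always data.
    print(json.dumps(build_event("nightly_backup", failures)))
```

The important design choice is that the script reports a count every minute, including zero. That way the condition can alert on a failure count above zero, and a gap in the data is clearly a reporting problem rather than a job problem.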
I hope that this was helpful. I’m happy to discuss more ideas and best practices if you’d like any clarification.