
I need robust cron job monitoring

alerts
monitoring

#1

Hey guys,

At The Business Backer, we currently have no monitoring/alerting in place for our cron jobs. I need some help identifying how I might accomplish the following.

Here’s what we need to know from a monitoring perspective:

  1. Did the job start?
  2. What time did it start?
  3. Did it start at the expected time?
  4. Did the job complete?
  5. What time did the job complete?
  6. How long did the job take to run?
  7. If it did not complete, was there an exception thrown or some other failure?

Here’s what we need to be alerted on:

  1. The job ran, but didn’t run at the scheduled time
  2. The job didn’t run at all
  3. The job took longer than expected to finish
  4. The job failed

Email notifications should be sent out, and a HipChat notification would be great. Please let me know how this can be accomplished in New Relic. Thanks!

Evan


Is it possible to create New Relic alerts for the cron jobs?
#2

Hi @emathews,

A good way to achieve both your monitoring and alerting requirements would be to send up information about your cron jobs to Insights as custom events. Insights events are JSON objects made up of key-value pairs called attributes, so you could include all the information you are interested in as attributes of the event. For example:

{
   "eventType":"CronJob",
   "timestamp":1508862525,
   "expectedStartTime":1508862525,
   "completed":"true",
   "duration":161,
   "completedTime":1508862686,
   "exception":"string"
}

You can see the details on how to insert custom events using the Insights API on our docs site: https://docs.newrelic.com/docs/insights/insights-data-sources/custom-data/insert-custom-events-insights-api
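For example, a wrapper around the cron job could post the event with curl when the job finishes. This is only a sketch: YOUR_ACCOUNT_ID and YOUR_INSERT_KEY are placeholders for your own account ID and Insights insert key, and the attribute values would come from your job.

# Post a CronJob custom event to the Insights insert API when the job ends.
curl -X POST "https://insights-collector.newrelic.com/v1/accounts/YOUR_ACCOUNT_ID/events" \
  -H "Content-Type: application/json" \
  -H "X-Insert-Key: YOUR_INSERT_KEY" \
  -d '[{"eventType":"CronJob","expectedStartTime":1508862525,"completed":"true","duration":161,"completedTime":1508862686}]'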

You can then use NRQL alerting to create an alert condition based on this data. For example, to alert on the length of time the job took to complete, you could alert on the value of duration, as in the sketch below.
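A condition along these lines could work; this is just a sketch, and the threshold is an example value:

SELECT average(duration) FROM CronJob

with a static threshold that opens a violation whenever the result rises above the duration you expect.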

You can also use NRQL alerting to notify you if the job failed to run. If the job is expected to run once every hour, then you can simply count up the events in the past hour and send a notification if this number is less than 1, as in the sketch below.
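A sketch of such a condition (the 60-minute window is an example; it has to fit within the threshold limits discussed later in this thread):

SELECT count(*) FROM CronJob

with a threshold that opens a violation when the result stays below 1 for at least 60 minutes.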

To alert if the job didn’t run at the scheduled time, you could use a query like:

SELECT count(*) FROM CronJob WHERE timestamp != expectedStartTime

You can see instructions on setting up email and HipChat notification channels here: https://docs.newrelic.com/docs/alerts/new-relic-alerts/managing-notification-channels/notification-channels-controlling-where-send-alerts#channel-types

#3

There’s another service that specializes in this, Cronitor. It might be easier if you’re not already using Insights.


#4

Hi,

With the NRQL alert, it seems that the configurable threshold can only be expressed in “minutes” and that 120 minutes is the maximum.

How can we use this solution to monitor a cron job that runs once a day, for example?

Thanks,
Manu


#5

The first thing to mention here is that monitoring a job that runs once a day is a problematic use case for Alerts. We look at data every minute, and the best signals to monitor will be reporting data every minute. If the job only has one data point every 24 hours, then it can be challenging to write an Alerts condition to monitor it. Not impossible, but I would be remiss if I didn’t mention this from the very beginning.

The condition evaluates the query every minute, so if you care about every time the value falls below the threshold, you can make the condition’s duration much shorter, even as low as one minute. Think of this as a sliding window that looks at all of the data within that window as it moves through time. If you get a data point below your threshold, and you’ve set the condition to violate every time it dips below the threshold, then you only need to look at one minute at a time.

This duration also affects how long it takes a violation to close. You can think of the closing criteria as the logical negation of the opening criteria: if you open when you see any data point that violates within the duration, then to close you must see no data points that violate within the duration. A key fact here is that if we get no data points at all, we won’t know what’s happening.

Which brings me to another point that I think we need to discuss. It is going to work best if you report the count of failures, rather than reporting the number of successes. The most robust way to use New Relic NRQL Alerts to notify you of failures would be to have a service that is reporting at least once a minute on the status of your cron job. It could be as simple as a bash script that greps the cron logs and reports a count of the failures it sees. This lets you configure your threshold to look for the presence of something rather than the absence of it. This will give you a stronger reassurance that you’re getting notified for the specific failure scenario you’re testing for, and not for some transient networking outage or other issue that might cause the number of successes to dip below your expected value.
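As a sketch of that idea, something like the following could run from cron once a minute. The log path, the “FAILED” marker, and the placeholders are all assumptions about your environment, and a real script would also need to track which log lines it has already counted:

#!/bin/bash
# Sketch: count cron failures and report them to Insights once a minute.
# The log path and the "FAILED" marker are assumptions; adjust to taste.
LOG=/var/log/cron
ACCOUNT_ID=YOUR_ACCOUNT_ID
INSERT_KEY=YOUR_INSERT_KEY

# Count the lines in the log that look like failures.
failures=$(grep -c "FAILED" "$LOG")

# Report the count as a custom event via the Insights insert API.
curl -s -X POST "https://insights-collector.newrelic.com/v1/accounts/${ACCOUNT_ID}/events" \
  -H "Content-Type: application/json" \
  -H "X-Insert-Key: ${INSERT_KEY}" \
  -d "[{\"eventType\":\"CronJobStatus\",\"failureCount\":${failures}}]"

An alert condition could then watch something like SELECT latest(failureCount) FROM CronJobStatus with a threshold of “above 0”, so a violation opens only when a failure is actually observed.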

I hope that this was helpful. I’m happy to discuss more ideas and best practices if you’d like any clarification.


#6

Hi,

Great to see such a fast reply!

Thanks for giving some context and explaining how to implement a workaround; it all makes sense, even though I feel the approach is a bit counterintuitive. It treats a negative condition (“no failures”) as the positive result you’re looking for.

On a side question, is there anything on the NR roadmap to target cron jobs more specifically (and manage them in a bit more “natural” way)?

Cheers, Manu


#7

Nothing comes to mind off the top of my head but @NateHeinrich would be the definitive source of truth for questions about our roadmap.

I totally understand how you feel about the mental model, too. There are a lot of similar scenarios where the way we naturally think of a problem creates challenges for alerting. There’s a whole meta-discussion woven throughout the forum about Server Not Reporting/Host Not Reporting, process heartbeat/not running, etc., where the concept is stupidly straightforward but the implementation is very challenging. We spend a lot of time thinking and talking and engineering around the problem of “how do we know if the thing stopped working, or if this is just late data?”

The challenges around that are what prompted my suggestion. If every minute you’re reporting “no failures here!” then we can know that any time you do report a failure, it’s because it definitely happened. If, instead, you’re reporting successes and we get a minute where you stop reporting successes, it’s hard to know if that’s because the job stopped running, or if the host it was running on died, or if there was some sort of network disruption between the host and New Relic, or if the data is being buffered somewhere (which could be on the host if the agent is having problems connecting, or even within New Relic if we’re having something go wrong).

Whatever solution you decide to implement, I would encourage you to try and report data at least once per minute. If you decide to write a script to parse your cron job logs and use the custom event inserter, I would encourage you to report data a couple of times per minute so that you can be certain your data is timely. If you run into problems don’t hesitate to post here in the forum!


#8

Hi parrot,

I expect a better response from New Relic. The NR Alerting system is very good but needs some enhancements. It would be so simple if you developed the capability to schedule the verification of that condition; it would solve the problem.