We have a process that starts every 6 hours and runs for a few minutes. We want to be alerted if the process does not start at least once in 6 hours. Unfortunately, we cannot set the threshold duration to be longer than 60 minutes. Is this possible to solve with New Relic? I tried both NRQL and Infrastructure alerts, but I'm still unable to get anywhere with this.
Honestly, we don’t do a great job with very infrequent jobs. One thing that I’ve started doing is leveraging some bash scripting to add information about job success to Insights. I use Jenkins a lot for timed jobs and when we started having some failures I wanted to know more. I added a post-build step that uses curl to send a few pieces of data about the job to Insights for me.
I know that Jenkins is running because I have the Infrastructure agent monitoring that host, and I know the job is scheduled correctly, so I know it will be triggered. If the job fails, the post-build step sends SUCCESS: FALSE; if it succeeds, it sends SUCCESS: TRUE. This lets me create a NRQL alert condition that notifies me any time I get a failure.
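A minimal sketch of what that post-build step can look like. The event type `JenkinsJobResult`, the account ID, and the insert key below are placeholders, not values from this thread; substitute your own:

```shell
#!/usr/bin/env bash
# Hypothetical Jenkins post-build step: report job outcome to the
# Insights Insert API so a NRQL alert condition can watch for failures.
NR_ACCOUNT_ID="${NR_ACCOUNT_ID:-1234567}"        # placeholder account id
NR_INSERT_KEY="${NR_INSERT_KEY:-YOUR_INSERT_KEY}" # placeholder insert key

# Build the custom-event payload. "eventType" becomes the table name
# you query in NRQL (e.g. SELECT * FROM JenkinsJobResult).
build_payload() {
  local success="$1"
  printf '[{"eventType":"JenkinsJobResult","jobName":"%s","success":%s,"timestamp":%s}]' \
    "${JOB_NAME:-scheduledJob}" "$success" "$(date +%s)"
}

# POST the event to the Insights Insert API.
report_result() {
  curl -s -X POST \
    "https://insights-collector.newrelic.com/v1/accounts/${NR_ACCOUNT_ID}/events" \
    -H "Content-Type: application/json" \
    -H "X-Insert-Key: ${NR_INSERT_KEY}" \
    -d "$(build_payload "$1")"
}

# In Jenkins: call `report_result true` on success, `report_result false`
# in the failure branch of the post-build step.
```

The alert condition can then be something like `SELECT count(*) FROM JenkinsJobResult WHERE success = false`, firing on anything above zero.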
This is just one example of using a little bash to help out. Without knowing anything else about how your job is running, or what might cause it to fail, I hope this idea helps.
If I’m off track here, please let me know some more about how you’re running the job, what might cause it to fail, etc.
Thanks for the reply. The process doesn’t fail, actually - it’s a job that runs every 6 hours by design and completes quickly. We just need to know if it doesn’t run for some reason. This is on Windows, and we’re a bit limited in what we can do on that server in terms of post-run scripting, etc.
Here’s the query I’m using:
SELECT uniqueCount(entityId) FROM ProcessSample WHERE entityName LIKE '%JOBSERVER%' AND processDisplayName = 'scheduledJob.exe'
I tried to set it up a different way today: alert me if the query returns less than 1 for at least N hours. Unfortunately, the longest time period I can select is 2 hours, but the job runs every 6 hours. Ideally I'd set it to 7 hours and get an alert if a run was skipped.
Can I make this a feature request, please? It would be great to increase the timespan that an alert is able to evaluate.
We take feature ideas seriously, and our product managers review every one when planning their roadmaps. While there is no guarantee this feature will be implemented, this post ensures the idea is put on the table and discussed. So please vote and share your extra details with our team.
I also need the threshold to be settable for more than 60 minutes, as it was in the SERVERS module.
My backup runs for almost 120 minutes, and I'm monitoring for high disk I/O.
In the SERVERS module my alarm was set up as: Disk Utilization > 95% for 120 min.
In INFRASTRUCTURE I cannot set this up.
I would really like the ability to set threshold durations greater than 2 hours. I have a job that runs every 24 hours, and I would also like to use New Relic alerts to tell me if the job doesn't run. Having the ability to specify arbitrary threshold durations would be phenomenal.
Why is there any limit at all? This does not make sense to me.
Like others, we have processes that run infrequently that we cannot monitor.
From a load perspective it would even be better for NR if the evaluation window could span several hours, days, or weeks.
Quite a few of us are requesting this now. Can we expect some feedback from NR?
Serving the customer base is not complicated in this case: no new functionality is needed, just more options in the GUI. I'd guess it would take less than 10 minutes to fix.
Same boat here, but 2 years later.
Thanks for adding a +1 here @Tim.Walter - got that logged for you internally!
I want to share that, over the course of the next month, we are releasing official support for "loss of signal detection" for NRQL conditions. Loss of Signal configuration allows you to set an expiration duration, in wall-clock time, measured from the time we received the last data point. Once that time expires, we identify the signal as lost. When that happens, we allow two actions:
- close all open violations
- create a new violation (and resultant notification) for “Loss of Signal”
The maximum expiration duration we will support to start with is 48 hours. Therefore, you should be able to use this feature for being alerted if infrequent jobs do not report data within the expected time frame. If you expect a service to report every 6 hours, you can set an expiration duration of 6 hours and 15 minutes (for example), and be notified if that signal is lost.
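As a sketch of how that configuration might look once released, here is a shell snippet that assembles a NerdGraph mutation setting the expiration to 6 hours 15 minutes. The mutation name, field names, account ID, and condition ID below are assumptions for illustration, not confirmed by this thread; check the NerdGraph schema for the actual shape:

```shell
#!/usr/bin/env bash
# Sketch: configure loss-of-signal expiration on a NRQL alert condition
# via NerdGraph. All identifiers below are placeholders/assumptions.
NR_API_KEY="${NR_API_KEY:-YOUR_USER_API_KEY}"   # placeholder API key

# 6 hours 15 minutes expressed in seconds: 6*3600 + 15*60 = 22500
EXPIRATION_SECONDS=$(( 6 * 3600 + 15 * 60 ))

# Assumed mutation shape; GraphQL supports '#' comments.
MUTATION=$(cat <<EOF
mutation {
  alertsNrqlConditionStaticUpdate(
    accountId: 1234567       # placeholder account id
    id: "CONDITION_ID"       # placeholder condition id
    condition: {
      expiration: {
        expirationDuration: ${EXPIRATION_SECONDS}
        openViolationOnExpiration: true
        closeViolationsOnExpiration: true
      }
    }
  ) { id }
}
EOF
)

# The actual POST to https://api.newrelic.com/graphql would send
# {"query": "<MUTATION>"} with an API-Key header; omitted here so the
# sketch has no side effects.
echo "$MUTATION"
```

With `openViolationOnExpiration: true`, a job that normally reports every 6 hours would open a violation (and notify) if no data point arrives within 6 hours 15 minutes.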
After we release this, we will reassess this use case to determine if another method is still required.
We should have Loss of Signal Detection fully rolled out by the end of July 2020, and will then begin the rollout of Gap Filling capabilities.
Product Manager - AI Ops