Has anyone experienced a problem where a monitor that is set up as a ping is disabled but an alert is still triggered? The monitor had been disabled for 30 minutes before the site it was pinging went offline for maintenance, but the alert still triggered.
How was the ping synthetic disabled? I have come across occasions where a user disabled the synthetic from the monitor list and the UI indicated it was disabled when it actually was not. Refreshing the screen showed the correct status.
This is possibly due to an Ajax request from the UI not completing while the UI changes state anyway.
If you can share the method you used to disable the monitor (disabled status/monitor-downtime), as @stefan_garnham asked, along with a link to the monitor and the unexpected failed check, I’d be happy to review it.
Yes! A hundred times, yes!
I created some posts in the past about this:
- Disabled Synthetic site monitor still throws an alert
- Feature Idea: Monitor check status before running or alerting
It created alert fatigue on my team. We have automation that disables individual monitors (via the API), performs maintenance on the monitored site, and then resumes monitoring. Each time we hit this issue, I increased the amount of time we sleep before performing the maintenance. Initially it was sleeping for 30 seconds, but it still happened. Then 60 seconds: it still happened, but less frequently. Now we are sleeping for 80 seconds, and I think we are good, but it also means waiting 80 seconds before doing anything. Multiply 80 seconds by the 300+ site maintenance/upgrade runs we did last month, and that adds up to a lot of idle time spent waiting just to avoid false alerts on an already disabled monitor.
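For anyone scripting the same workaround, here is a rough sketch of the disable, wait, maintain, resume sequence, plus the idle-time arithmetic. It is not our actual automation: `disable_monitor`, `run_maintenance`, and `enable_monitor` are hypothetical callables that you would wire to your own API client, and the 300 figure is just the lower bound from the "300+" above.

```python
import time
from typing import Callable


def maintain_with_grace_period(
    disable_monitor: Callable[[], None],   # hypothetical: wraps your API call
    run_maintenance: Callable[[], None],   # hypothetical: your maintenance step
    enable_monitor: Callable[[], None],    # hypothetical: wraps your API call
    grace_seconds: float = 80.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Disable the monitor, wait out already-queued checks, do the work, resume."""
    disable_monitor()
    try:
        # Checks queued before the disable may still run, so wait before
        # taking the site down.
        sleep(grace_seconds)
        run_maintenance()
    finally:
        # Always resume monitoring, even if maintenance raised an error.
        enable_monitor()


def total_idle_seconds(grace_seconds: float, runs: int) -> float:
    """Time spent only on waiting across all maintenance runs."""
    return grace_seconds * runs


idle = total_idle_seconds(80, 300)
print(f"{idle / 3600:.1f} hours idle")  # prints "6.7 hours idle"
```

The `sleep` parameter is injectable only so the sequence can be exercised without actually waiting 80 seconds.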
Somewhere in a thread or private support ticket, a New Relic rep stated that once a monitor check has been queued up and sent to the probe location, it can take time to spin up the container that it runs in. 60+ seconds seems like a long time for a container to spin up, which is what led to my feature request linked above: have the probe check back with the master server/DB to see whether the monitor is in a disabled state before sounding the alarms.
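To make the feature request concrete, here is a minimal sketch of the check-back idea. It is purely illustrative and says nothing about how the probes are actually implemented: `CheckResult`, `is_monitor_enabled`, and `send_alert` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    """Hypothetical outcome of one probe run."""
    monitor_id: str
    success: bool


def alert_if_still_enabled(
    result: CheckResult,
    is_monitor_enabled: Callable[[str], bool],  # hypothetical lookup against the server/DB
    send_alert: Callable[[CheckResult], None],  # hypothetical alert hook
) -> bool:
    """Return True only if an alert was actually sent."""
    if result.success:
        return False
    # Re-read the monitor state at alert time, not at queue time, so a
    # monitor disabled while the check was in flight stays silent.
    if not is_monitor_enabled(result.monitor_id):
        return False
    send_alert(result)
    return True
```

With something like this, a check that was queued before the disable could still run and fail, but the failure would be dropped instead of paging anyone.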