We experienced unexpected behavior that ultimately had a Sev1 impact on our mission-critical workloads.
We have the usual synthetics monitoring that pings the health check endpoint of our core service and sends notifications through our existing Slack/PagerDuty integrations.
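For context, the check itself is simple. The sketch below is a rough TypeScript equivalent of what the synthetic does (the URL and names are hypothetical placeholders, not our actual monitor script or New Relic's API): hit the health endpoint and treat anything other than HTTP 200 as a failure.

```typescript
// Hypothetical placeholder URL, not our real endpoint.
const HEALTH_URL = "https://example.com/healthcheck";

async function checkHealth(): Promise<void> {
  const started = Date.now();
  try {
    const res = await fetch(HEALTH_URL);
    const elapsedMs = Date.now() - started;

    if (res.status !== 200) {
      // In our real setup this is where New Relic opens an incident,
      // and the Slack/PagerDuty integrations should notify immediately.
      console.error(`Health check failed: HTTP ${res.status} after ${elapsedMs} ms`);
      process.exitCode = 1;
    } else {
      console.log(`Health check OK in ${elapsedMs} ms`);
    }
  } catch (err) {
    // Network-level failures (e.g., the app crashed) also count as unhealthy.
    console.error(`Health check errored: ${err}`);
    process.exitCode = 1;
  }
}

checkHealth();
```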
The issue: when an infrastructure failure occurred (crashing our app), the synthetics monitoring alerted as expected that our health check was returning a non-200 code, and the incident was created,
but the associated notifications were not triggered until 40 minutes afterwards…
Since we have several alerts linked to these workloads, a domino effect followed: not only did the health check alert fire, but several others did as well (no DB connections, 0 incoming requests, etc.). Again, all the associated notifications arrived 40+ minutes afterwards.
If it weren't for our support team, who actively checked our site and reported that the workload was not responding, we would have had 40+ minutes of undetected downtime.
"How useful is a monitoring tool that won't alert us instantly when something goes wrong?" was one of the questions raised in this incident's postmortem… and it's a hard question for me to answer.
Link to one of the incidents that failed to notify in a timely manner, for New Relic's support team to look into: