Critical New Relic Notification Bug (40+ minute notification delay!)

Hello all,

We experienced unexpected behavior that ultimately had a Sev1 impact on our mission-critical workloads.

We have the usual Synthetics monitoring that pings the health check method of our core service and notifies us through the existing Slack/PagerDuty integrations.
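For context, the check is conceptually no more than the following. This is only a minimal sketch in Python, not our actual monitor script: `is_healthy` and `probe` are illustrative names, and the URL you pass in is whatever your health check endpoint happens to be.

```python
import http.client
import urllib.request


def is_healthy(status_code: int) -> bool:
    """Only HTTP 200 counts as healthy; any other code should raise an alert."""
    return status_code == 200


def probe(url: str, timeout: float = 5.0) -> bool:
    """Request the health check URL; errors and non-200 codes count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.status)
    except (OSError, http.client.HTTPException):
        # urllib's HTTPError (4xx/5xx), URLError, timeouts and refused
        # connections are all OSError subclasses, so they end up here.
        return False
```

A failing `probe` result is what should open the incident and, in our setup, trigger the Slack/PagerDuty notification.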

The issue was that when an infrastructure problem occurred (crashing our app), the Synthetics monitoring detected, as expected, that our health check was returning a non-200 code, and the incident was created, but the associated notifications were not triggered until 40 minutes later.

Since we had several alerts linked to our workloads, there was a domino effect at the time (not only did the health check alert fire, but several others did too, such as no DB connections, 0 incoming requests, etc.), but again, all the associated notifications were triggered 40+ minutes later.

If it weren’t for our support team, who actively checked our site and reported that the workload was not handling requests, we would have had 40+ minutes of downtime.

One of the questions that came out of this incident’s postmortem was: how useful is a monitoring tool that won’t alert us immediately when something goes wrong? That is a hard question for me to answer.

Here is a link to one of the incidents that failed to notify in a timely manner, for New Relic’s support team to look into:
Incident link

Thank you

Hi @julian.gonzalez,

Thanks for bringing this issue to our attention; I definitely appreciate that outages impact your ability to monitor your workloads.

It sounds like you’re referring to an incident from yesterday, so I wanted to share some information on this:

Between 00:00 and 06:54 UTC, some customers experienced UI inconsistencies, missed alert notifications, and gaps in charts and graphs. Data will not be recovered. Notifications were impacted for some customers between 00:34 and 06:54 UTC. We’ve resolved the issue with UI and alerts for multiple products in the US region. Services have returned to their normal operation and you can review our status page here: New Relic Status - Alerts & UI Irregularities for US Region.

If you’d like, you can also subscribe to our status page for timely updates on New Relic’s platform.

Hey @tpaul1, I appreciate the follow up on this.

The issue happened on October 8th, at 1pm (GMT-3).

I checked your status history, and at that time you were having an incident on this same topic (New Relic Status - Infra Alerts/UI Delays for US Region).
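In case it helps anyone comparing timestamps: the window quoted above is given in UTC, while the time I gave is in GMT-3. Below is a minimal Python sketch of the conversion. The `to_utc` helper is just an illustrative name, and the year is a placeholder, since only the day and hour matter here.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for GMT-3, the time zone used in this post.
GMT_MINUS_3 = timezone(timedelta(hours=-3))


def to_utc(local_time: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC."""
    return local_time.astimezone(timezone.utc)


# Placeholder year (2000); the thread only states October 8th at 1pm.
incident_local = datetime(2000, 10, 8, 13, 0, tzinfo=GMT_MINUS_3)
incident_utc = to_utc(incident_local)
print(incident_utc.strftime("%b %d, %H:%M UTC"))  # Oct 08, 16:00 UTC
```

So 1pm (GMT-3) is 16:00 UTC, which falls outside the 00:00 to 06:54 UTC window mentioned above.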

We’ll subscribe to your status page updates so we can proactively be on the lookout for this kind of situation.

Thank you

Hi @julian.gonzalez,

You are most welcome, I am happy to help :slightly_smiling_face:

Thank you for the update. Please let me know if I can help with anything else!