Is there a way to separate real alerts from false ones that are caused by NR issues?

There have recently been several NR incidents that caused false alerts across several platforms.
Due to those incidents, my manager had me stop some of our reporting policies because we were being swamped with false alerts.
This morning we missed an actual GCP outage because the alerts for it were not sent through the policies that we had stopped.

Is there a way to determine which alerts are due to NR incidents, so that those false alerts are not sent through our policies while the real ones still are and we do not miss actual outages?

Thank you for your question, TripArtrip. I understand that recent New Relic incidents in which you received false alerts have led you to lose confidence in your alert policies and disable them. Unfortunately, disabling these policies resulted in you missing an important alert when you did have an issue. When your systems are down, you expect New Relic to alert you. When they’re fine, the last thing you expect is for New Relic to send you a false alert.

We are working diligently to address sources of instability in our platform so that you receive the alerts you’re supposed to receive, and no false alerts. At the same time, we know some issues are inevitable, so we also have mechanisms for suppressing false alerts, which we use whenever we can. Clearly, these suppression mechanisms are not perfect, and we continue to refine them. However, we are making our best effort to suppress false alerts, so I recommend you keep enabled the alert policies that you rely on to notify you of important issues. We apologize for our incidents of instability, especially when those have resulted in false alerts getting through to customers like you.

One action you can take, however, to reduce the chances of receiving some kinds of false alerts, is to check the loss-of-signal settings on your NRQL alert conditions. Some false alerts are caused when data flow is interrupted in New Relic’s infrastructure, resulting in the alerting system triggering a loss-of-signal alert. You can read our documentation on how loss-of-signal alerts work, and decide if you really need that kind of detection for your use case, disabling loss-of-signal where it is unnecessary.
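
As a rough illustration of that review (a sketch only, not an official New Relic tool), the Python below scans a list of condition records and flags the ones that would open an incident when their signal is lost. The record layout is an assumption: it supposes you have exported your NRQL conditions into dictionaries with an `expiration` block containing `openViolationOnExpiration` and `expirationDuration`. The sample names and values are hypothetical.

```python
from typing import Any


def conditions_opening_on_signal_loss(conditions: list[dict[str, Any]]) -> list[str]:
    """Return names of conditions that open an incident on loss of signal.

    Assumes each record has a "name" and an optional "expiration" dict
    with an "openViolationOnExpiration" flag (assumed export format).
    """
    flagged = []
    for cond in conditions:
        # A missing or null expiration block means loss-of-signal is not configured.
        expiration = cond.get("expiration") or {}
        if expiration.get("openViolationOnExpiration"):
            flagged.append(cond["name"])
    return flagged


# Hypothetical sample data for illustration only.
sample_conditions = [
    {"name": "Site A uptime",
     "expiration": {"openViolationOnExpiration": True, "expirationDuration": 600}},
    {"name": "Site A error rate",
     "expiration": {"openViolationOnExpiration": False, "expirationDuration": 600}},
    {"name": "Site B latency", "expiration": None},
]

for name in conditions_opening_on_signal_loss(sample_conditions):
    print(f"Review loss-of-signal setting on: {name}")
```

Each flagged condition is a candidate for review: keep loss-of-signal detection where silence really does mean an outage, and turn it off where it does not.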


Thank you for suggesting actions I can take. I will definitely look into the loss-of-signal settings. Also, thank you for the link to the documentation.
One of the actions I have taken is to consolidate our alerts. I am covering 12 different customer sites that have 4-5 possible alerts each. The sheer number of alerts that we received during a false-alert incident was overwhelming. By consolidating them to 1 per policy, I have cut down the volume considerably. Between this and your loss-of-signal suggestion, I think I will be able to enable the reporting to our Slack channel again.