This post is now deprecated - replaced by Host Not Reporting alerts in Infrastructure
A lot of the questions in our Alerts topic center on the future of Server/Agent Not Reporting conditions. It’s a feature that a lot of people use heavily, but it has also been the source of many notifications that people consider false positives, even when the violations adhere to a strict interpretation of what Server Not Reporting means. I’d like to take a minute to talk about that strict interpretation before addressing some of the problems with it.
First off, we’re not saying that your server was down. I think it’s important to make this distinction because confusion over it has prompted a lot of conversations during my time with New Relic. Instead, what these notifications mean is that, for whatever reason, New Relic was unable to collect data from your server for the period of time you’ve specified in your condition. There are a lot of problems inherent in that mechanism, though. One we see pretty commonly is that transient network issues between our infrastructure and yours can lead to data not being sent by the agent or not being received in a timely manner.
For instance, if your infrastructure is located somewhere in the EU, there are a limited number of routes for the data to take across the Atlantic. If a node goes down along the route your traffic was taking, it can take time to find a new route, and your traffic must then share that route with both its existing traffic and the other traffic that was displaced from the same route as yours. This can set off a cascade of false positive alarms for customers, and when you’ve got thousands of servers, owned by dozens of teams, in a handful of countries, you’re starting to talk about getting a lot of people out of bed.
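To make the strict interpretation concrete, here is a minimal sketch of a "not reporting" check. It is only an illustration, not how New Relic implements the feature: the `is_not_reporting` helper, the five-minute threshold, and the timestamps are all hypothetical. The point is that the check only knows when data last arrived, so a healthy server whose data is delayed in transit looks identical to one that has stopped sending.

```python
from datetime import datetime, timedelta


def is_not_reporting(last_data_received, now, threshold):
    """Hypothetical check: True when no data has arrived within the threshold.

    It says nothing about whether the server itself is up; it only compares
    the time of the last received data against the configured window.
    """
    return now - last_data_received > threshold


# Assumed condition setting: notify after 5 minutes without data.
threshold = timedelta(minutes=5)
now = datetime(2016, 1, 1, 3, 10)

# A healthy server whose data has been stuck behind a rerouted link for
# 8 minutes still trips the condition.
delayed_but_healthy = is_not_reporting(datetime(2016, 1, 1, 3, 2), now, threshold)

# A server whose data arrived 2 minutes ago does not.
reporting_normally = is_not_reporting(datetime(2016, 1, 1, 3, 8), now, threshold)

print(delayed_but_healthy, reporting_normally)
```

Raising the threshold in a sketch like this trades fewer false positives for slower detection of real outages, which is part of why a fixed timeout alone is a blunt instrument.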
To circle back, this is absolutely a period where your server was unable to report, so we’ll send out Server Not Reporting notifications. Your server might have been perfectly healthy and even serving traffic the entire time, though. We want to ensure that we’re only notifying you when there really is a problem with your infrastructure, so that you and your teams can trust that it’s worth getting out of bed in the middle of the night. We also know how hard noisy alarms can be for on-call engineers, and we’re trying not to be the app that cried wolf. We’re working hard to find an intelligent, scalable solution that walks that line without burdening an ops team with too much manual configuration.
The “notify me when you didn’t hear from this server in a while” functionality is still available today, but we’ve learned a lot from seeing how it works and how it fails. It didn’t make sense to put a fresh coat of paint on the existing fault-intolerant implementation and shoehorn it into new Alerts, particularly since it continues to be an option in our legacy alerting tools. Because we realize that this is “good enough” for some use cases, we didn’t want to remove a monitoring tool that people depend on, but we want to move forward with our efforts and build it smarter.
The reality is that it takes a lot of time and effort to build smart tools that can function for all of our customers simultaneously. We want to be both transparent and honest when we’re communicating and sometimes that means not saying something. The roadmap for Alerts and the timeline for when we think that we will reach certain goals along it are constantly shifting. We can’t always share the roadmap or publish something that could be construed as a promise.
Progress can be unpredictable and priorities shift over time. Sometimes we decide to solve a problem using a new tool that isn’t available or using a novel feature that isn’t ready to be discussed in a public forum. I know it’s frustrating to not be able to see inside the processes of New Relic or the heads of our product managers when you’re waiting for a feature that is crucial to your operations. When we can update everyone we will do so in the shortest time frame possible.
For those still interested in configuring Server Not Reporting conditions using the legacy alerting system, you can find instructions for that in our public docs.
If you’ve got feedback about this feature please leave it in the poll that @NateHeinrich created for that purpose.