This morning’s New Relic issue was a nice heart attack. We do use multiple systems to monitor for outages on our platform, but New Relic is a key tool for us. The APM and Data Collectors have been acting up too often recently. What is happening with stability on the New Relic platform? I’m losing trust in the data I see from NR.
I want to apologise for any impact this interruption had on you, and your trust in New Relic.
I understand that service interruptions like the one we faced today are very frustrating, especially on services that you rely on. They are for us too!
With that said, I do want to assure you that we do a huge amount of work around system reliability, and consistently work hard to ensure we have as few outages as possible. For each incident, we run a full retro to go back over the events, narrow down what went wrong, find a root cause, and ensure that we don’t have incidents with the same cause again.
That doesn’t take away from the fact that service interruptions happen and, no matter how severe, they can have an impact on your business if you are utilising New Relic at the time of the interruption. We do want to keep you, the user, as up to date as we can, so our Support team mobilises to work closely with engineering teams to ensure we update status.newrelic.com with the most accurate information at any given time.
In this case, that is what happened: our engineers were notified as soon as the problem began, our Support team worked closely with them while they resolved the issue, and I can confirm that services are now back to 100% functionality.
Please feel free to DM me here if you have any additional feedback we can take on board.
Please also note that I have shared this post with your account team, so they may reach out to discuss your thoughts further.