Long time to resolve an incident

Hi,

We had two incidents today that took much longer to resolve than they should have, and each resulted in a call to the on-call person. Note: only one monitor ping failed.

I’m not sure whether the problem was on the New Relic side or the VictorOps side.

How can I diagnose the root cause of this problem?

Do you mean the root cause of a problem with the app, or of a problem with Synthetics?

How were your pings from other locations at the same time?

I’ve been seeing network issues on the west coast over the last couple of days.

Hey,

With Synthetics. My understanding is that when an NR ping fails, it notifies VictorOps, which then creates an incident. Every incident waits 20 minutes before it gets escalated, and this should be plenty of time for a subsequent successful ping to be reported.

In this case, it took over half an hour for the incident to get resolved.

Hi @pawel.pabich,

I can’t say that I’m too familiar with VictorOps, but I would assume that the workflow goes something like this (based on supporting Synthetics/Alerts and similar notification systems like PagerDuty):

  1. Synthetics monitor fails
  2. Synthetics opens a violation (and an incident when applicable) in Alerts.
  3. When the incident opens, Alerts sends a notification to the service or notification channel (e.g. VictorOps, email, webhook, etc.)
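
To make the three steps above concrete, here is a rough sketch in Python. It is only an illustration of the assumed workflow, not New Relic's or VictorOps' actual implementation: the `Alerts` class, the `run_check` function and the monitor name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alerts:
    """Hypothetical stand-in for the Alerts side of the workflow."""
    channels: list[str]
    sent: list[str] = field(default_factory=list)
    incident_open: bool = False

    def open_violation(self, monitor: str) -> None:
        # Step 2: a failed check opens a violation; an incident is opened
        # only if one is not already open (the "when applicable" part).
        if not self.incident_open:
            self.incident_open = True
            self.notify(f"incident opened for {monitor}")

    def notify(self, message: str) -> None:
        # Step 3: every configured channel receives the notification.
        for channel in self.channels:
            self.sent.append(f"{channel}: {message}")


def run_check(alerts: Alerts, monitor: str, succeeded: bool) -> None:
    # Step 1: only a failed check triggers the Alerts side.
    if not succeeded:
        alerts.open_violation(monitor)


alerts = Alerts(channels=["VictorOps"])
run_check(alerts, "ping-monitor", succeeded=False)
print(alerts.sent)
```

In this sketch a single failed check produces exactly one notification per channel, which matches the "only one ping failed" situation described earlier in the thread.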

That said, I suspect that this may actually be an issue with Alerts. If you can track down the incident that corresponds to this failure, it might reveal a bit more information. Specifically, it’ll allow you to see which notifications were sent and where, what the timestamps were, and so on.

If you’re specifically looking to troubleshoot on the Synthetics side, you should be able to navigate to the specific failure to see more info. The extent to which verbose logs are available depends on the monitor, so there may be a limited amount of information that you can gather here with a ping monitor.

I’ll reiterate that I’m not overly familiar with VictorOps, but this should hopefully provide a bit more context with regard to how you could troubleshoot this. Any thoughts? Let me know!

Thanks.

I had a look at the data in Alerts & AI and it looks like the problem was on the New Relic side. If I read the data correctly, the Close event was raised at 11:33 am, which is ~40 minutes after the incident was opened. Note: only one ping failed, so roughly 25-30 successful pings were ignored here.
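
For reference, the timeline arithmetic can be checked with a few lines of Python. This is just a sketch: the 20-minute escalation delay, the ~40-minute duration and the 11:33 am close time are the figures from this thread, while the date is an arbitrary placeholder and the variable names are my own.

```python
from datetime import datetime, timedelta

# Figures quoted in this thread.
ESCALATION_DELAY = timedelta(minutes=20)
TIME_TO_CLOSE = timedelta(minutes=40)  # approximate ("~40 minutes")

# The date is a placeholder; only the time of day (11:33 am) is from the thread.
closed_at = datetime(2000, 1, 1, 11, 33)

opened_at = closed_at - TIME_TO_CLOSE          # when the incident was opened
escalates_at = opened_at + ESCALATION_DELAY    # when escalation would kick in
was_escalated = closed_at > escalates_at       # closed only after escalation?

print(opened_at.strftime("%H:%M"))     # 10:53
print(escalates_at.strftime("%H:%M"))  # 11:13
print(was_escalated)                   # True
```

So, under these figures, the incident would have been opened at about 10:53 and escalated at about 11:13, roughly 20 minutes before the Close event arrived.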

Hey @pawel.pabich, I can’t definitively say if this stems from an issue on the New Relic side but I’d be happy to investigate this further for you. Could you provide me with a permalink to this incident here so I can take a look? Thanks!


The link includes our account ID, which I would rather not share here. Isn’t the incident ID, shown on the screen, enough? Is there a way I can provide this information without making it publicly available on the Internet?

Thanks

Pawel

@pawel.pabich you could DM @Masen the link instead if you do not want to put it in this thread.

Good idea. Done…
