Synthetic monitor still active even after being disabled

Hi,

I have a monitor set up that is disabled, but the monitor itself is still active.
I thought that disabling a monitor would stop it from pinging?


Regards
John

Hey @john.simons, this may have been the result of a race condition between our web interface and the service that actually controls the monitor’s settings.

Enabling and disabling the monitor again should get it to stop, but please let me know if that doesn’t do the trick and we can investigate further.

@jeffrey_s, we are managing 1500 endpoints using your API. We don’t use the UI at all.

When this happens, the person on call gets woken up in the middle of the night by VictorOps, which we have integrated with New Relic.

A few questions:

  1. Can this race condition be eliminated on your side?
  2. Can you extend the API so we can indicate that we want to pause a monitor and resolve all outstanding incidents?
  3. Open to other ideas

I checked this monitor today and it is still active even though it was disabled 5 days ago. It looks like the system did not recover from that race condition.

Is there a way to stop it? It creates a bit of unnecessary noise on our side.

Thanks

Hey @pawel.pabich

@acuffe and I looked at this after we saw your tweet today. We have played with a few monitors and the API. We are sending a PATCH call that looks like this:

curl -v \
     -X PATCH \
     -H 'X-Api-Key:NRAA-myAdminApiKey' \
     -H 'Content-Type: application/json' \
     -d '{ "status" : "DISABLED" }' \
     https://synthetics.newrelic.com/synthetics/api/v3/monitors/my-monitor-id

And this works the first time for us. We see the monitor is disabled, and this is also confirmed by the following query:

SELECT latest(timestamp) FROM SyntheticCheck WHERE monitorId = 'my-monitor-id'
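
As a rough illustration of how that query result can be used, the sketch below compares the latest check time with the time the monitor was disabled. It is only a sketch: the function names and the two-minute grace period are hypothetical, and it assumes the timestamp comes back as epoch milliseconds.

```python
from datetime import datetime, timedelta, timezone


def from_epoch_ms(ms):
    """Convert an epoch-milliseconds value (assumed format) to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def ran_after_disable(latest_check, disabled_at, grace=timedelta(minutes=2)):
    """Return True if a check ran later than disabled_at plus the grace period.

    The grace period (an arbitrary illustrative value) allows for a check
    that was already in flight when the monitor was disabled.
    """
    return latest_check > disabled_at + grace
```

A result of True for a monitor disabled days ago would match the behaviour reported in this thread.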

So I believe Jeffrey is right here: there must be some strange race condition, one we can’t replicate, that may be preventing the API from disabling your monitors.

Could you share in detail exactly how your API calls run? If we can replicate this then it will be easier for us to help our engineering teams find a solution.

Hi @RyanVeitch,

Sure. I’m happy to provide any details you need.

This is how we disable a monitor (in pseudo code):

  1. var monitor = GetMonitor(id)
  2. monitor.status = 'DISABLED'
  3. UpdateMonitor(monitor)
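
A minimal Python sketch of these three steps, for anyone trying to reproduce the flow. The `get_monitor` and `update_monitor` callables are hypothetical stand-ins for whatever client code talks to the API; they are passed in so the sketch can be run against an in-memory store.

```python
def disable_monitor(monitor_id, get_monitor, update_monitor):
    """Read-modify-write: fetch the monitor, set its status, send it back.

    get_monitor(monitor_id) -> dict and update_monitor(monitor_id, dict)
    are hypothetical callables supplied by the caller.
    """
    monitor = dict(get_monitor(monitor_id))  # step 1: copy the current definition
    monitor["status"] = "DISABLED"           # step 2: change only the status locally
    update_monitor(monitor_id, monitor)      # step 3: send the whole definition back
    return monitor
```

Note that step 3 sends the full definition read in step 1, so any change made to the monitor between those two steps would be overwritten. Whether that plays a part in the problem described here is not known.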

We are managing around 1500 monitors (one per instance) and have seen this problem only a few times. That being said, this is just one of many “sync” problems we’ve been hit by.

From where we sit, it looks like something changed recently in the way NR processes requests, and code that was rock solid for 6 months is now failing at least once a week.

@pawel.pabich OK, I’ll ask internally whether any changes have occurred around the Synthetics REST API, but could you elaborate on “one of many sync problems we’ve been hit by”?

If this is a related issue, and we can work out how it’s happening and especially replicate it, then we can get more traction on a solution in engineering.

When you run this on 1500 monitors, is it all at once in a large script? Or does the script run individually, as required, on specific monitors, working fine 50 times and then failing to update on the 51st? I’d like to narrow the scope around the issue and get to the root.

Hi,

2 more examples of “sync” problems

“When you run this on 1500 monitors, is it all at once in a large script?”

We don’t create 1500 monitors on a regular basis. We have 1500 instances of our application running in the cloud.

We create ~20 monitors a day, we disable ~15 a day and we delete ~10 a day.

It looks like there are multiple backend systems and sometimes messages between them are either lost or processed out of order.

Thanks for the update here @pawel.pabich

I’ll do some more testing to try to replicate this. And once we have that, we’ll make sure the engineering team sees this. :)

Hey @pawel.pabich

I did some testing earlier this morning and some yesterday too.

I set up a script to run UPDATE curl commands on 6 monitors every few minutes. While this ran for a few hours, I had a dashboard showing two things: 1) the latest timestamp of each monitor run, and 2) the total count of currently running monitors.

Over the few hours that this was running, the monitors never went out of sync. There was never a period where a monitor was running when it shouldn’t be, or not running when it should be.
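
The two failure cases above can be expressed as a small check. This is a hedged sketch, not the script or dashboard actually used: the function name, the input shape, and the slack value are all hypothetical, and it assumes each monitor’s status has been stable for longer than one check period plus the slack.

```python
def find_out_of_sync(monitors, period_seconds, slack_seconds=60):
    """Split monitors into those running while disabled and those idle while enabled.

    monitors maps a monitor id to (enabled, seconds_since_last_check).
    A monitor counts as "running" if it checked in within one period plus slack.
    """
    limit = period_seconds + slack_seconds
    running_when_disabled = []
    idle_when_enabled = []
    for monitor_id, (enabled, age) in sorted(monitors.items()):
        if not enabled and age <= limit:
            running_when_disabled.append(monitor_id)  # still pinging after disable
        elif enabled and age > limit:
            idle_when_enabled.append(monitor_id)      # should be pinging but is not
    return running_when_disabled, idle_when_enabled
```

Both returned lists staying empty corresponds to the "never went out of sync" result described above.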

So we’re kinda stuck here. Like I said before, it’s far easier for us to engage the engineering team on a bug report when the issue is replicable.

We can’t replicate this, but I’d still like to get you in touch with folks who can help. I’m going to get you into a support ticket. My hope is that my colleagues in Support can either try replicating this themselves or work with our engineering team to have them replicate it. You should see that ticket in your email inbox.

Thanks for your patience while I looked into that with @acuffe. Do let us know what you learn in your ticket.

Thank you. I appreciate the effort.

I will definitely let you know when we see it again, and I will make sure the affected monitor(s) are left untouched for you to examine.

Regards,

Pawel