Relic Solution: Understanding the "Three Strikes" Behavior in Synthetics

Synthetics monitors can fail for a variety of reasons. Broadly speaking, Synthetics gives a user visibility into when and why their application might be failing to load or work as anticipated. Monitor failures can be useful indicators of application performance, but what is a user to do when they want to avoid excess noise?

One question that New Relic Support sees fairly often is whether it is possible to create some sort of “three strike” condition for monitor failures. In other words, how can a user avoid excess noise that might come from a momentary hiccup or some other acute issue? What if a user could make it so that a failure isn’t recorded unless a monitor fails three times in a row instead of once?

It can be easy to miss, but Synthetics already does this out of the box. In other words, it’s already checking for three failures behind the scenes before it registers a failure. When a monitor runs and sees a failure, the series of events proceeds like this:

  1. The monitor runs a check and fails for the first time. At this point, nothing appears from a user-facing perspective. On our side, a “soft failure” is recorded.
  2. After the soft failure is recorded, two additional checks from the failed location are scheduled at the front of the queue. These checks run almost immediately and at nearly the same time.
  3. If the second check returns a failure, it is recorded on our end again and considered a second “soft failure”.
  4. If the third check also fails, then a user-facing failure is recorded and registered as an official failure.
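The sequence above can be modeled with a short Python sketch. This is only an illustration of the behavior described in this post, not New Relic's actual implementation: the function name `run_with_three_strikes`, its return fields, and the `check` callable are all hypothetical. It also makes two simplifying assumptions: the retries run one after another (the post says they run at nearly the same time), and a retry that succeeds means no user-facing failure is recorded.

```python
from typing import Callable, Dict, Union


def run_with_three_strikes(
    check: Callable[[], bool], max_attempts: int = 3
) -> Dict[str, Union[str, int]]:
    """Hypothetical model of the three-strike behavior described above.

    `check` returns True when the monitor check passes and False when it fails.
    Only the final, third consecutive failure becomes user-facing.
    """
    soft_failures = 0
    for attempt in range(1, max_attempts + 1):
        if check():
            # Assumption: a passing retry means no failure is shown to the user.
            return {"result": "SUCCESS", "attempts": attempt, "soft_failures": soft_failures}
        if attempt < max_attempts:
            # The first and second failures are recorded internally only.
            soft_failures += 1
    # The third failure is the one registered as an official failure.
    return {"result": "FAILED", "attempts": max_attempts, "soft_failures": soft_failures}


# A momentary hiccup: the first check fails, the retry passes.
hiccup = iter([False, True])
print(run_with_three_strikes(lambda: next(hiccup)))

# A sustained problem: all three checks fail.
outage = iter([False, False, False])
print(run_with_three_strikes(lambda: next(outage)))
```

In this model the hiccup never surfaces as a failure, while the outage surfaces as exactly one recorded failure backed by three failed checks.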

What does this mean in terms of real-world usability? A common use case between Synthetics and Alerts would be to set up a NRQL condition that monitors Synthetics checks and opens a violation if multiple failures are recorded. For example, the NRQL condition might use a query that looks like this:*

SELECT count(*) FROM SyntheticCheck WHERE result = 'SUCCESS' FACET monitorName

If you have a monitor that runs every minute and you want to see a violation open after three failures, then you might set the threshold here to open a violation if the query returns a value below 3 for at least 3 minutes. Easy enough, right? This may be exactly what you want, but some users might find it redundant: each recorded failure already represents three failed checks behind the scenes, so three recorded failures technically amount to 9 failed checks.
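The figure of 9 is simple multiplication, shown here as a tiny Python sketch. The function name `underlying_failed_checks` is hypothetical, and the default of three attempts per recorded failure is taken from the sequence described earlier in this post.

```python
def underlying_failed_checks(recorded_failures: int, attempts_per_failure: int = 3) -> int:
    """Number of individual failed checks behind a count of user-facing failures.

    Assumes every recorded failure went through the full three-strike
    sequence (first failure plus two failed retries).
    """
    return recorded_failures * attempts_per_failure


# Three recorded failures, each backed by three failed checks.
print(underlying_failed_checks(3))  # 9
```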

The ultimate takeaway here is that this three-strike system serves as a built-in safeguard against false positives and brief hiccups that might cause a monitor to incorrectly record a failure. A NRQL condition might still be your best bet if you’re looking to get alerted about sustained downtime, but that might be unnecessary if you’re simply trying to prevent false positives that could result from a single failed check.

*Curious about why you’d want to query for successes rather than failures with a use case like this? A fellow Relic explains this in detail here: Relic Solution: Creating Well-Behaved Faceted NRQL Alert Conditions
