Relic Solution: Demystifying timestamp discrepancies with NRQL alert conditions

The scenario:

Customer: A violation was triggered at 10:36, but the incident and email notification were received at 10:38. Why did we have to wait 2 whole minutes before we were notified of the incident? What gives??

The answer:

New Relic: When a NRQL alert condition is created, tucked away at the bottom of the edit screen is Condition Settings > Advanced Settings > Evaluation offset, with a default setting of 3 minutes. So what exactly is an Evaluation offset?

As described in our NRQL Alert Conditions docs, the results of the query are evaluated every minute in one-minute windows. However, the start time of the query depends on the Evaluation offset, which in the default case mentioned above is 3 minutes. This means the incoming data is queried in a window running from 3 minutes ago to 2 minutes ago. It’s not queried in real time, but slightly in the past.
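To make that offset arithmetic concrete, here is a tiny sketch of how the queried window could be computed. This is a hypothetical helper for illustration only, not New Relic's actual code:

```python
from datetime import datetime, timedelta

def evaluation_window(now, offset_minutes=3):
    # The window Alerts queries at time `now` starts `offset_minutes`
    # in the past and spans one minute (illustrative sketch only).
    start = now - timedelta(minutes=offset_minutes)
    return start, start + timedelta(minutes=1)

# At 10:39 wall-clock time, the default 3-minute offset means the
# query covers the window from 10:36 to 10:37.
start, end = evaluation_window(datetime(2020, 1, 1, 10, 39))
```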

Let’s say the threshold configuration for this example alert condition is set to: query returns a value below X for at least 10 minutes. Alerts does its thing, happily querying incoming data every minute, and keeps its eyes peeled for any data that breaches an alert condition’s threshold. If the values being returned are below X, a degradation period is kicked off. The moment that alert condition’s threshold is breached, aka when the values have been below X for a full, consecutive 10 minutes, both a violation and an incident are opened at the very same time! In real time! But only the violation is backdated by 2 minutes; in other words, its timestamp sits 2 minutes behind real time. This represents that 3-minutes-ago-to-2-minutes-ago window mentioned above. What is perceived as a gap between violation and incident is actually not a gap.
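The degradation logic described above can be sketched as a simple scan over per-minute query results. This is a hypothetical model of the behavior, not New Relic's implementation; the function name and sampling shape are assumptions:

```python
def violation_minute(samples, threshold, duration):
    """Return the index of the evaluation minute at which a
    'value below threshold for at least duration minutes' condition
    would open a violation, or None if it never breaches."""
    consecutive = 0  # length of the current degradation period
    for minute, value in enumerate(samples):
        consecutive = consecutive + 1 if value < threshold else 0
        if consecutive >= duration:
            return minute  # violation and incident open here, in real time
    return None

# Values dip below X=2 starting at minute 2; with a 3-minute duration
# the threshold is breached at minute 4.
breach = violation_minute([5, 4, 1, 1, 1, 1], threshold=2, duration=3)
```

The violation's timestamp would then be backdated by the evaluation offset, while the incident keeps the wall-clock time of the breach.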

So why 3 minutes?

Since event types from agents are aggregated from different sources, anything less than 3 minutes may introduce false positives (alert notifications when you don’t need them! Egads!) and/or false negatives (you aren’t notified of incidents when you should be! Horrors!). For cloud data (e.g., from AWS integrations), this Evaluation offset should be extended even further than 3 minutes. I encourage you to check out this very cool article about the mind-bendy subject of data latency.

Continuing on, the violation is triggered (backdated with a timestamp of 10:36) and an incident is opened (with a real-time, wall-clock timestamp of 10:38). Alerts continues sampling the data at one-minute intervals. As soon as the data slips back above X, aka is in the green, a recovery period is kicked off. If the data maintains ‘green’ levels for 10 consecutive minutes (as per the alert condition), the violation and incident are closed at the same time, in real time. Again, there will be a difference between the timestamps of the violation (backdated to the beginning of the recovery period) and the incident (timestamped in real time, or wall-clock time). There’s way more detail about this in this other very cool article.
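The close-time backdating works out to simple subtraction. As a rough sketch under the assumptions above (a 10-minute recovery period, and ignoring the evaluation offset for simplicity):

```python
from datetime import datetime, timedelta

def violation_close_timestamp(incident_close, recovery_minutes):
    # The incident closes at wall-clock time; the violation's close
    # timestamp is backdated to the start of the recovery period.
    # Illustrative sketch only, not New Relic's actual calculation.
    return incident_close - timedelta(minutes=recovery_minutes)

# If the incident closes at 11:00 after a 10-minute recovery period,
# the violation's close timestamp reads 10:50.
closed_at = violation_close_timestamp(datetime(2020, 1, 1, 11, 0), 10)
```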

One last thing: if your alert condition configuration is at least once in instead of for at least, there will be no degradation period. At least once in means that as soon as a value is below X, a violation will open. And, of course, as soon as the value is back above X for the configured amount of time, the violation will close.
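The difference between the two threshold types can be sketched side by side. The mode strings and sampling model here are assumptions for illustration, not New Relic's API:

```python
def opens_violation(samples, threshold, duration, mode):
    """Contrast the two threshold types: 'at least once in' opens a
    violation on a single breaching minute (no degradation period),
    while 'for at least' requires an unbroken run of breaching minutes."""
    if mode == "at least once in":
        # One value below the threshold is enough.
        return any(value < threshold for value in samples)
    # "for at least": every minute in a duration-long run must breach.
    run = 0
    for value in samples:
        run = run + 1 if value < threshold else 0
        if run >= duration:
            return True
    return False

# A single dip below X=2 opens a violation under "at least once in",
# but not under "for at least" with a 3-minute duration.
once = opens_violation([5, 1, 5, 5], 2, 3, "at least once in")  # True
sustained = opens_violation([5, 1, 5, 5], 2, 3, "for at least")  # False
```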