Customer: A violation was triggered at 10:36, but the incident and email notification were received at 10:38. Why did we have to wait 2 whole minutes before we were notified of the incident? What gives??
New Relic: When a NRQL alert condition is created, tucked away at the bottom of the edit screen is Condition Settings > Advanced Settings > Evaluation offset with a default setting of 3 minutes. So what exactly is an Evaluation offset?
As described in our NRQL Alert Conditions docs, the results of the query are evaluated every minute in one-minute windows. However, the start time of each query depends on the Evaluation offset which, in the default case mentioned above, is 3 minutes. This really means that each evaluation queries the one-minute window from 3 minutes ago to 2 minutes ago. The data isn’t queried in real time, but a tiny bit in the past.
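To make the window arithmetic concrete, here is a minimal sketch in Python. It is not New Relic code; the function name `evaluation_window` and its parameters are made up for illustration, and it only assumes the behavior described above (a one-minute window that starts one offset ago).

```python
from datetime import datetime, timedelta

def evaluation_window(now, offset_minutes=3):
    """Return (start, end) of the one-minute window queried at `now`.

    Hypothetical helper: models the description above, where the window
    starts `offset_minutes` ago and lasts one minute.
    """
    start = now - timedelta(minutes=offset_minutes)
    end = start + timedelta(minutes=1)
    return start, end

# An evaluation that runs at 10:38 looks at data from 10:35 to 10:36.
start, end = evaluation_window(datetime(2020, 1, 1, 10, 38))
print(start.time(), end.time())  # 10:35:00 10:36:00
```

With the default offset of 3, the end of the window always sits 2 minutes behind the moment the evaluation runs.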
Alerts does its thing, happily querying incoming data every minute, and keeps its eyes peeled for any data that breaches an alert condition’s threshold. The moment that alert condition’s threshold is breached, both a violation and an incident are opened at the very same time! In real time! But. Only the violation is backdated by 2 minutes; in other words, its timestamp is 2 minutes behind realtime. This represents the 3-minutes-ago-to-2-minutes-ago window mentioned above. So the apparent gap between violation and incident isn’t a real delay between the two; it’s just the difference between a backdated timestamp and a realtime one.
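The two timestamps can be sketched as follows. This is an illustrative model only: `opening_timestamps` is a hypothetical name, and placing the violation timestamp at the end of the evaluated window (offset minus one minute behind realtime) is an assumption chosen to match the 2-minute backdating described above.

```python
from datetime import datetime, timedelta

def opening_timestamps(detected_at, offset_minutes=3):
    """Model the timestamps recorded when a threshold breach is detected.

    Assumption: the violation is stamped at the end of the evaluated
    window, which is (offset - 1) minutes behind the detection time.
    The incident is stamped with the detection time itself.
    """
    violation_ts = detected_at - timedelta(minutes=offset_minutes - 1)
    return {"violation": violation_ts, "incident": detected_at}

stamps = opening_timestamps(datetime(2020, 1, 1, 10, 38))
print(stamps["violation"].strftime("%H:%M"), stamps["incident"].strftime("%H:%M"))  # 10:36 10:38
```

Both records are created in the same instant; only the violation’s label points back at the window the data came from.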
So why 3 minutes?
Since event types from agents are aggregated from different sources, and that data doesn’t all arrive at the same moment, anything less than 3 minutes will introduce false positives (alert notifications when you don’t need them! Egads!) and/or false negatives (you aren’t being notified of incidents when you should be! Horrors!). For Cloud data (e.g., from AWS integrations) this Evaluation offset should be increased even further beyond 3 minutes. I encourage you to check out this very cool article about the mind-bendy subject of data latency.
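A toy model shows why a short offset can miss data. Everything here is hypothetical: the function `visible_count`, the event tuples, and the arrival delays are invented for illustration, and times are plain minute values rather than real timestamps.

```python
def visible_count(events, now, offset):
    """Count events in the evaluated window that have arrived by `now`.

    `events` is a list of (event_time, arrival_time) pairs in minutes.
    The window is [now - offset, now - offset + 1), as described above.
    """
    start = now - offset
    end = start + 1
    return sum(
        1
        for event_time, arrival_time in events
        if start <= event_time < end and arrival_time <= now
    )

# Three events happen during minute 7, but two of them arrive late.
events = [(7.2, 7.5), (7.6, 9.1), (7.9, 9.4)]

print(visible_count(events, now=8, offset=1))   # 1 (two events not yet arrived)
print(visible_count(events, now=10, offset=3))  # 3 (all events have arrived)
```

With a 1-minute offset the same window looks nearly empty, which could trip a "below X" threshold for no good reason; waiting 3 minutes gives the late data time to land.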
Continuing on, let’s say that this Alerts condition opens a violation when the query value stays below X for at least 10 minutes. The violation is triggered (backdated with a timestamp of 10:36) and an incident is opened (with a realtime timestamp of 10:38). Alerts continues sampling the data in one-minute intervals. As soon as the data rises back above X, aka is in the green, a recovery period is kicked off. If the data has maintained ‘green’ levels for 10 minutes (as per the alert condition), the violation and incident are closed at the same time in real time. Again, there will be a difference between the timestamps of the violation (backdated to the beginning of the recovery period) and the incident (timestamped in real time).
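The open-and-recover cycle can be sketched as a small state machine. This is a simplified model under stated assumptions, not how Alerts is implemented: `simulate` is a hypothetical name, each list item stands for one one-minute evaluation, and the backdating of timestamps is deliberately left out.

```python
def simulate(values, threshold, duration=10):
    """Return (minute_index, event) pairs for 'open' and 'close' events.

    Assumed rules: a violation opens after `duration` consecutive
    minutes below `threshold`, and closes after `duration` consecutive
    minutes at or above it.
    """
    events = []
    is_open = False
    streak = 0  # consecutive minutes pointing toward a state change
    for minute, value in enumerate(values):
        breaching = value < threshold
        # While closed, count breaching minutes; while open, count green ones.
        if breaching != is_open:
            streak += 1
        else:
            streak = 0
        if streak == duration:
            is_open = not is_open
            events.append((minute, "open" if is_open else "close"))
            streak = 0
    return events

# 3 green minutes, 10 minutes below the threshold, then 10 green minutes.
values = [5] * 3 + [1] * 10 + [5] * 10
print(simulate(values, threshold=3))  # [(12, 'open'), (22, 'close')]
```

Note that a single breaching minute during recovery resets the streak, so the 10 green minutes have to be consecutive in this model.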
Check out this fantastic, in-depth article on alert incident timelines for more detail on what happens when a violation closes.
Customer: Hey, thanks for that explanation! It makes sense now!