If you’ve used Alerts before, you’re probably familiar with what a powerful tool it can be when you need to get notified about disruptions and other technical hiccups. If you fall into this category, you may also be familiar with the frustration associated with a violation that won’t close or a notification that never came through. Configuring Alert conditions to avoid these kinds of headaches can be a daunting task, but a lot of common issues can be avoided with a proper understanding of how Alerts evaluates data and what kinds of behaviors can cause things to go wrong. Here, we’ll break down the lifecycle of an incident and outline how evaluation in Alerts works so that unexpected frustrations can be avoided. Let’s dive in!
How we evaluate data
Alerts evaluates data in discrete, one-minute chunks. In other words, the evaluation system looks at one data point per minute and uses it to determine whether a condition threshold is being breached. This minute-by-minute evaluation is an integral part of Alerts, and keeping it in mind when you configure conditions can save a lot of time in the long term.
This timing structure is consistent across the board for the different types of Alert conditions. Assume you have a Host Not Reporting condition that is set to open a violation after 5 minutes. In this case, Alerts is looking for a piece of data from the New Relic Infrastructure agent once per minute. While the host is online and things are working normally, Alerts will check for the host’s entityId once per minute (more on that here). If the host goes down and takes its entityId with it, Alerts will continue to run these checks once per minute and will open a violation after evaluating 5 separate data points, one per minute.
This concept may seem simple enough, but it raises an important question. If Alerts is looking for a data point every minute, then what happens if there is no data? In that case, no evaluation happens for that minute. Perhaps more importantly, Alerts inserts a NULL value and discards any count that it was holding on to. When it comes to evaluation, the significance of this cannot be overstated. Imagine you have a for at least 5 minutes condition and you have four minutes of data reporting, but then something breaks and no data is coming in when the fifth check happens. At this point, Alerts will drop in a NULL value and start the evaluation over at the first minute in a new time window.
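The reset behavior described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not New Relic's actual implementation: the function name `minute_violation_opens`, its parameters, and the use of `None` to stand in for a NULL value are all assumptions.

```python
from typing import Optional, Sequence


def minute_violation_opens(
    samples: Sequence[Optional[float]],
    threshold: float,
    duration: int,
) -> Optional[int]:
    """Return the 1-based minute at which a violation would open, or None.

    Hypothetical model: one sample per minute, None represents a NULL
    (no data) minute, and the condition is "above threshold for at
    least `duration` minutes".
    """
    streak = 0  # consecutive breaching minutes counted so far
    for minute, value in enumerate(samples, start=1):
        if value is None:
            streak = 0  # NULL: the count held so far is discarded
        elif value > threshold:
            streak += 1
        else:
            streak = 0  # a non-breaching value also breaks the streak
        if streak >= duration:
            return minute
    return None


# Five breaching minutes in a row: the violation opens at minute 5.
print(minute_violation_opens([95, 95, 95, 95, 95], threshold=90, duration=5))

# Four breaching minutes, then a gap: counting starts over, so the
# violation does not open until minute 10.
gap = [95, 95, 95, 95, None, 95, 95, 95, 95, 95]
print(minute_violation_opens(gap, threshold=90, duration=5))
```

In this model a single missing minute delays the violation by a full window, which is the effect described above.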
A violation opens - now what?
Now that the violation has opened, Alerts will look at the condition’s policy to determine whether the violation will open an incident or “roll up” under an existing incident. Whether or not a new incident is opened depends on the policy’s incident preferences. In this case, assume an incident opens.
Understanding minute-by-minute evaluation and how this occurs when there is an absence of data is important because the evaluation system works on the same minute-by-minute basis regardless of whether a violation is being opened, evaluated, or closed. Imagine you have an Infrastructure condition that opens a violation because it targets a threshold where Disk Used % has a value above 90 for at least 5 minutes… At this point, the evaluation system is looking for 5 consecutive data points showing < 90% fullness across 5 minutes in order to automatically close.
This brings up another question - what happens if you remove the drive? Will the incident close?
In this case, it will never close automatically, because Alerts now sees NULL values every time it looks to see how full the disk is.
While the disk may not technically be “full” anymore, the evaluation system is never seeing the actual inverse of the threshold. In this case, the inverse is a numerical value under 90. In order to close the violation, Alerts wants to see 5 consecutive minutes’ worth of values between 0 and 90. If a NULL gets thrown into the mix, then everything is thrown out and the evaluation system starts back at the first numerical data point when considering whether to close the violation.
To illustrate this, assume that there is an open violation under this disk fullness condition and the evaluation system registers 81 for 4 consecutive minutes. What if it registers 56 as the next value? In that case, the violation will close. If the disk instead loses connection and the evaluation system sees a NULL value, then those four minutes’ worth of data are thrown out and the evaluation system starts back at the first minute. After this, a new set of 5 consecutive data values between 0 and 90 is needed in order to close the violation. You can read more about how Alerts violations close in our documentation here: How violations automatically close.
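The closing scenarios above can be modeled the same way. The sketch below is illustrative only and is not how Alerts is actually implemented: the function name `minute_violation_closes`, its default arguments, and the use of `None` for a NULL value are assumptions made for the example.

```python
from typing import Optional, Sequence


def minute_violation_closes(
    samples: Sequence[Optional[float]],
    threshold: float = 90.0,
    duration: int = 5,
) -> Optional[int]:
    """Return the 1-based minute at which an open violation would close.

    Hypothetical model: closing needs `duration` consecutive numeric
    values from 0 up to (but not including) `threshold`. A None (NULL)
    minute throws away the count collected so far.
    """
    streak = 0  # consecutive minutes showing the inverse of the threshold
    for minute, value in enumerate(samples, start=1):
        if value is not None and 0 <= value < threshold:
            streak += 1
        else:
            streak = 0  # NULL or a still-breaching value resets the count
        if streak >= duration:
            return minute
    return None


# 81 for four minutes, then 56: five good values, closes at minute 5.
print(minute_violation_closes([81, 81, 81, 81, 56]))

# 81 for four minutes, then a NULL: the count restarts, closes at minute 10.
print(minute_violation_closes([81, 81, 81, 81, None, 81, 81, 81, 81, 56]))

# Disk removed: only NULLs arrive, so the violation never closes on its own.
print(minute_violation_closes([None] * 30))
```

The last call returns `None` no matter how long the list of NULL minutes is, which matches the removed-drive case: with no numeric values under 90, there is nothing for the close evaluation to count.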
The intricacies of this evaluation system can cause issues, but understanding how evaluation works minute-by-minute and how it works when there is an absence of data can save you from a huge variety of pitfalls. Keep this idea in mind next time you get stuck with a violation that won’t close. Happy alerting!