Starting on Monday 4 October, we will be giving users two new streaming aggregation methods to select from. This will be rolled out fully as the week progresses. If you don’t see it yet on your account, it should show up by the end of the week. If you would like to read more about which use-cases work best with which method, head over to this link after reading this article.
A “streaming aggregation method” is the logic that tells us when we have all the data for a given aggregation window. Once the window is closed, the data is aggregated into a single point and evaluated against the threshold.
WHY: Data latency has caused many users to see inaccurate alert violations, since data can arrive too late to be evaluated. This change has two benefits:
- Allows users the flexibility to choose an aggregation method that matches their data’s behavior, resulting in fewer false alerts caused by latent data.
- Improves time-to-detection for incidents, allowing for faster time-to-resolution. (The old method required adding extra delay to deal with data latency, which slowed detection.)
WHAT: Two new aggregation methods, to exist alongside the classic aggregation method
- Cadence (the classic aggregation method)
- Event Flow (the new default aggregation method)
- Event Timer (the other new aggregation method)
Below, I’ll be explaining our new streaming aggregation methods and contrasting them with the one we’ve been using all along, so that users such as yourself can better understand what you’re selecting when creating a new NRQL alert condition.
IMPORTANT NOTE: Existing conditions will not be automatically changed. When you visit a condition that has not yet been adjusted to use the new settings, you will be prompted to choose from among the legacy option and the two new options.
Here are a few key concepts:
Timestamp: this is the timestamp that comes with a data point sent to New Relic.
Wall-clock time: the server clock time at New Relic.
Aggregation window: another name for the data bucket that is aggregated and evaluated; you control the size of this window using the Window duration setting.
Window duration: this is the size of the data buckets which are aggregated and evaluated during streaming alerts aggregation. Abbreviated as “WD.”
Delay/Timer: formerly known as “Evaluation Offset,” this setting serves different functions depending on the aggregation method in use.
In the following sections, when I mention that an aggregation window is “shipped,” I mean that all data points in it are aggregated down to a single data point, and that data point is evaluated against the alert threshold. Once an aggregation window is shipped, it can accept no further data points.
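To make “shipping” concrete, here is a tiny Python sketch. The `ship` helper, the average aggregation function, and the numbers are my own illustration, not New Relic’s implementation:

```python
def ship(points, threshold):
    """'Ship' a window: aggregate its data points down to a single
    value and evaluate that value against the alert threshold."""
    aggregated = sum(points) / len(points)  # assuming average aggregation
    return aggregated, aggregated > threshold

# Three points averaging to 3.0, which violates a threshold of 2.5.
print(ship([1.0, 3.0, 5.0], threshold=2.5))  # (3.0, True)
```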
Now, on to the methods:
Cadence: This is the method you’re used to. Each aggregation window waits exactly as long as the Delay setting, using wall-clock time as a timer. This can result in significant amounts of data being dropped for arriving too late to be evaluated, and dropped data can cause false alerts.
With WD set to 1m and Delay set to 3m: Each 1-minute aggregation window will wait for 3 full minutes beyond its “own” minute. The 9:55 window will ship the moment the wall-clock strikes 9:59.
With WD set to 1m and Delay set to 0: Each 1-minute aggregation window will ship as soon as its time is done. The 9:55 window will ship the moment the wall-clock strikes 9:56.
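The Cadence examples above can be sketched in a few lines of Python. These helper names and the seconds-based constants are hypothetical, chosen only to mirror the WD = 1m, Delay = 3m scenario:

```python
WINDOW = 60   # window duration (WD) in seconds
DELAY = 180   # Delay setting in seconds

def ship_time(window_start):
    """Wall-clock time at which a Cadence window ships:
    the end of the window plus the fixed Delay."""
    return window_start + WINDOW + DELAY

def accepts(window_start, point_ts, arrival_wall_clock):
    """A data point is kept only if it belongs to the window and
    arrives before the window ships; otherwise it is dropped."""
    in_window = window_start <= point_ts < window_start + WINDOW
    return in_window and arrival_wall_clock < ship_time(window_start)

# The 9:55 window (start = 9:55:00) ships at 9:59:00 wall-clock.
start = 9 * 3600 + 55 * 60
print(ship_time(start) == start + 240)          # True
# A 9:55:30 point arriving at 9:58 is kept; the same point
# arriving at 10:00 is dropped as too late.
print(accepts(start, start + 30, start + 180))  # True
print(accepts(start, start + 30, start + 300))  # False
```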
Event Flow: This is the new default streaming aggregation method.** It works best for sources that come in frequently and with low event spread.* Metrics with high throughput are a good example.
Each aggregation window will wait until it starts to see data arrive with timestamps later than the end of the window plus the Delay setting. As soon as that happens, it will ship. Note that this relies on the timestamps of arriving data; wall-clock time is no longer relevant. In other words, if data stops flowing, the system waits for more data before doing any aggregation.
With WD set to 1m and Delay set to 3m: Each 1-minute aggregation window will wait until timestamps start to arrive that are at least 4 minutes past the window’s start (the 1-minute window itself plus 3 minutes of delay). The 9:55 window will ship the moment the system starts to see timestamps arriving from 9:59 or later.
With WD set to 1m and Delay set to 0: Each 1-minute aggregation window will ship as soon as the system sees timestamps arriving outside of that aggregation window. The 9:55 window will ship the moment the system starts to see timestamps arriving from 9:56 or later.
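Here is a toy Python model of the Event Flow behavior. The class, the sum aggregation, and the second-based timestamps are assumptions for illustration; the key point is that shipping is driven entirely by arriving timestamps, never by the wall clock:

```python
from collections import defaultdict

WINDOW = 60   # window duration (WD) in seconds
DELAY = 180   # Delay setting in seconds

def window_start(ts):
    """Floor a timestamp to the start of its aggregation window."""
    return ts - ts % WINDOW

class EventFlowAggregator:
    """Toy model of Event Flow: a window ships when a data point
    arrives whose timestamp is past window_end + DELAY."""
    def __init__(self):
        self.windows = defaultdict(list)  # window start -> data points
        self.shipped = []                 # (window start, aggregated value)

    def receive(self, ts, value):
        self.windows[window_start(ts)].append(value)
        # Ship every open window whose end + DELAY has been passed by
        # this data point's timestamp (wall clock is irrelevant).
        for ws in sorted(self.windows):
            if ts >= ws + WINDOW + DELAY:
                self.shipped.append((ws, sum(self.windows.pop(ws))))

agg = EventFlowAggregator()
agg.receive(0, 1.0)    # lands in the [0, 60) window
agg.receive(30, 2.0)   # same window; nothing ships yet
agg.receive(240, 5.0)  # timestamp 240 >= 60 + 180, so [0, 60) ships
print(agg.shipped)     # [(0, 3.0)]
```

If data stops arriving, no timestamp ever passes the cutoff, so open windows simply wait, which matches the behavior described above.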
Event Timer: This method is best for data that arrives infrequently and potentially in batches. Cloud Integrations data is a good example, as are infrequent error logs.
Each aggregation window has a timer on it, set to the Delay setting (which, for this aggregation method, is called Timer in the UI). The Timer starts running as soon as the first data point shows up for that aggregation window (based on the data point’s timestamp). Every time another data point arrives for that window, the timer is reset. Once the timer reaches 0, the aggregation window is shipped.
Note that if later aggregation windows are ready to ship, but an earlier window’s timer is still running, the later windows will wait for the earlier window to finish and ship before they themselves are shipped. For example, if the 9:56 window’s timer reaches 0, but the 9:55 window’s timer hasn’t, the 9:56 window will wait until the 9:55 window’s timer also reaches 0, at which point the 9:55 window will ship, closely followed by the 9:56 window.
With WD set to 1m and Timer set to 3m: Each 1-minute aggregation window will wait 3 minutes after the last data point shows up for that window. Every new data point resets the timer back to 3 minutes. Once the timer reaches 0, the window will ship.
With WD set to 1m and Timer set to 0: Each 1-minute aggregation window will ship as soon as its first data point shows up.
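A minimal Python sketch of one Event Timer window, with hypothetical names and second-based times (it models only the per-window timer reset, not the ordering rule between windows):

```python
TIMER = 180  # the Delay setting, called "Timer" in the UI for this method

class EventTimerWindow:
    """Toy model of one Event Timer aggregation window: the timer
    restarts on every data point, and the window ships when it expires."""
    def __init__(self, start):
        self.start = start
        self.points = []
        self.expires_at = None  # wall-clock time at which the timer fires

    def receive(self, value, wall_clock):
        self.points.append(value)
        # Each arriving point resets the timer to the full TIMER length.
        self.expires_at = wall_clock + TIMER

    def ready(self, wall_clock):
        return self.expires_at is not None and wall_clock >= self.expires_at

w = EventTimerWindow(start=0)
w.receive(1.0, wall_clock=100)  # timer will fire at 280
print(w.ready(250))             # False: timer still running
w.receive(2.0, wall_clock=250)  # reset: timer now fires at 430
print(w.ready(280))             # False: the reset pushed it back
print(w.ready(430))             # True: the window can now ship
```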
For more information, see the official documentation at Streaming alerts: key terms and concepts | New Relic Documentation.
*“Low event spread” means data that comes in mostly in-order. For every 1-minute processing window, there should be minimal difference between the greatest and the least event time. An example of low event spread would be an application sending metrics steadily. At 12:00, there should be metrics coming in from 11:59 and maybe 11:58. An example of high event spread would be the same 12:00 window, but data is coming in from 11:34, 11:42, 11:51 and 11:58 – this would not work well with the Event Flow aggregation method.
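The spread calculation itself is simple. Here is a hypothetical helper applying it to the two examples above, with timestamps expressed as minutes since midnight for readability:

```python
def event_spread(timestamps):
    """Event spread for one processing window: the difference between
    the greatest and the least event timestamp seen in that window."""
    return max(timestamps) - min(timestamps)

# Low spread: steady metrics at 12:00 with timestamps from 11:58-11:59.
print(event_spread([718, 719, 719]))       # 1 (minute)
# High spread: the same 12:00 window receiving data stamped 11:34,
# 11:42, 11:51 and 11:58 - a poor fit for Event Flow.
print(event_spread([694, 702, 711, 718]))  # 24 (minutes)
```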
**We are changing the default selection to Event Flow because we feel this method is likely, in most situations, to result in fewer data points dropped due to latency, as well as decreased time-to-detect.