New aggregation methods for NRQL alert conditions

Starting on Monday, 4 October, we will be giving users two new streaming aggregation methods to select from. The rollout will complete as the week progresses, so if you don’t see the new options on your account yet, they should show up by the end of the week. If you would like to read more about which use cases work best with which method, head over to this link after reading this article.

A “streaming aggregation method” is the logic that tells us when we have all the data for a given aggregation window. Once the window is closed, the data is aggregated into a single point and evaluated against the threshold.

WHY: Data latency has caused many users to see inaccurate alert violations, because data can arrive too late to be evaluated. This change has several benefits:

  • It gives users the flexibility to choose an aggregation method that matches their data’s behavior.
  • It produces fewer false alerts caused by latent data.
  • It improves time-to-detection for incidents, allowing for faster time-to-resolution.
  • It removes the old method’s need to pad the Delay setting to cope with data latency, which added to time-to-resolution.

WHAT: Two new aggregation methods, which will exist alongside the classic aggregation method:

  • Cadence (the classic aggregation method)
  • Event Flow (the new default aggregation method)
  • Event Timer (the other new aggregation method)

Below, I’ll explain our new streaming aggregation methods and contrast them with the one we’ve been using all along, so that you can better understand what you’re selecting when creating a new NRQL alert condition.


IMPORTANT NOTE: Existing conditions will not be automatically changed. When you visit a condition that has not yet been adjusted to use the new settings, you will be prompted to choose among the legacy option and the two new options.


Here are a few key concepts:

Timestamp: this is the timestamp that comes with a data point sent to New Relic.

Wall-clock time: the server clock time at New Relic.

Aggregation window: another name for the data bucket that is aggregated and evaluated; you control the size of this window using the Window duration setting.

Window duration: this is the size of the data buckets which are aggregated and evaluated during streaming alerts aggregation. Abbreviated as “WD.”

Delay/Timer: formerly known as “Evaluation Offset,” this setting serves a couple of different functions, depending on which method is used.

In the following sections, when I mention that an aggregation window is “shipped,” I mean that all data points in it are aggregated down to a single data point and that data point is evaluated against the alert threshold. Once an aggregation window has shipped, it can accept no further data points.
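To make that concrete, here’s a tiny Python sketch (my own illustration, not New Relic’s actual implementation) of what shipping a window boils down to:

```python
# Minimal sketch of "shipping" a window: collapse its data points into one
# aggregated value and evaluate that value against the alert threshold.
def ship_window(points, threshold, aggregate=sum):
    value = aggregate(points)        # e.g. sum(), max(), or an average function
    breached = value > threshold     # in a real condition this would open a violation
    return value, breached

print(ship_window([3, 5, 4], threshold=10))  # (12, True)
```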

Now, on to the methods:

Cadence



This is the method you’re used to. Each aggregation window waits exactly as long as the Delay setting, using the wall-clock time as a timer. This can result in significant amounts of data being dropped because it arrives too late to be evaluated, and dropped data can cause false alerts.

With WD set to 1m and Delay set to 3m: Each 1-minute aggregation window will wait for 3 full minutes beyond its “own” minute. The 9:55 window will ship the moment the wall-clock strikes 9:59.

With WD set to 1m and Delay set to 0: Each 1-minute aggregation window will ship as soon as its time is done. The 9:55 window will ship the moment the wall-clock strikes 9:56.
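If it helps to see this as code, here’s a rough Python sketch of the Cadence behavior described above (an illustration only, not actual New Relic code):

```python
from datetime import datetime, timedelta

# Cadence: a window ships purely on wall-clock time (window end + Delay),
# whether or not its data has actually arrived yet.
def cadence_ship_time(window_start, window_duration, delay):
    return window_start + window_duration + delay

# The 9:55 window with WD=1m and Delay=3m ships at 9:59 wall-clock time;
# data for 9:55 that arrives after 9:59 is dropped.
print(cadence_ship_time(datetime(2021, 10, 4, 9, 55),
                        timedelta(minutes=1), timedelta(minutes=3)))
# 2021-10-04 09:59:00
```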

Event Flow



This method works best for data that arrives frequently and with low event spread.* High-throughput metrics are a good example. This is the new default streaming aggregation method.**

Each aggregation window will wait until it starts to see arriving timestamps that are later than the window’s end plus its Delay setting. As soon as that happens, it will ship. Note that this relies on the timestamps of arriving data; the wall-clock time is no longer relevant. In other words, if data stops flowing, the system waits for more data before doing any aggregation.

With WD set to 1m and Delay set to 3m: Each 1-minute aggregation window will wait until timestamps start to arrive that are 4 minutes out (3 minutes’ worth of delay past the single minute aggregation window). The 9:55 window will ship the moment the system starts to see timestamps arriving from 9:59 or later.

With WD set to 1m and Delay set to 0: Each 1-minute aggregation window will ship as soon as the system sees timestamps arriving outside of that aggregation window. The 9:55 window will ship the moment the system starts to see timestamps arriving from 9:56 or later.
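The same idea as a rough Python sketch (again, just an illustration of the behavior described above):

```python
from datetime import datetime, timedelta

# Event Flow: a window ships once incoming data carries timestamps at or past
# the window's end plus the Delay setting. Wall-clock time is irrelevant.
def event_flow_should_ship(window_start, window_duration, delay, incoming_ts):
    watermark = window_start + window_duration + delay
    return incoming_ts >= watermark

w = datetime(2021, 10, 4, 9, 55)  # the 9:55 window, WD=1m, Delay=3m
print(event_flow_should_ship(w, timedelta(minutes=1), timedelta(minutes=3),
                             datetime(2021, 10, 4, 9, 58)))  # False -- keep waiting
print(event_flow_should_ship(w, timedelta(minutes=1), timedelta(minutes=3),
                             datetime(2021, 10, 4, 9, 59)))  # True  -- ship it
```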

Event Timer



This method is best for data that comes infrequently and potentially in batches. Cloud Integrations data is a good example of this, as are infrequent error logs.

Each aggregation window has a timer on it, set to the Delay setting (which, for this aggregation method, is called Timer in the UI). The Timer starts running as soon as the first data point shows up for that aggregation window (based on the data point’s timestamp). Every time another data point arrives for that window, the timer is reset. Once the timer reaches 0, the aggregation window is shipped.

Note that if later aggregation windows are ready to ship, but an earlier window’s timer is still running, the later windows will wait for the earlier window to finish and ship before they themselves are shipped. For example, if the 9:56 window’s timer reaches 0, but the 9:55 window’s timer hasn’t, the 9:56 window will wait until the 9:55 window’s timer also reaches 0, at which point the 9:55 window will ship, closely followed by the 9:56 window.

With WD set to 1m and Delay set to 3m: Each 1-minute aggregation window will wait 3 minutes after the last data point shows up for that window. Every new data point showing up will reset the timer back to 3 minutes. Once the timer reaches 0, the window will ship.

With WD set to 1m and Delay set to 0: Each 1-minute aggregation window will ship as soon as the first data point shows up.
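And a rough Python sketch of the Event Timer behavior for a single window (my own illustration; the cross-window ordering described earlier isn’t modelled here):

```python
# Event Timer: the timer starts with the first data point that lands in the
# window and resets on every subsequent point; the window ships once the timer
# runs out, i.e. Timer seconds after the last point arrived.
def event_timer_ship_time(arrival_seconds, timer_seconds):
    if not arrival_seconds:
        return None                        # no data yet, nothing to ship
    return max(arrival_seconds) + timer_seconds

# Points for one window arrive 10s, 25s, and 70s into the hour, with a 3-minute timer:
print(event_timer_ship_time([10, 25, 70], timer_seconds=180))  # 250
```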

For more information, see the official documentation at Streaming alerts: key terms and concepts | New Relic Documentation.


*“Low event spread” means data that comes in mostly in-order. For every 1-minute processing window, there should be minimal difference between the greatest and the least event time. An example of low event spread would be an application sending metrics steadily. At 12:00, there should be metrics coming in from 11:59 and maybe 11:58. An example of high event spread would be the same 12:00 window, but data is coming in from 11:34, 11:42, 11:51 and 11:58 – this would not work well with the Event Flow aggregation method.

**We are changing the default selection to Event Flow because, in most situations, we expect this aggregation method to result in fewer data points dropped due to latency, as well as decreased time-to-detect.
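If you want a quick way to gauge your own data’s event spread before picking a method, something like this rough check can help (my own heuristic, not an official metric):

```python
# Event spread for one processing window: the gap between the newest and oldest
# event timestamps (epoch seconds) received during that window.
def event_spread_seconds(event_timestamps):
    return max(event_timestamps) - min(event_timestamps)

print(event_spread_seconds([1633341540, 1633341555, 1633341590]))  # 50   -> low spread, Event Flow friendly
print(event_spread_seconds([1633340040, 1633340940, 1633341540]))  # 1500 -> high spread, consider Event Timer
```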

7 Likes

Are there limits to these settings? For instance, is there a maximum timer for the Event Timer setting?

2 Likes

Thanks for the detailed breakdown, @Fidelicatessen, much appreciated. I’m surprised this wasn’t included in the roadmap announcement a few weeks back. How, if at all, are these related to the streaming alerts announcement late last year?

Given that alerts are a core part of any monitoring tool, I take it these developments are integrated consistently throughout New Relic. Could you confirm that these alerting methods are implemented by NR’s Terraform provider, since large-scale, enterprise-level configuration of alerts is usually managed via Terraform? Since this is live from Oct 4th, is Terraform support expected by the end of the week, Oct 8th?

With regards to the 4 golden signals for SRE, would you mind helping me understand which method is best for each signal:

  • Latency: Event flow?
  • Traffic: Event flow?
  • Errors: Event timer?
  • Saturation: Event flow?

Do I have the above right? Are there situations where “cadence” would make more sense than the newer methods?

Thanks for your time.

1 Like

Hello @Fidelicatessen, thanks for the explanation.
Do you know when this feature will be available for EU accounts?

Hi folks, thanks for your questions!



@dkoyano

There is a minimum setting of 5 seconds for Timer, 0 seconds for Delay, and the maximum setting for Aggregation Windows is 15 minutes (minimum setting is 30 seconds). As a side note, we do plan on increasing the maximum by the end of the year.
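For anyone scripting their condition setup, those limits translate into a quick sanity check along these lines (values copied from this post; do check the docs for current limits, since the maximum will be increasing):

```python
# Limits as stated above: Timer >= 5s, Delay >= 0s, aggregation window 30s to 15m.
LIMITS = {
    "timer_seconds":  (5, None),
    "delay_seconds":  (0, None),
    "window_seconds": (30, 15 * 60),
}

def within_limits(setting, value):
    low, high = LIMITS[setting]
    return value >= low and (high is None or value <= high)

print(within_limits("window_seconds", 3600))  # False -- over the 15-minute maximum
print(within_limits("timer_seconds", 60))     # True
```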



@Rishav

How, if at all, are these related to the streaming alerts announcement late last year?

Moving everyone to the new streaming platform allowed us to deliver this final piece, which is the capstone of the improvements we had planned.


Given that alerts are a core part of any monitoring tool, I take it these developments are integrated consistently throughout New Relic. Could you confirm that these alerting methods are implemented by NR’s Terraform provider, since large-scale, enterprise-level configuration of alerts is usually managed via Terraform? Since this is live from Oct 4th, is Terraform support expected by the end of the week, Oct 8th?

Our developers have been working on this, and the code is complete but waiting to be merged. We hope to have it out by the end of the week.


With regards to the 4 golden signals for SRE, would you mind helping me understand which method is best for each signal:

  • Latency: Event flow?
  • Traffic: Event flow?
  • Errors: Event timer?
  • Saturation: Event flow?

The best way to determine which method to use is to ask yourself if your data comes in at a regular interval without too much spread between timestamps. If this is the case, Event Flow is the best method – your data will not be aggregated until data starts coming in from subsequent aggregation windows.

If your data comes in inconsistently or with a large timestamp spread, Event Timer is a better bet, since the “timer” in this case works independently of data coming in to subsequent aggregation windows.
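As a rule of thumb, you could boil that guidance down to something like this (my own heuristic with made-up thresholds, just to make the decision concrete):

```python
# Steady, mostly in-order data -> Event Flow; sparse or widely spread data -> Event Timer.
def suggest_method(points_per_window, spread_seconds, window_seconds=60):
    if points_per_window >= 1 and spread_seconds <= window_seconds:
        return "Event Flow"
    return "Event Timer"

print(suggest_method(points_per_window=30, spread_seconds=20))    # Event Flow
print(suggest_method(points_per_window=0.2, spread_seconds=600))  # Event Timer
```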


Are there situations where “cadence” would make more sense than the newer methods?

Although we have not yet found any use case where the “cadence” method is preferable, we invite you to share any such use cases you find.



@dario.mango

EU accounts will have this feature turned on this week, along with our customers in the US data center.

2 Likes

Thanks for the comprehensive response, @Fidelicatessen, the clarification is greatly appreciated. And it’s terrific to see the Terraform implementation go live today as well, with v2.27.1’s release covering NRQL alert conditions (documentation link). Pleasantly surprised by such a cohesive release execution!

Given that the error rate is a discrete value that can be near zero for extended periods of time before an incident causing failing requests occurs, it seems like Event Timer might be the best bet, while the rest of the signals tend to flow more smoothly from the last detected value, making Event Flow more appropriate. I think. Please let me know if I’ve got the right understanding.

Looking forward to rolling out this update via Terraform and getting a better idea of its impact in the not-so-distant future. Thanks once again.

2 Likes

Hi @Rishav

Pleasantly surprised by such a cohesive release execution!

Thank you! Our engineering teams worked hard to get the UI, API and Terraform components all released at once.

Given that the error rate is a discrete value that can be near zero for extended periods of time before an incident causing failing requests occurs, it seems like Event Timer might be the best bet, while the rest of the signals tend to flow more smoothly from the last detected value, making Event Flow more appropriate. I think. Please let me know if I’ve got the right understanding.

The biggest difference between Event Flow and Event Timer is that, in order to move the system forward, the former requires another data point to come into a subsequent aggregation window, while the latter does not. If you have data coming in every minute (or whatever you’ve set the aggregation window at), Event Flow should work great. However, if your data comes in inconsistently, sometimes skipping entire aggregation windows, Event Timer will work better, since each window is aggregated based on a discrete timer.

If your error rate is producing data points every minute (or, again, whatever you have aggregation window set to), Event Flow should work great. If you only get a data point every once in a while, though, I would recommend Event Timer.

Hopefully this makes sense – let me know if it doesn’t!

2 Likes

Thanks @Fidelicatessen, I really appreciate you taking the time to clarify. It seems like the default Event Flow might be the best option to switch all existing alerting to and then take it from there.

Can’t wait until these NRQL alerts:

  • Update the traffic-light signal for each entity’s health indicator, like APM/Infra/Synth ones.
  • Overlay their violation regions on dashboards, like APM charts.

Hope to see this pace of delivery progress maintained! :crossed_fingers:

Hi @Rishav!

I just wanted to share some good news with you…

We’re rolling this out gradually right now. I’m going to run a test on my own account right after I eat lunch, but it should be working perfectly. Your account should be enabled by next week, if it isn’t already!

1 Like

That’s excellent! Thanks for the heads-up, looking forward to it!

2 Likes

Did you mean “aggregation” here?

@aurelien.lair

That all depends on the context.