What's On Deck for Alerts: Sliding Windows Aggregation!

Hi folks!

As part of our new What’s On Deck? series, I want to share with you a feature we are actively developing and that will be coming to you soon. Follow the #whats-on-deck tag in general for new Alerts updates, and follow this thread specifically if you’d like to get updates on how close we are to release!

We will be releasing the ability to use Sliding Window Aggregation (SWA) in your NRQL alert conditions. Sliding Windows is something that is currently available in NRQL queries. I encourage you to read more on it in this documentation. We will be adding this functionality to Alerts soon, and I want to make sure you’re aware so that it won’t take you by surprise, and so that you can get excited for this new functionality.

As part of this improvement, we will also be increasing the maximum aggregation window size from 15 minutes to 120 minutes!

Why is this so cool?

  • It will allow for more consistent aggregation of erratic or volatile signals
  • More accurate and reliable alerting for infrequent or inconsistent signals
  • Ease of troubleshooting – you can duplicate sliding window behavior in ad-hoc NRQL queries
  • You can use aggregators other than sum

OK, how does it work?

The documentation on Sliding Windows in NRQL queries covers the basics, but I’ll quickly go over the formula we’ll be using (and that you can use too) to convert your “Sum of query results” alert conditions over to using SWA.

First of all, here’s the formula – you can reproduce this in NRQL for now, so I’m using TIMESERIES, which we do not normally allow in Alert conditions:

<your query> TIMESERIES <your threshold window> SLIDE BY <your aggregation window>

So if you have a threshold like “Sum of query results is over 100 at least once in 3 minutes”, your threshold window is 3 minutes. Let’s assume you have an aggregation window of 1 minute (the default). This would result in the last 3 minutes’ worth of data being aggregated each minute.
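To make the formula concrete, here is a small sketch that fills in the template for the 3-minute threshold window and 1-minute aggregation window described above. The helper name `with_sliding_window` and the sample query (`sum(duration)` on `Transaction`) are placeholders for illustration only; substitute your own query.

```python
def with_sliding_window(query: str, threshold_window: str, aggregation_window: str) -> str:
    """Fill in: <your query> TIMESERIES <threshold window> SLIDE BY <aggregation window>."""
    return f"{query} TIMESERIES {threshold_window} SLIDE BY {aggregation_window}"


# Placeholder query for illustration; use your own event type and attribute.
nrql = with_sliding_window(
    "SELECT sum(duration) FROM Transaction",
    threshold_window="3 minutes",
    aggregation_window="1 minute",
)
print(nrql)
```

The printed string can be pasted into an ad-hoc NRQL query to preview how the sliding window behaves.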

Here’s an example of what that would look like. Imagine each block is 1 aggregation window of data, and inside the block is the aggregated value for that window. I’m going to use sum for my aggregator, since that just makes things easier to think about.

  • On minutes 1 and 2, no evaluation would take place. That’s because a buffer is being filled, and we do not yet have 3 minutes’ worth of data.
  • On minute 3, we now have a full buffer of data (3 minutes’ worth), so we can aggregate the values. The evaluated value for minute 3 would be 6 (1 + 2 + 3)
  • On minute 4, the 3-minute window slides by 1 minute, and a value of 9 is evaluated (2 + 3 + 4)
  • On minute 5, the 3-minute window slides by 1 minute again, and a value of 12 is evaluated (3 + 4 + 5)
  • and so forth
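The walkthrough above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not how the alerts pipeline is actually implemented; the function name `sliding_sums` is made up for this example, and the per-minute values 1 through 5 are the ones used in the bullets.

```python
from collections import deque


def sliding_sums(values, window=3):
    """Yield (minute, sum) once the buffer holds `window` minutes of data."""
    buffer = deque(maxlen=window)  # oldest value drops out as the window slides
    for minute, value in enumerate(values, start=1):
        buffer.append(value)
        if len(buffer) == window:  # minutes 1 and 2 only fill the buffer
            yield minute, sum(buffer)


results = list(sliding_sums([1, 2, 3, 4, 5]))
print(results)  # [(3, 6), (4, 9), (5, 12)]
```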

Will I be able to use TIMESERIES and SLIDE BY in my query?

We do have plans to allow this, but for now you will be able to use sliders in the UI to control these values. Keep in mind that your aggregation window (used for SLIDE BY) needs to be smaller than your threshold window (used for TIMESERIES), and the threshold window should be evenly divisible by the aggregation window.

Which brings us to …

Ways a slide-by condition can be broken

We plan to disallow these cases, but I want to make sure you all understand the why.

1st way to break your slide-by: use the same SLIDE BY setting and threshold window

This is not a terrible way to break your condition, but it means you’re not really getting slide-by functionality.

If your SLIDE BY and threshold window are the same value, you wind up actually getting the traditional alerts behavior. That is, instead of getting a nice, incremental slide, like this

You wind up with a “cascading” aggregation, which looks more like this
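Here is a rough sketch of the difference, assuming windows are laid out from minute 1 onward. The helper `windows` is made up for illustration; it simply lists which minutes each evaluated window covers for a 3-minute window.

```python
def windows(total_minutes, window, slide):
    """Return inclusive (first_minute, last_minute) pairs for each evaluated window."""
    return [
        (start, start + window - 1)
        for start in range(1, total_minutes - window + 2, slide)
    ]


sliding = windows(9, window=3, slide=1)   # overlapping, incremental slide
tumbling = windows(9, window=3, slide=3)  # back-to-back, no overlap

print(sliding)   # [(1, 3), (2, 4), (3, 5), (4, 6), (5, 7), (6, 8), (7, 9)]
print(tumbling)  # [(1, 3), (4, 6), (7, 9)]
```

With the slide equal to the window, each minute lands in exactly one window, which is the traditional behavior rather than a slide.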

2nd way to break your slide-by: use a higher SLIDE BY setting than your threshold window

This is a pretty terrible way to break your condition, since you will wind up with gaps which are not evaluated.

Let’s say you had your threshold window set to 3 minutes, but your aggregation window set to 6 minutes. That would look like TIMESERIES 3 minutes SLIDE BY 6 minutes.

You would wind up with behavior like this
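As a sketch of that behavior, the snippet below lists the minutes that no window ever covers for TIMESERIES 3 minutes SLIDE BY 6 minutes. It assumes windows start at minute 1, and `unevaluated_minutes` is a made-up helper, not anything from the product.

```python
def unevaluated_minutes(total_minutes, window, slide):
    """Return the minutes that fall outside every evaluated window."""
    covered = set()
    for start in range(1, total_minutes - window + 2, slide):
        covered.update(range(start, start + window))
    return [m for m in range(1, total_minutes + 1) if m not in covered]


gaps = unevaluated_minutes(12, window=3, slide=6)
print(gaps)  # [4, 5, 6, 10, 11, 12]
```

Half of the data is never looked at, which is why this case is the worst of the three.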

3rd way to break your slide-by: use a SLIDE BY setting that does not divide evenly into your threshold window

This is a somewhat terrible way to break your condition. Your SLIDE BY setting is lower than your threshold window, but it will still leave a gap once in a while.

Here’s an example: imagine a SLIDE BY setting of 60 seconds, and a threshold window of 90 seconds. For the first minute, everything is good, but on every other minute, the SLIDE BY setting moves forward by half an aggregation window, which leaves half an aggregation window as a “gap” that does not get evaluated.

We will have validation in place to disallow these, but it’s important that you understand why.
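If you want to check your own settings ahead of time, the three cases can be expressed as a simple check. This is only a sketch of the rules described in this post, not the validation that will actually ship; `slide_by_problem` and its messages are invented for this example.

```python
def slide_by_problem(window_seconds: int, slide_by_seconds: int):
    """Return a description of the problem, or None if the settings look fine."""
    if slide_by_seconds == window_seconds:
        return "same value: no slide, just traditional back-to-back windows"
    if slide_by_seconds > window_seconds:
        return "SLIDE BY is larger than the window: gaps are never evaluated"
    if window_seconds % slide_by_seconds != 0:
        return "window is not evenly divisible by SLIDE BY: occasional gaps"
    return None


print(slide_by_problem(180, 180))  # 1st way
print(slide_by_problem(180, 360))  # 2nd way
print(slide_by_problem(90, 60))    # 3rd way
print(slide_by_problem(180, 60))   # None, a valid combination
```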

You may think that SWA sounds familiar. That’s because I included this feature in my big announcement, which I recommend checking out at this link if you haven’t already.

Sliding Windows Aggregation will be our replacement for the Sum of query results threshold type in NRQL alert conditions (documented here). This will be a gradual replacement, so you will still be able to use your Sum of query results thresholds for a time after SWA is released.

What’s wrong with “Sum of query results” thresholds?

In a nutshell, they will only sum data. They won’t give you the maximum value, the minimum value, or the average value. They will only add data points together and give you a sum over a rolling time window. While this is certainly useful for some use cases, there are many other cases where an average, min, max or some other aggregator is needed.

So … you’re saying Sliding Windows Aggregation will fix this?

Yes! When you use Sliding Windows Aggregation (SWA), you control how we aggregate the sliding window in your NRQL query when you use an aggregator function. If you use average, we will give you the average over the sliding window, instead of always the sum and only the sum.
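To see what a different aggregator does to the same sliding window, here is a small sketch that applies sum, average, and max to 3-minute windows over the per-minute values 1 through 5. The helper `sliding_aggregate` is invented for illustration and stands in for whatever aggregator function you use in your NRQL query.

```python
from statistics import mean


def sliding_aggregate(values, window, aggregator):
    """Apply `aggregator` to each full window as it slides one value at a time."""
    return [
        aggregator(values[end - window:end])
        for end in range(window, len(values) + 1)
    ]


data = [1, 2, 3, 4, 5]
sums = sliding_aggregate(data, 3, sum)
averages = sliding_aggregate(data, 3, mean)
maxima = sliding_aggregate(data, 3, max)

print(sums)      # [6, 9, 12]
print(averages)  # [2, 3, 4]
print(maxima)    # [3, 4, 5]
```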

Hey, this seems pretty cool! When is it available?

I can’t share an exact date with you yet, but I encourage you to try out the SLIDE BY function in ad-hoc NRQL queries to see how it works now!


Would it be correct to say that if we have a sum-of-query-results alert and it’s working exactly as we want, there’s no direct benefit to switching to SWA, but that it should also be a drop-in replacement?

(I do appreciate the extra flexibility SWA brings, and I’ll be checking existing uses of sum-of to see if they should be taking advantage of the new features!)

Hi @tmccormack

Would it be correct to say that if we have a sum-of-query-results alert and it’s working exactly as we want, there’s no direct benefit to switching to SWA, but that it should also be a drop-in replacement?

Yes, exactly. Keep in mind that “Sum of query results” alerts are deprecated and will be retired on 30 June 2022. There is an in-UI tool to convert them one by one, including a view into what would change in the Terraform script and what the NerdGraph mutation would look like for the new SWA condition. There is also a bulk conversion tool available.

Just before they are retired, we plan on converting all of the “Sum of” conditions that are left. However, we would prefer that you control the timing, since for the first X minutes (where X is your new aggregation window), the system will be building an aggregated view of the streaming data. During this period, no violations will fire. You can time this so that it has minimal effect on your business.
