Gap filling strategy for indefinite window gaps

Trying to understand the way the gap filling strategy is supposed to work. My understanding so far is that New Relic looks for at least one data sample in the configured aggregation window duration.

Now, let's take a scenario where I have a custom integration that sends either PASS or FAIL. We need to trigger an alert condition when we detect 4 or more FAILs within 10 minutes.

We check count(*) of samples that are reporting ‘FAIL’ and trigger an alert with threshold as greater than 3 and duration as 10 minutes.

Ideally, we have
0th minute: FAIL
2nd minute: FAIL
4th minute: FAIL
6th minute: FAIL
8th minute: FAIL

We would have an alert triggered successfully.

Now, let's consider a scenario where we miss out on some samples.
0th minute: FAIL
2nd minute: NULL
4th minute: NULL
6th minute: NULL
8th minute: FAIL

If we were to enable the gap filling functionality with the ‘previous value’ option, would all 3 gap windows be filled with ‘FAIL’, and would an alert be triggered?

Hi @kapai

If you haven’t seen it, I would highly recommend reading through this article. It goes into a fair amount of detail about how Gap Filling and Loss of Signal work.

There are a couple of ways you might approach this.

If you want to know how many failures you’ve had in 10 minutes, you could use a 10-minute Aggregation Window, which would simply combine all failures into the count when it runs an evaluation once every 10 minutes. However, that won’t combine results from neighboring buckets. That is, if you have 4 failures come in consecutively but they span two 10-minute aggregation windows, that would not be seen as a threshold breach. Because of this, I do not recommend this method.
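To illustrate the boundary problem, here is a minimal Python sketch. It is a simplified model, not New Relic's implementation: the function name `tumbling_counts` and the failure timestamps are made up for the example, and it assumes windows start at minute 0.

```python
from collections import Counter

def tumbling_counts(fail_minutes, window=10):
    """Count failures per fixed, non-overlapping window of `window` minutes."""
    counts = Counter(minute // window for minute in fail_minutes)
    # Key each bucket by the minute at which its window starts.
    return {bucket * window: n for bucket, n in sorted(counts.items())}

# Four consecutive failures, two minutes apart, straddling the 10-minute boundary.
fails = [6, 8, 10, 12]
per_window = tumbling_counts(fails)

# Threshold from the question: more than 3 failures.
breached = any(n > 3 for n in per_window.values())
print(per_window, breached)
```

Each window only ever sees 2 of the 4 failures, so the "greater than 3" threshold is never breached even though 4 failures arrived within 10 minutes.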

A different (and possibly better) way to work this would be to use Sum of query results in your threshold (see this documentation if you’re unfamiliar). This would keep a rolling sum of the results over the course of X minutes (where X is your threshold time window). So long as there is even one numeric value in the most recent 10-minute span, a numeric result will be carried forward. If that sum breaches your threshold even once, a violation is raised.
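As a rough sketch of the rolling-sum idea, assuming 2-minute aggregation windows and a 10-minute threshold duration (so 5 windows per sum). The helper `rolling_sums` is hypothetical and only models the arithmetic, not how the alerting pipeline actually evaluates data.

```python
def rolling_sums(window_counts, span=5):
    """Sum each window together with the (span - 1) windows before it."""
    sums = []
    for i in range(len(window_counts)):
        recent = window_counts[max(0, i - span + 1): i + 1]
        sums.append(sum(recent))
    return sums

# One count(*) result per 2-minute window, starting at minute 0.
# Failures arrive at minutes 6, 8, 10 and 12.
counts = [0, 0, 0, 1, 1, 1, 1, 0]
sums = rolling_sums(counts)

# Threshold from the question: sum greater than 3.
breached = any(s > 3 for s in sums)
print(sums, breached)
```

Because the sum slides along with each new window, the 4 failures are counted together no matter where they fall relative to the clock.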

We check count(*) of samples that are reporting ‘FAIL’ and trigger an alert with threshold as greater than 3 and duration as 10 minutes.

You would absolutely need to use Sum of query results for this to work, or use a 10-minute aggregation window, as every aggregation window is evaluated discretely and the results are not compared with neighboring windows by default. The example you provided would not result in a violation unless Sum of... were used (or a 10m aggregation window).

This makes Gap Filling redundant, although you would still want to use Loss of Signal to ensure that your violations close when everything settles back down to normal. This is because a string of NULL values will not close a violation unless you’ve configured Loss of Signal to recognize that string of NULL values and take action to close any open violations.

Gap Filling only works if there are 2 numeric data points separated by a gap of NULL values. Once the 2nd data point comes in, the intervening NULL values are retroactively replaced with whatever you’ve configured. This can result in a violation opening “late” – but really, it opens as soon as the system knows about it, given that NULL data points were retroactively filled with potentially threshold-breaching data.

Now, let's consider a scenario where we miss out on some samples.
0th minute: FAIL
2nd minute: NULL
4th minute: NULL
6th minute: NULL
8th minute: FAIL
If we were to enable the gap filling functionality with the ‘previous value’ option, would all 3 gap windows be filled with ‘FAIL’, and would an alert be triggered?

Yes, but only after the data for the 8th minute is evaluated, and only if a Sum of... threshold is being used (a 10-minute aggregation window, in this case, would see only 2 failures).
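Here is a small Python sketch of that scenario. It assumes 2-minute aggregation windows where each FAIL window yields a count of 1, and it models "fill with the previous value" with a hypothetical helper, `fill_gaps_last_value`. This is a simplification for illustration, not the actual streaming logic.

```python
def fill_gaps_last_value(values):
    """Fill runs of None that sit between two numeric points with the earlier point."""
    filled = list(values)
    last_idx = None  # index of the most recent numeric point
    for i, v in enumerate(values):
        if v is None:
            continue
        if last_idx is not None:
            # A second point has arrived: fill the gap behind it retroactively.
            for j in range(last_idx + 1, i):
                filled[j] = values[last_idx]
        last_idx = i
    return filled

def total(values):
    """Sum the numeric values, skipping windows that are still None."""
    return sum(v for v in values if v is not None)

# FAIL counts per 2-minute window at minutes 0, 2, 4, 6 (and then 8).
before_minute_8 = [1, None, None, None]
after_minute_8 = [1, None, None, None, 1]

sum_before = total(fill_gaps_last_value(before_minute_8))  # gap not closed yet
sum_after = total(fill_gaps_last_value(after_minute_8))    # gap filled with 1s
raw_count = total(after_minute_8)  # what a single 10-minute window would count
print(sum_before, sum_after, raw_count)
```

Until minute 8 reports, there is nothing to fill and the sum stays at 1. Once it does, the three NULL windows are filled and the sum over the 10 minutes becomes 5, which breaches a "greater than 3" threshold, whereas the raw count of samples is still only 2.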

I hope this helps!
