End of life for "Sum of query results" thresholds

UPDATE! “Sum of query results” bulk conversion tool is now available! See this post.

Hello folks!

Recently, we released Sliding Windows Aggregation (SWA) alert conditions. You can read my announcement about that at this link. If you haven’t yet tried them out, I encourage you to do so! I think they’re really cool 🙂

SWA is a great replacement for “Sum of query results” (or “Sum of”) thresholds. It’s more versatile and it allows you more control. With SWA, you control the size of the aggregation window, the “slide by interval,” and the threshold duration. In addition, with SWA, you can use other aggregation functions besides sum (average, min and max, for starters).

As mentioned in this post from last October, we’ll be EOLing “Sum of” alert conditions in favor of SWA on 30 June 2022. If you would like to convert your “Sum of” conditions manually to SWA, this article goes into detail on how to do that.

If you would rather not do this manually, we will have a conversion helper tool in the UI, available in every “sum of” alert condition, that will allow conditions to be converted one-by-one. This should be available very soon, although it may already be in the UI as you read this! Starting in early March (no later than 3 March) we will disable the creation of new “Sum of” alert conditions via the UI. You will still be able to create new ones via the API, only so we don’t inadvertently break your automation. On 30 June, we will turn the API functionality off as well.

In addition, we will make available a bulk edit tool which will list out all of your “sum of” conditions and allow you to convert them one at a time, all at once, or anything in between.

4 Likes

Thank you. Is there a NerdGraph query (or other solution) for finding all affected alerts?

1 Like

Hi @rraposa

There is no NerdGraph query for this per se. You would need to use a script to pick out all the conditions with a value function of SUM.
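As a rough sketch of the filtering step in such a script, assuming you have already fetched your NRQL conditions as a list of dictionaries, it could look like this in Python. The `value_function` key and its `sum` value follow the field named above, but the exact payload shape is an assumption, and `find_sum_of_conditions` is a hypothetical helper name.

```python
def find_sum_of_conditions(conditions):
    """Return the conditions whose value function is SUM (case-insensitive)."""
    return [
        c for c in conditions
        if str(c.get("value_function", "")).lower() == "sum"
    ]


# Made-up sample data, only to show the shape this sketch expects.
sample = [
    {"id": 1, "name": "Ping failures", "value_function": "sum"},
    {"id": 2, "name": "Error rate", "value_function": "single_value"},
    {"id": 3, "name": "No value function set"},
]

for condition in find_sum_of_conditions(sample):
    print(condition["id"], condition["name"])
```

Fetching the conditions themselves is left out on purpose, since that part depends on which API and credentials you use.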

That said, we are hard at work right now on a tool that will list out all of your “sum of” alert conditions and allow you to convert them one at a time or all at once. The current build of this tool will just list out your “sum of” conditions and provide links to them, but won’t do the conversion yet – that will come later. However, with the list you could use the per-condition converter that is already in the UI or you could use the manual steps listed above.

We’re working on getting the current build of this tool in the New Relic app catalog right now.

1 Like

Hi @Fidelicatessen,
Where can we find the tool/when will the tool be available?

Hi @stijndd

You will find it in the New Relic App Catalog.

Hi @Fidelicatessen,

When will this be available?

Hi @stijndd

We are fixing some final bugs in it, and we hope to have it available by the end of this week. If that fails, we are holding ourselves to a hard deadline of March 31.

Thanks for being patient!

Sorry to be a bother. Is this now available? Thank you again for all the updates!

1 Like

@Justin_Page

Yes, it’s available now! Take a look at this announcement.

2 Likes

If we want to alert on 3 failures of a particular synthetic monitor, how would we do that without “Sum of query results”? This is our query:
SELECT count(*) FROM SyntheticCheck WHERE (monitorId = 'ea34badf-2875-4140-bd78-f090f1c2f861' OR monitorId = 'b3e00bd6-6bdf-4257-83a3-81b332900c59') AND result = 'FAILED' FACET monitorName

1 Like

@Srishti.Singla

If we want to alert on 3 failures of a particular synthetic monitor, how would we do that without “Sum of query results”?

Use Sliding Windows Aggregation (SWA). It does all the same things “Sum of query results” did, but gives you more control over aggregation method and timing.

For your use-case, I would recommend using the frequency of your Synthetic monitor as the window duration. So for example, if frequency is 5 minutes, use a window duration of 5 minutes and a slide-by interval of 1 minute. Your count will be combined over a rolling 5 minutes, and you can set your threshold to open a violation when you have a result of 3 or more.
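To make the rolling behavior concrete, here is a small Python simulation of a 5-minute window that slides forward by 1 minute. It is only a sketch of the idea, not how the alerting pipeline is implemented. The failure timestamps are made up, and `sliding_counts` and `first_breach` are hypothetical helper names.

```python
def sliding_counts(failure_minutes, window, slide, end):
    """Count failures in each window of `window` minutes, moving forward by `slide`.

    Returns a list of (window_end, count) pairs. Each window covers the
    half-open range (window_end - window, window_end].
    """
    results = []
    t = window
    while t <= end:
        count = sum(1 for m in failure_minutes if t - window < m <= t)
        results.append((t, count))
        t += slide
    return results


def first_breach(counts, threshold):
    """Return the end time of the first window whose count reaches the threshold."""
    for window_end, count in counts:
        if count >= threshold:
            return window_end
    return None


# Hypothetical failure timestamps, in minutes since some start time.
failures = [3, 6, 7]
counts = sliding_counts(failures, window=5, slide=1, end=10)
print(counts)
print(first_breach(counts, threshold=3))
```

With these sample timestamps, the window ending at minute 7 is the first to contain all three failures. Two back-to-back, non-overlapping 5-minute windows (minutes 1 to 5 and 6 to 10) would see only 1 and 2 failures, which is why the slide-by interval matters.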

Please do take a look at the links I provided at the top of this thread for more details about SWA.

2 Likes

Thanks, @Fidelicatessen, I have the same use-case as @Srishti.Singla.

To expand further, my ping monitors have a frequency of 1 minute and I’d like to trigger an alert if any of them fail 5 times or more (ignoring synthetic multilocation alerts for the moment). Is the following NRQL appropriate (using count instead of sum)?

  FROM SyntheticCheck
SELECT count(result)
 FACET monitorName
 WHERE result = 'FAILED'
   AND monitorName LIKE '%Ping%'

How should the following fields be configured in order to fulfil the given use-case which is currently handled by “sum of query”?

Hi @rishav.dhar, nice to see you again.

my ping monitors have a frequency of 1 minute and I’d like to trigger an alert if any of them fail 5 times or more

5 times or more in a single minute? 5 times or more in 5 minutes? 10 minutes? Let’s call this value Y.

With a Window Duration of 1 minute and a slide-by interval of 30 seconds, you are aggregating values over 60 seconds, and sliding forward every 30 seconds.

I would recommend changing the window duration to Y. This will aggregate the count of failures over Y minutes. You can then use some factor of that to be your slide-by interval, and you would need to determine how many of those intervals would need to pass with the count being 5 or more.

For example, Window duration of 5 minutes, slide-by interval of 1 minute, and a threshold of “for at least 1 minute” or “at least once in 1 minute” would mean that as soon as 5 failures are seen in a rolling 5 minute window, the alert would fire.

This is only an example, though, and really depends on what you need for your use-case. Just remember: window duration determines the period over which the data will be aggregated (aggregation window), slide-by interval determines how often that value is refreshed (or how often the window moves forward), and your threshold duration simply determines how long you want to wait for the condition to fire (if you’re using “for at least”) and how long you want to wait for any open violation to clear (if you’re using “at least once in” or “for at least”).
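The difference between the two threshold modes can be sketched in Python as follows. This is a simplified model under my own assumptions, not the actual evaluation logic: each value in the list stands for one aggregated result per slide-by interval, and `for_at_least` and `at_least_once_in` are hypothetical function names.

```python
def for_at_least(values, threshold, intervals):
    """True if every one of the last `intervals` values reaches the threshold.

    Assumes intervals >= 1.
    """
    recent = values[-intervals:]
    return len(recent) == intervals and all(v >= threshold for v in recent)


def at_least_once_in(values, threshold, intervals):
    """True if any of the last `intervals` values reaches the threshold.

    Assumes intervals >= 1.
    """
    return any(v >= threshold for v in values[-intervals:])


# Made-up aggregated counts, one per slide-by interval, oldest first.
counts = [2, 5, 4]
print(for_at_least(counts, threshold=5, intervals=3))
print(at_least_once_in(counts, threshold=5, intervals=3))
```

With this sample data, only one of the three values reaches 5, so the "for at least" check is not satisfied while the "at least once in" check is.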

I hope this helps!

2 Likes

Hello. During our demo of this tool, I noticed that we do not see why an alert does not migrate, just a list of policies that failed. It would be helpful to know why it did not migrate. We have tried to migrate one policy to test the feature (so I know it is not “time” or “resource” bound).

Hi @Brenda.Stensland1

If you take a look at my post that documents the bulk conversion tool, I explain why a condition would fail to convert using the tool:

Window duration must be a multiple of the slide by interval. If we can’t find values that match your existing values while also obeying this rule, the settings will need to be reviewed and changed manually by a user.
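The rule above can be checked with a few lines of Python. This sketch covers only the "multiple of" rule as stated. The 30-second step and the helper names are assumptions for illustration, and the product may enforce further minimum or maximum limits that are not modeled here.

```python
def is_valid_pair(window_seconds, slide_seconds):
    """True if the window duration is a whole multiple of the slide-by interval."""
    return (
        0 < slide_seconds <= window_seconds
        and window_seconds % slide_seconds == 0
    )


def candidate_slide_by_intervals(window_seconds, step_seconds=30):
    """List slide-by values (in steps of `step_seconds`) that divide the window evenly."""
    return [
        s for s in range(step_seconds, window_seconds + 1, step_seconds)
        if window_seconds % s == 0
    ]


print(is_valid_pair(300, 60))   # 5-minute window, 1-minute slide
print(is_valid_pair(300, 90))   # 5-minute window, 90-second slide
print(candidate_slide_by_intervals(300))
```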

Keep in mind that the tool batches requests so several may fail because they were included in a batch with a single condition that can’t be converted. The tool includes a retry option that will let you narrow down that number so you can find the condition that actually needs to be addressed manually.
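The narrowing-down idea can be illustrated with a simple bisection in Python. To be clear, this is not the tool's actual retry logic, only a sketch of why retrying with smaller batches isolates the problem condition. `find_unconvertible` and `fake_convert_batch` are hypothetical names, and the sample conditions are made up.

```python
def find_unconvertible(conditions, convert_batch):
    """Split a failing batch in half repeatedly until single failing items remain."""
    if not conditions or convert_batch(conditions):
        return []
    if len(conditions) == 1:
        return list(conditions)
    mid = len(conditions) // 2
    return (
        find_unconvertible(conditions[:mid], convert_batch)
        + find_unconvertible(conditions[mid:], convert_batch)
    )


def fake_convert_batch(batch):
    """Stand-in for a batch request: it fails if any member breaks the multiple rule."""
    return all(window % slide == 0 for _, window, slide in batch)


# Made-up conditions as (id, window_seconds, slide_seconds).
batch = [(101, 300, 60), (102, 300, 90), (103, 600, 120), (104, 120, 60)]
print(find_unconvertible(batch, fake_convert_batch))
```

Here the whole batch fails at first, but only condition 102 is reported once the batch has been narrowed down.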

If you do wind up with a condition that needs to be manually converted, this article goes into detail on how to do that (see the bottom of the article for a step-by-step). Note that this process is exactly the same logic that is used in the bulk conversion tool, as well as the single-condition conversion button shown in the condition edit UI.

1 Like

Thanks for the detail, @Fidelicatessen, bear with me as I try to accommodate the shift from EOL “Sum of query” to SWA instead.

  • Ideally alert on 5 failures in a row—but if that’s not possible with NRQL—then Y is 30 minutes.
  • Just aim to alert as soon as the monitor fails 5 or more times.
  • Does it matter if I use sum() or count() in the NRQL?
  • Does it matter if I use “at least once in” or “for at least” for my use-case?

I’ve drawn up the below with my understanding so far. Would you mind confirming if it fits the bill for what I’m looking for?

Hi @rishav.dhar, I’m happy to help here

Ideally alert on 5 failures in a row—but if that’s not possible with NRQL—then Y is 30 minutes.

Change your window duration to 30 minutes, then.

Just aim to alert as soon as the monitor fails 5 or more times.
Does it matter if I use “at least once in” or “for at least” for my use-case?

If you want an incident to open as soon as 5 failures happen, use “at least once in X minutes,” where X is your slide by interval. Currently you have this set to 1 minute, which should work great, even once you change window duration to 30 minutes.

Does it matter if I use sum() or count() in the NRQL?

Using either will give you a total count over your window duration – it shouldn’t matter which you use.


Let’s talk about your Loss of Signal setting, now. “Signal is lost after 2 minutes” means that 2 minutes’ worth of successes will result in a “Signal lost” event that will close any open incident. If you want an open incident to close that soon, this is perfect. However it may lead to flapping if 5 failures have still occurred in the last half hour. I would recommend setting this to 30 minutes to reduce potential flapping behavior. That would equate to 30 minutes with no failures.

2 Likes

Many thanks, @Fidelicatessen. These are pretty tricky to configure just right so your help is super appreciated.

Especially for going as far as clarifying the Loss of Signal attribute in this context, as we’d have been just as lost there as well.

1 Like

I changed my mind on the answer I gave before.

sum() is meant to add together numeric values.

You want to use count(*), since you’re literally counting the number of times something happens.
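The same distinction, shown on plain Python data rather than NRQL: counting asks how many events matched, while summing adds up a numeric attribute of those events. The sample events and their `duration` values are made up for illustration.

```python
# Made-up check results; `duration` stands in for any numeric attribute.
events = [
    {"result": "FAILED", "duration": 120.0},
    {"result": "SUCCESS", "duration": 80.0},
    {"result": "FAILED", "duration": 300.0},
]

failed = [e for e in events if e["result"] == "FAILED"]

# Analogous to count(*): how many failures happened.
failure_count = len(failed)

# Analogous to sum(duration): the total of a numeric attribute across those failures.
total_duration = sum(e["duration"] for e in failed)

print(failure_count, total_duration)
```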

Am I right in thinking that Terraform templates that don’t use the already deprecated value_function are not affected by this?