Delayed Slack Alerts

Applied Intelligence Question Template

Hello. We have decided to give New Relic One a go and use it as a monitoring tool for our multi-account AWS production infrastructure, as well as for logging. I have created an alert which notifies me once a specific NRQL query returns something. I noticed that it took almost 30 minutes for the Slack message to be delivered. I feel like something is wrong and I would like to know if there is a problem.

I would appreciate it if you could check the events and explain the 30-minute delay to me. Here is a link to the incident.

Thanks!!!

Alerts Incidents | New Relic One

@developer183

The TL;DR version: you will want the Event Timer aggregation method, and Loss of Signal configured to close open violations.

This condition is using the Event Flow aggregation method but receives data infrequently. Event Flow needs 2 data points before it will aggregate the data, so if the ingested data is ‘gappy’ this can cause a long delay in incidents opening. After 30 minutes, if a second data point has not been received, the first data point falls into a ‘stale’ data category and is no longer considered. What this means is that if your data is not coming in fairly consistently, or has large windows of time where it may not send anything to New Relic, then you will want to use the Event Timer aggregation method.

Choose your aggregation method
New aggregation methods for NRQL alert conditions
Relic Solution: How Can I Figure Out Which Aggregation Method To Use?
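
A quick way to confirm how ‘gappy’ the signal actually is, before choosing a method, is to run the same FROM/WHERE clause as your condition with a TIMESERIES clause and look for empty buckets. This is just a sketch; the Log event type and the WHERE clause below are placeholders for whatever your condition actually queries:

SELECT count(*) FROM Log WHERE message LIKE '%ERROR%' TIMESERIES 1 minute SINCE 3 hours ago

If most of the 1-minute buckets come back empty, Event Flow will struggle to find the second data point it needs, and Event Timer is the safer choice.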

In addition, I see that you are using count(*) with a threshold above 0 to open a violation. This means you need a 0 to close it. New Relic used to insert synthetic zeroes as results for some queries, but ever since Streaming Alerts for NRQL conditions were introduced, synthetic zeroes are no longer inserted (the rules for when they were inserted were obtuse and opaque, so now you get exactly what the query returns).

If you have a threshold that requires a value of “0” in order to open or close a violation, it’s possible your query may not ever return a “0”. Rather, it may be returning the absence of a value, or null. A null response cannot be compared against a numeric threshold.

Our alerting system processes the results of your NRQL query through a strict order of operations. We first execute the FROM/WHERE portion of the query and determine whether any results match those criteria. If no results match, the SELECT statement is not executed at all. This is when your query will return null. (The preview chart may show that your query is returning “0”, but the preview chart is not beholden to this order of operations.)
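
To make that concrete for a condition shaped like yours (the Log event type and WHERE clause here are only placeholders), a query such as

SELECT count(*) FROM Log WHERE message LIKE '%ERROR%'

returns null, not 0, for any aggregation window in which no events match the WHERE clause. The condition can therefore open when the count goes above 0, but it will never see the 0 it would need to drop back under the threshold and close.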

In order to open or close a violation on a complete absence of data, you will need to configure a loss of signal threshold, and check the box to either open a new violation or close existing violations when the query doesn’t return any results.

That being said, it is still possible for a query to return “0”. Here are a few circumstances in which that can happen:

  • If you move your query’s WHERE clause to the SELECT statement by using the filter() function, your query can return “0” if the FROM clause matches any data. For example, this query will return a “0” if there is no data from this particular monitor, but there is some SyntheticCheck data reporting:

FROM SyntheticCheck SELECT filter(count(*), WHERE monitorName = 'My Monitor')

Conversely, this query would return null, because nothing matches the FROM/WHERE portion:

SELECT count(*) FROM SyntheticCheck WHERE monitorName = 'My Monitor'

  • If your data source is explicitly returning a value of “0”, aggregator functions that can return the explicit value of an attribute, such as latest(), max(), min(), and average(), will process that as a “0” appropriately (see the example queries after this list).

  • If you have an attribute that exists in your data but it has no corresponding value, OR 0 will allow you to turn the null into an explicit 0 (e.g. “myAttribute OR 0”).
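
As a concrete sketch of those last two points (the MyCustomEvent event type and errorCount attribute are hypothetical; substitute your own), both of these queries will evaluate to an explicit 0 rather than null as long as at least one matching event arrives in the aggregation window:

SELECT latest(errorCount) FROM MyCustomEvent
SELECT latest(errorCount OR 0) FROM MyCustomEvent

In the first case the event itself has to report errorCount = 0; in the second, OR 0 substitutes a 0 when the event exists but the attribute has no value. Neither helps when no events arrive at all, which is exactly what the loss of signal settings are for.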

For further reading, see this:

Since the condition needs a 0 to close, this will result in an incident remaining open and future violations rolling up into it. Notifications are only sent when an incident opens, closes, or is acknowledged; violations do not send notifications. This means the previous incident would need to have closed for this violation to trigger a new incident.

What I would recommend for this condition (and possibly others like it) is to set up a Loss of Signal (LoS) threshold for the same duration as the condition's threshold, and configure it to close all open violations when it is met. This will result in violations that close automatically, as you'd expect them to.