Relic Solution: How Can I Figure Out When To Use Gap Filling and Loss of Signal?

Hi friends! I’m writing this about a month after the Streaming Alerts Platform started rolling out, and I wanted to proactively answer some questions that are coming up a lot in support tickets. If you’d like to learn more about these new features, read on!

First of all, let’s make sure to provide some documentation links here:

Now that you have some documentation to refer back to, let’s take a step back to discover when you’ll want to use these new settings. Before you can answer this question, you need to understand…

Query order of operations

By default, the aggregation window is 1 minute. You can change that, but let’s move forward with the assumption that you’re using the default setting.

Every minute, a window of collected data is aggregated using the function in the NRQL alert condition’s query. The query is parsed and executed by our systems in the following order:

  1. FROM clause – which event type needs to be grabbed?
  2. WHERE clause – what can be filtered out?
  3. SELECT clause – what information needs to be returned from the now-filtered data set?

Let’s take an example query and see what this means in practice.

SELECT count(*) FROM SyntheticCheck WHERE monitorName = 'My Cool Monitor' AND result = 'FAILED'

Let’s say that, for this minute, there are no failures.

The system would first grab all of the SyntheticCheck events on your account (FROM clause). It then filters through that mountain of events, looking only for the ones that match the monitor name and result that I’ve specified (WHERE clause). Once that is done, and this is very important:

If there are no events left after the first two steps, the SELECT clause will not be executed.

This means that, in a query like this one, aggregators like count() and uniqueCount() will never return a zero value. When the count would be 0, the SELECT clause is skipped and no data is returned, resulting in a value of NULL.
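To make the order of operations concrete, here is a minimal Python sketch of one aggregation window. It is only a toy model of the behavior described above, not New Relic's actual implementation; the function name evaluate_window, the dictionary-shaped events, and the use of None to stand in for NULL are all assumptions made for illustration.

```python
from typing import Optional

def evaluate_window(events: list, event_type: str, monitor_name: str, result: str) -> Optional[int]:
    """Toy model of one aggregation window for a count(*) query."""
    # 1. FROM: keep only events of the requested type.
    from_events = [e for e in events if e["eventType"] == event_type]
    # 2. WHERE: keep only events matching the filters.
    where_events = [e for e in from_events
                    if e["monitorName"] == monitor_name and e["result"] == result]
    # 3. SELECT: only runs if something survived the first two steps.
    if not where_events:
        return None  # stands in for NULL
    return len(where_events)

# One minute in which the monitor ran twice and never failed.
quiet_minute = [
    {"eventType": "SyntheticCheck", "monitorName": "My Cool Monitor", "result": "SUCCESS"},
    {"eventType": "SyntheticCheck", "monitorName": "My Cool Monitor", "result": "SUCCESS"},
]
print(evaluate_window(quiet_minute, "SyntheticCheck", "My Cool Monitor", "FAILED"))  # None, not 0
```

The point to notice is that the function has no path that returns 0: either at least one event matches and the count is 1 or more, or nothing matches and the result is None.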

In the past, New Relic would in some cases insert synthetic zeroes to cover over those NULL values, and in other cases would let the NULL value stand. With the Streaming Alerts Pipeline, New Relic never inserts a synthetic zero. Instead, you now have the power to configure what is done with all of those NULL values.

Does this mean I can never get a zero value in a NRQL alert condition?

Not at all! If you have a data source delivering legitimate numeric zeroes, the query will return that. Let’s look at that sort of example. Imagine that myCoolAttribute is an attribute which can sometimes be equal to 0:

SELECT average(myCoolAttribute) FROM MyCoolEvent

If, in the minute that is being evaluated, there is at least one MyCoolEvent event and the average value of all myCoolAttribute attributes from that minute is equal to zero, then a 0 value will be returned, not a NULL.

However, if there are no MyCoolEvent events during that minute, then a NULL will be returned (because of the order of operations).
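The difference between a legitimate 0 and a NULL can be sketched in a few lines of Python. This is an illustrative model only: window_average is a hypothetical helper, the list passed in represents the myCoolAttribute values from one minute of MyCoolEvent events, and None stands in for NULL.

```python
from typing import Optional

def window_average(values: list) -> Optional[float]:
    """Average of the attribute values seen in one window, or None for NULL."""
    if not values:
        # No events in this window, so there is nothing for SELECT to aggregate.
        return None
    return sum(values) / len(values)

print(window_average([0, 0, 0]))  # 0.0, a real numeric zero
print(window_average([]))         # None, because no events arrived
```

A threshold can be evaluated against the first result, but not against the second.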

OK, I think I get it. Now what’s the deal with Loss of Signal and Gap Filling?

Loss of Signal (LoS) and Gap Filling (GF) allow you to determine how New Relic’s Alerts Evaluation Service handles any NULL values that are returned by your query.

Loss of Signal: If you’re using a Synthetics query like the one I used as an example above, you’ll wind up getting a violation when there is a failure or series of failures. However, the violation will seem to never close! That’s because a 0 can never be returned by a count and a NULL can’t be evaluated numerically. However, if you already understand this behavior, you can plan for it by setting up a LoS. If I set up a LoS with a 10-minute window and check the box labeled Close all current open violations, then my open violation from getting that failure will only close once I’ve had 10 minutes with no further failures.

Keep in mind that Loss of Signal needs at least one non-NULL data point to kick in. If you create a new condition (or edit and save an old condition) that has nothing but NULL values, your LoS behavior won’t kick in until after a numeric data point gets evaluated.
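Here is a rough Python sketch of how that plays out minute by minute. It is a simplified model under several assumptions, not the real evaluation logic: the function name simulate_loss_of_signal and the state labels are made up for illustration, the threshold is a simple "above this value" check on a single data point, and the Close all current open violations box is assumed to be checked.

```python
def simulate_loss_of_signal(values: list, los_minutes: int, threshold: float = 0) -> list:
    """Return one state label per minute for a stream of numeric/None values."""
    states = []
    seen_numeric = False    # LoS needs at least one non-NULL data point first
    violation_open = False
    null_streak = 0
    for value in values:
        if value is None:
            null_streak += 1
            if not seen_numeric:
                states.append("waiting")        # nothing but NULLs so far
            elif null_streak >= los_minutes:
                violation_open = False          # LoS closes the open violation
                states.append("signal-lost")
            else:
                states.append("violating" if violation_open else "ok")
        else:
            seen_numeric = True
            null_streak = 0
            violation_open = value > threshold
            states.append("violating" if violation_open else "ok")
    return states

# Two empty minutes, one failure, then silence, with a 3-minute LoS window.
print(simulate_loss_of_signal([None, None, 1, None, None, None], los_minutes=3))
# ['waiting', 'waiting', 'violating', 'violating', 'violating', 'signal-lost']
```

In this model a stream that contains nothing but None values stays in the "waiting" state forever, which matches the caveat above about new or freshly saved conditions.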

Gap Filling: Gap fill pretty much does as it says on the tin: when a gap is detected, it will insert the values you specify. Those can be all 0s, some other static value, or whatever the last numeric value reported was.

Here’s the thing with Gap Filling: it needs to detect a gap before it kicks in. That means that there need to be two numeric data points separated by some stretch of NULL data points. Until that 2nd numeric data point shows up, the stretch of NULL values will possibly become a LoS if it lasts long enough to trip the LoS setting. So you need to have a start and an end to a gap. If the gap is shorter than your LoS setting (or if you have no LoS setting), once the 2nd data point shows up, the values you’ve specified will get inserted. This could lead to a violation opening or a violation closing, but it won’t happen until that 2nd numeric data point shows up.
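The "a gap needs a start and an end" rule can be sketched in Python like this. Treat it as my reading of the behavior described above rather than the actual implementation: fill_gaps, its parameter names, and the strategy strings are hypothetical, None stands in for NULL, and gaps that are not shorter than the LoS setting are simply left unfilled here.

```python
from typing import Optional

def fill_gaps(values: list, strategy: str = "static", static_value: float = 0,
              los_minutes: Optional[int] = None) -> list:
    """Fill runs of None that sit between two numeric data points."""
    filled = list(values)
    last_index = None  # position of the most recent numeric data point
    for index, value in enumerate(values):
        if value is None:
            continue
        if last_index is not None:
            gap_length = index - last_index - 1
            # Only fill if there is no LoS setting or the gap is shorter than it.
            if los_minutes is None or gap_length < los_minutes:
                fill = values[last_index] if strategy == "last_value" else static_value
                for gap_index in range(last_index + 1, index):
                    filled[gap_index] = fill
        last_index = index
    return filled

stream = [None, 5, None, None, 3, None]
print(fill_gaps(stream))                         # [None, 5, 0, 0, 3, None]
print(fill_gaps(stream, strategy="last_value"))  # [None, 5, 5, 5, 3, None]
```

Notice that the leading and trailing None values are never touched: the first has no numeric point before it, and the last has no 2nd numeric data point after it yet.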

If you would like to know more about use-cases, and when to use these different aggregation methods, please head over to this article.

Phew! That was a lot to cover! If any of this is unclear, please post your questions below!


NOTE: If you need a 0 value in your alert condition for whatever reason, you can work around the query order of operations detailed in the first part of the above article by using this method.

Include your filtering WHERE clause in a filter sub-clause in your query. Here’s an example:

SELECT filter(count(*), WHERE result = 'FAILED') FROM SyntheticCheck

You can adjust this to work with most queries; just remember to include all of your filter elements inside the parentheses in the filter sub-clause.

Since there are definitely results for SyntheticCheck (or, presumably, whichever event type you select), the SELECT clause gets run and will return a 0 if there are no events that match the filters.

You could include filters outside of the parentheses, as well, but keep in mind that those filters may stop the SELECT clause from running, so in those cases you’ll still need to have a Loss of Signal setting to cover times when NULL is returned by the query.
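As a rough illustration of why this works, here is a small Python model of a filtered count. It is a sketch only: filtered_count and its two predicate parameters are hypothetical names, the events are plain dictionaries, and None stands in for NULL. The outer predicate plays the role of a WHERE clause outside the parentheses, and the inner predicate plays the role of the WHERE inside filter().

```python
from typing import Callable, Optional

def filtered_count(events: list,
                   inner_filter: Callable[[dict], bool],
                   outer_where: Callable[[dict], bool] = lambda e: True) -> Optional[int]:
    """Count events matching inner_filter, among those that pass outer_where."""
    remaining = [e for e in events if outer_where(e)]
    if not remaining:
        # Nothing reached SELECT, so the result is still NULL.
        return None
    # SELECT runs, so a count of 0 is a real, numeric 0.
    return sum(1 for e in remaining if inner_filter(e))

checks = [{"monitorName": "My Cool Monitor", "result": "SUCCESS"}]
is_failed = lambda e: e["result"] == "FAILED"

print(filtered_count(checks, is_failed))  # 0
print(filtered_count(checks, is_failed,
                     outer_where=lambda e: e["monitorName"] == "Other Monitor"))  # None
```

The second call shows the caveat from the paragraph above: a filter outside the parentheses can still empty the data set, and then you are back to needing a Loss of Signal setting.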

