Announcing: New Relic One Streaming Alerts for NRQL conditions

@bgoleno

Would you be able to look at the question I asked last week (above)?


Thomas, given the query that you have shared, you are correct: as long as some data matching the WHERE clause is coming in, the loss of signal expiration timer will keep being reset. However, if you were to add a FACET clause to that query, then N individual time series signals would be created, and each would be streamed and evaluated separately. In that case, if one of those individual time series streams stopped receiving data, that would trigger a loss of signal event.
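
For illustration, a minimal sketch of what that could look like as a faceted condition, assuming the newrelic_nrql_alert_condition Terraform resource that comes up later in this thread (all names, IDs, and thresholds below are placeholders, and exact attribute names may vary by provider version):

# Hypothetical sketch: FACET host turns one query into N independent signals,
# and each host's signal gets its own loss of signal expiration timer.
resource "newrelic_nrql_alert_condition" "per_host_errors" {
  policy_id                    = newrelic_alert_policy.example.id
  name                         = "Errors per host"
  type                         = "static"
  violation_time_limit_seconds = 3600

  expiration_duration          = 600   # seconds without data before a facet's signal counts as lost
  open_violation_on_expiration = true

  nrql {
    query = "SELECT count(*) FROM TransactionError WHERE appName = 'My App' FACET host"
  }

  critical {
    operator              = "above"
    threshold             = 10
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}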

Does it have a way to distinguish between “signal loss” and “very low throughput”?

In our previous ham-fisted attempt at alerting on this type of issue by monitoring throughput, we still ran into too many false positives from low-throughput apps.

Ravish,
Loss of Signal detection works as a “dead man's switch” that gets reset each time a data point arrives. If you have a low-throughput app, you will not want to set the loss of signal duration too low, and you may wish to use the “last value” gap-filling strategy to keep the evaluated value the same until the next data point changes it. Additionally, you may try making your aggregation windows longer, so that each window is more likely to contain at least one data point and windows have data more often.
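
As a rough sketch of what that could look like for a low-throughput signal, again assuming the newrelic_nrql_alert_condition Terraform resource (names and values below are illustrative placeholders, not a recommendation for any particular app):

# Illustrative only: "last value" gap filling, a longer aggregation window,
# and a generous loss of signal duration for a low-throughput signal.
resource "newrelic_nrql_alert_condition" "low_throughput_example" {
  policy_id                    = newrelic_alert_policy.example.id
  name                         = "Slow responses (low-throughput app)"
  type                         = "static"
  violation_time_limit_seconds = 3600

  aggregation_window  = 300          # 5-minute windows are more likely to contain a data point
  fill_option         = "last_value" # carry the previous value across empty windows
  expiration_duration = 1800         # wait 30 minutes before declaring loss of signal

  nrql {
    query = "SELECT average(duration) FROM Transaction WHERE appName = 'Low Traffic App'"
  }

  critical {
    operator              = "above"
    threshold             = 2
    threshold_duration    = 900
    threshold_occurrences = "all"
  }
}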

@bgoleno

Thanks. Can I clarify: if nothing is returned by the WHERE clause, does that count as loss of signal?

@bgoleno Can you please clarify regarding thomas.elder’s question:
If nothing is returned by the WHERE clause, does that count as loss of signal?
Also, if the above happened for 5 consecutive minutes, and I have a gap-filling strategy configured to fill in 0 for each such minute without data, will those still be counted as 5 minutes of lost signal?

The data points that make it past the WHERE clause are what go to the streaming platform, and they are considered the “signal” that streaming alerts sees. Every time streaming alerts sees a data point, it resets the loss of signal timer. Say you have a service that sends a true/false status beacon every 15 seconds, and it only sends “false” very sporadically. If your query is count(*) WHERE status = 'false', then that condition may open a violation when the count exceeds the threshold. When the status returns to “true”, there are no incoming data events to trigger the evaluation that would close the violation, so you would need to use loss of signal to close it.

If you have gap filling set to static: 0, and a loss of signal expiration duration of 10 minutes, then a 6-minute gap between data points works like this: when the next data point comes in, before evaluation is triggered, the empty aggregation windows are filled with 0, and the data is evaluated.
If the loss of signal expiration timer expires first, then we consider that signal lost, trigger whatever action you have defined, and clear out the buffers.
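
Putting those pieces together, a hedged Terraform sketch of the status beacon example above might look like the following (StatusBeacon is a hypothetical event type, IDs are placeholders, and attribute names are as exposed by the newrelic_nrql_alert_condition resource):

# Sketch of the status beacon example: count "false" beacons, fill empty
# windows with 0, and let loss of signal close any open violation.
resource "newrelic_nrql_alert_condition" "status_false_beacon" {
  policy_id                    = newrelic_alert_policy.example.id
  name                         = "Status beacon reporting false"
  type                         = "static"
  violation_time_limit_seconds = 3600

  fill_option = "static"
  fill_value  = 0       # empty windows inside the expiration duration evaluate as 0

  expiration_duration            = 600   # 10 minutes, per the example above
  close_violations_on_expiration = true  # loss of signal closes the violation

  nrql {
    query = "SELECT count(*) FROM StatusBeacon WHERE status = 'false'"
  }

  critical {
    operator              = "above"
    threshold             = 0
    threshold_duration    = 60
    threshold_occurrences = "at_least_once"
  }
}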


At what time today are the New Relic One alerts being enabled? Did it already happen?

We started the incremental rollout at 6:00 am Pacific this morning. The complete rollout process will take about a week. When your account is enabled, you will see a banner across the top that no longer says “opt-in”.

Have to say, I’m surprised by the lack of mention of the dedicated Loss of Signal Alerts Migrator app in this thread; the tool seems very pertinent to this chat and I only just came across it due to a single mention in an old email.


This causes a lot of headaches for us, because it is incredibly confusing that a count(*) condition with a threshold of 1 on “edge-triggered failure events” never gets a value of 0 to reset. :frowning:

We’re having a similar issue where trying to alert on “edge-triggered failure events” (like a Kubernetes container restarting) may not work anymore. Ex:

SELECT max(restartCount)-min(restartCount) FROM K8sContainerSample WHERE containerName = '${var.deployment}' and deploymentName = '${var.deployment}' and namespace = '${var.k8s_namespace}' facet podName

That used to alert whenever a pod’s container changed its restartCount, but now it seems to just cause “loss of signal” constantly. We disabled the loss of signal warning on the alert, but we’re not sure whether that fixed the alert or just made it do nothing.

It looks like most of my alerts were fixed by using the different gap-filling methodologies. However, all of my alert conditions create false positives when the NRQL condition is no longer within the time range that I have set up. For example, any alert that uses a condition like:

hourOf(timestamp) in ('7:00','8:00','9:00','10:00','11:00','12:00')

ends up shooting out an alert right at the end of the time range. In this case, I would get an alert at around 12:01. This did not happen before. Any thoughts on how to prevent this behavior? I have already messed with different aggregation window and offset values with no success.

We’ve had to disable the “loss of signal” setting on some alerts, or change whether they open or close a violation on loss of signal, to get them back to being functional. In your case that could work as well, with the downside being that you wouldn’t be able to tell if you lost signal inside the window you do care about…
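
For reference, what we mean by “disabling” it is roughly the following sketch in Terraform terms (assuming the newrelic_nrql_alert_condition resource; names and values are illustrative, and omitting expiration_duration entirely is another way to leave loss of signal unconfigured):

# Keep the expiration timer but take no action when it fires.
resource "newrelic_nrql_alert_condition" "business_hours_only" {
  policy_id                    = newrelic_alert_policy.example.id
  name                         = "Business hours throughput"
  type                         = "static"
  violation_time_limit_seconds = 3600

  expiration_duration            = 600
  open_violation_on_expiration   = false  # don't open a violation on loss of signal
  close_violations_on_expiration = false  # don't force-close on loss of signal

  nrql {
    query = "SELECT count(*) FROM Transaction WHERE hourOf(timestamp) IN ('7:00','8:00','9:00','10:00','11:00','12:00')"
  }

  critical {
    operator              = "below"
    threshold             = 1
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}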

Accounts that have NRQL Conditions that may be monitoring for loss of signal will be enabled on October 28th. These are NRQL Conditions that either use the “Less Than” operator, or have an operator and threshold of “Equals 0”.

So it says here that the conditions affected are those that use the “<” operator, or the “=” operator with a threshold value of 0.
What about NRQL conditions with the “>” operator and a non-zero threshold value? Is there any difference for those?

NRQL alert conditions appear to be first-class citizens as far as automation goes, per the Terraform documentation:

The newrelic_nrql_alert_condition resource is preferred for configuring alerts conditions. In most cases feature parity can be achieved with a NRQL query. Other condition types may be deprecated in the future and receive fewer product updates.
—Terraform New Relic, Resource: newrelic_alert_condition

I wanted to focus in on the “feature parity” piece. How can I alert on baseline deviations across applications?

For example, I can create an APM > application metric baseline alert condition for response time and select a number of entities across my account. Easy. But how do I replicate that in a NRQL alert condition? The closest I’ve got is:

FROM Transaction SELECT average(duration) FACET appName

However, the following error message is returned when threshold type “Baseline” is selected:

Enter a valid NRQL query above to see a threshold preview.
Baseline threshold type is not applicable for faceted queries.

That’s frustrating and inconsistent. The NRQL is valid, as shown by its output in the query builder, and the behavior is inconsistent with the experience for the “Static” and “Outlier” threshold types, both of which function as expected.
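
The closest non-faceted equivalent I can think of is one baseline condition per application, which is clunky to manage but seems to satisfy the restriction. A minimal sketch, assuming the baseline options exposed by the newrelic_nrql_alert_condition resource (app name, IDs, and thresholds are placeholders):

# One un-faceted baseline condition per application.
resource "newrelic_nrql_alert_condition" "response_time_baseline_app_a" {
  policy_id                    = newrelic_alert_policy.example.id
  name                         = "Response time baseline - App A"
  type                         = "baseline"
  baseline_direction           = "upper_only"
  violation_time_limit_seconds = 3600

  nrql {
    query = "FROM Transaction SELECT average(duration) WHERE appName = 'App A'"
  }

  critical {
    operator              = "above"
    threshold             = 3     # deviations above the baseline
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}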

Any suggestions would be appreciated, or let me know if this should be addressed in a separate thread. Thanks.