Would you be able to look at the question I asked last week (above)?
Thomas, given the query you have shared, you are correct that as long as some data matching the WHERE clause is coming in, the loss of signal expiration timer will continue to be reset. However, if you were to add a FACET clause to that query, then N individual time series signals would be created, and each would be streamed and evaluated separately. In that case, if one of those individual time series streams stopped receiving data, that would trigger a loss of signal event.
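If it helps to see that in Terraform terms, here is a rough sketch of a faceted condition where each facet value gets its own signal and its own expiration timer. The resource name, policy reference, query, and thresholds are made up for illustration, and some required boilerplate is omitted.

# Hypothetical example: each host returned by the FACET becomes its own
# time series signal, and each signal has its own loss-of-signal timer.
resource "newrelic_nrql_alert_condition" "per_host_errors" {
  policy_id = newrelic_alert_policy.example.id   # assumed policy resource
  type      = "static"
  name      = "Errors per host"

  nrql {
    # Without the FACET, any matching data point resets the single timer.
    # With FACET host, a quiet host can expire on its own.
    query = "SELECT count(*) FROM TransactionError WHERE appName = 'MyApp' FACET host"
  }

  critical {
    operator              = "above"
    threshold             = 10
    threshold_duration    = 300
    threshold_occurrences = "at_least_once"
  }

  # Loss of signal: if a facet's stream goes quiet for 10 minutes, act on it.
  expiration_duration          = 600
  open_violation_on_expiration = true
}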
Does it have a way to distinguish between "signal loss" and "very low throughput"?
In our previous ham-fisted attempt at alerting on this type of issue by monitoring throughput, we still ran into too many false positives due to low-throughput apps.
Ravish,
Loss of Signal detection works as a "dead man's switch" that gets reset each time a data point arrives. If you have a low-throughput app, you will not want to set the loss of signal duration too low, and you may wish to use the "last value" gap filling strategy to keep the evaluated value the same until the next data point changes it. Additionally, you may try making your aggregation windows longer. This will result in aggregation windows having data more often, so fewer of them come up empty.
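To put those suggestions side by side, a rough Terraform sketch might look like the following. The values are illustrative rather than recommendations, and the resource and policy names, query, and thresholds are placeholders.

# Illustrative settings for a low-throughput signal (all values are assumptions).
resource "newrelic_nrql_alert_condition" "low_throughput" {
  policy_id = newrelic_alert_policy.example.id
  type      = "static"
  name      = "Low-throughput app latency"

  nrql {
    query = "SELECT average(duration) FROM Transaction WHERE appName = 'QuietApp'"
  }

  critical {
    operator              = "above"
    threshold             = 2
    threshold_duration    = 600
    threshold_occurrences = "all"
  }

  aggregation_window  = 300           # longer windows are more likely to contain data
  fill_option         = "last_value"  # carry the last value forward across empty windows
  expiration_duration = 1800          # don't declare loss of signal too aggressively
}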
Thanks. Can I clarify - if nothing is returned by the WHERE clause, does that count as signal lost?
@bgoleno Can you please clarify regarding thomas.elder's question:
If nothing is returned by the WHERE clause, does that count as signal lost?
Also, if the above happened for 5 consecutive minutes, and I have gap-filling strategy configured to fill in 0 for each such minute without data, will these still be counted as 5 minutes of lost signal?
The data points that make it past the WHERE clause are what go to the streaming platform, and they are considered the "signal" that streaming alerts sees. Every time streaming alerts sees a data point, it resets the Loss of Signal timer. Say you have a service that sends a true/false status beacon every 15 seconds, and it only sends "false" very sporadically. If your query is "count(*) WHERE status = 'false'", then that alert may open a violation if the count exceeds the threshold. When the status returns to "true", there would be no incoming data events to trigger the evaluation that closes the violation. One would need to use Loss of Signal to close that violation.
If you have gap filling set to static:0, and a Loss of Signal expiration duration of 10 minutes, then if there is a 6-minute gap between data points, when that second data point comes in, before evaluation is triggered, the empty aggregation windows will be filled with a 0, and the data will be evaluated.
If the Loss of Signal duration timer expires first, then we will consider that signal as lost, we will trigger whatever action you define, and clear out the buffers.
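Expressed as a Terraform condition, that scenario would look roughly like the sketch below. The event type, resource and policy names, and thresholds are hypothetical and only meant to show the static:0 gap filling plus 10-minute expiration combination.

# Hypothetical condition matching the scenario above:
# empty aggregation windows are filled with 0, and the signal is
# considered lost (buffers cleared, violations closed) after 10 minutes.
resource "newrelic_nrql_alert_condition" "status_false" {
  policy_id = newrelic_alert_policy.example.id
  type      = "static"
  name      = "Status beacon reports false"

  nrql {
    query = "SELECT count(*) FROM StatusBeacon WHERE status = 'false'"
  }

  critical {
    operator              = "above"
    threshold             = 0
    threshold_duration    = 60
    threshold_occurrences = "at_least_once"
  }

  fill_option = "static"
  fill_value  = 0      # a 6-minute gap is back-filled with zeros before evaluation

  expiration_duration            = 600   # 10-minute loss-of-signal timer
  close_violations_on_expiration = true  # use loss of signal to close the stuck violation
}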
At what time today are the New Relic One alerts being enabled? Did it already happen?
We started the incremental rollout at 6:00 am this morning (Pacific). The complete rollout process will take about a week. When your account is enabled, you will see a banner across the top that no longer says "opt-in".
Have to say, I'm surprised by the lack of mention of the dedicated Loss of Signal Alerts Migrator app in this thread; the tool seems very pertinent to this chat, and I only just came across it due to a single mention in an old email.
This causes a lot of headache for us, because it is incredibly confusing that a count(*) with a threshold of 1 on "edge-triggered failure events" never gets a value of 0 to reset.
We're having a similar issue where trying to alert on "edge-triggered failure events" (like a Kubernetes container restarting) may not work anymore. For example:
SELECT max(restartCount)-min(restartCount) FROM K8sContainerSample WHERE containerName = '${var.deployment}' and deploymentName = '${var.deployment}' and namespace = '${var.k8s_namespace}' facet podName
That used to alert whenever a pod's container changed its restartCount, but now it seems to just cause "loss of signal" constantly. We disabled the loss of signal warning on the alert, but we're not sure whether that fixed the alert or just made it do nothing.
It looks like most of my alerts were fixed by using the different gap filling methodologies. However, all of my alert conditions are creating false positives when the NRQL condition is no longer within the time range that I have set up. For example, any alert that uses a condition like:
hourOf(timestamp) in ('7:00','8:00','9:00','10:00','11:00','12:00')
ends up shooting out an alert right at the end of the time range. In this case, I would get an alert at around 12:01. This did not happen before. Any thoughts on how to prevent this behavior? I have already messed with different aggregation window and offset values with no success.
We've had to disable the "loss of signal" condition on some alerts, or change whether they open or close violations on loss of signal, to get them back to functional. In your case that could work as well, with the downside being that you wouldn't be able to tell if you lost signal inside the window you do care about...
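In Terraform terms, that roughly means not opening a violation on expiration for the time-windowed condition. A sketch, with the same caveat about losing visibility into gaps inside the window you care about; names, thresholds, and the operator are placeholders.

# Sketch only: a time-windowed condition that does not open a violation
# when the signal disappears at the end of the window (e.g. at 12:01).
resource "newrelic_nrql_alert_condition" "business_hours_only" {
  policy_id = newrelic_alert_policy.example.id
  type      = "static"
  name      = "Business-hours throughput"

  nrql {
    query = "SELECT count(*) FROM Transaction WHERE hourOf(timestamp) IN ('7:00','8:00','9:00','10:00','11:00','12:00')"
  }

  critical {
    operator              = "below"
    threshold             = 1
    threshold_duration    = 300
    threshold_occurrences = "all"
  }

  expiration_duration            = 600
  open_violation_on_expiration   = false  # don't alert just because the window ended
  close_violations_on_expiration = true   # but do clean up any open violation
}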
Accounts that have NRQL Conditions that may be monitoring for loss of signal will be enabled on October 28th. These are NRQL Conditions that either use the "Less Than" operator, or have an operator and threshold of "Equals 0".
So it is said here that this applies to NRQL conditions that either use the "<" operator, or use the "=" operator with a threshold value of 0.
What is the case for NRQL conditions with the ">" operator and a non-zero threshold value? Is there any difference?
NRQL alert conditions appear to be first-class citizens as far as automation goes, per the Terraform documentation:
The newrelic_nrql_alert_condition resource is preferred for configuring alerts conditions. In most cases feature parity can be achieved with a NRQL query. Other condition types may be deprecated in the future and receive fewer product updates.
- Terraform New Relic, Resource: newrelic_alert_condition
I wanted to focus in on the "feature parity" piece. How can I alert on baseline deviations across applications?
For example, I can create an APM > application metric baseline alert condition for response time and select a number of entities across my account. Easy. But how do I replicate that in a NRQL alert condition? The closest I've got is:
FROM Transaction SELECT average(duration) FACET appName
However, the following error message is returned when the "Baseline" threshold type is selected:
Enter a valid NRQL query above to see a threshold preview.
Baseline threshold type is not applicable for faceted queries.
That's frustrating and inconsistent. The NRQL is valid, as proven by its output in the query builder, and the behavior is inconsistent with the experience for both the "Static" and "Outlier" threshold types, both of which function as expected.
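For reference, the Terraform shape of what I am trying to do, which I would expect to hit the same limitation, is roughly the following; resource and policy names are placeholders, and the thresholds are just for illustration.

# This combination appears to be rejected: the baseline threshold type
# cannot currently be used with a faceted NRQL query.
resource "newrelic_nrql_alert_condition" "baseline_per_app" {
  policy_id          = newrelic_alert_policy.example.id
  type               = "baseline"
  baseline_direction = "upper_only"
  name               = "Response time deviation per app"

  nrql {
    query = "FROM Transaction SELECT average(duration) FACET appName"
  }

  critical {
    operator              = "above"
    threshold             = 3     # deviation threshold, illustrative only
    threshold_duration    = 300
    threshold_occurrences = "all"
  }
}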
Any suggestions would be appreciated; let me know if this should be addressed in a separate thread. Thanks.