Announcing: New Relic One Streaming Alerts for NRQL conditions

New Relic is rolling out a new, unified streaming alerts platform for New Relic One. This new streaming alerts platform will power NRQL Alert Conditions, and over the next year, all alert condition types will be consolidated into NRQL conditions.

New Relic One Streaming Alerts delivers:

  • More reliable alerting that is far less susceptible to data latency and processing lag.
  • Increased accuracy of the data points that are being evaluated.
  • Reduced time-to-detect through improvements in the streaming algorithm and configurable aggregation duration.
  • Greater control over the signals being monitored. You can specify how to evaluate signal gaps, when to consider a signal as lost, and what actions should be taken.
  • Consistent behavior and configuration of Alert conditions regardless of the telemetry type, source of the signal being monitored, or specifics of your NRQL query.
  • Increased scalability in the number of time series that an Alert Condition can monitor and in the total number of conditions that can be configured.

Opt-in Migration
When we roll out this new streaming platform, there is a change in behavior in how we process aggregation time windows that do not have data. If you are monitoring for when a signal goes to “0” in order to determine whether an entity has stopped reporting, this approach will no longer work after moving to the new platform. To maintain this functionality and prevent false negatives, you must enable Loss of Signal detection on these conditions before your account is moved. You may opt in to this new platform now. Read more about the rollout plan in the FAQ section below.

Increased Reliability and Accuracy
This new streaming platform upgrades the streaming algorithm to an event-based mechanism that uses the incoming data points to move the streaming aggregation windows forward. The current model uses the clock on the server to trigger aggregation. With the new approach, an aggregation window will wait until the related data points arrive, greatly decreasing any negative effects caused by lag in a data stream. This will also greatly reduce alert latency and improve accuracy for Cloud Integrations that use a polling-based integration.
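To make the difference concrete, here is a minimal Python sketch. It is not New Relic's implementation: it assumes 60-second windows and data points that carry both an event timestamp and a later arrival timestamp, and the function names `clock_driven` and `event_driven` and the sample data are invented for illustration.

```python
from collections import defaultdict

WINDOW = 60  # aggregation window length in seconds (illustrative)

def window_start(ts):
    """Start of the window that an event timestamp falls into."""
    return ts - ts % WINDOW

def clock_driven(points):
    """The server clock closes each window at its end time.
    Points that arrive after that moment are missed."""
    sums = defaultdict(float)
    for event_ts, arrival_ts, value in points:
        start = window_start(event_ts)
        if arrival_ts <= start + WINDOW:
            sums[start] += value
    return dict(sums)

def event_driven(points):
    """A window closes only once a point with a later event timestamp
    has moved the stream past the end of that window."""
    open_sums, closed, newest = defaultdict(float), {}, 0
    for event_ts, _arrival_ts, value in sorted(points, key=lambda p: p[1]):
        start = window_start(event_ts)
        if start in closed:
            continue  # window already evaluated
        open_sums[start] += value
        newest = max(newest, event_ts)
        for s in sorted(open_sums):
            if s + WINDOW <= newest:
                closed[s] = open_sums.pop(s)
    return closed, dict(open_sums)

# (event_ts, arrival_ts, value): every point arrives 90 seconds late,
# as might happen with a polling-based integration.
points = [(10, 100, 1.0), (30, 120, 2.0), (70, 160, 4.0), (130, 220, 8.0)]
late_with_clock = clock_driven(points)
closed, still_open = event_driven(points)
```

In this sample the clock-driven version sees no data at all, because every window has already closed by the time its points arrive, while the event-driven version aggregates the first two windows and keeps the third open until newer data moves the stream forward.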

Configurable Gap Filling Strategies
Not all signals or time series that are being monitored have a consistent flow of data points. The streaming alerts platform evaluates time windows of a specified duration. In many cases, the telemetry signals you send to New Relic will have gaps, meaning that some time windows will not have data. With the new streaming platform, you can specify how we should evaluate those gaps. You can also set different gap filling strategies, sometimes called extrapolation strategies, for each alert condition.
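As a rough sketch of what a gap filling strategy does, the Python below fills empty windows (represented as `None`) that sit between two data points. The strategy names `none`, `static`, and `last_value` and the function `fill_gaps` are assumptions made for this example, not a description of the platform's internals.

```python
def fill_gaps(windows, strategy="none", static_value=0.0):
    """Return a copy of `windows` with empty (None) windows filled.

    Only gaps that are followed by a later data point are filled;
    trailing gaps are left empty because no later point bounds them yet.
    """
    if strategy not in ("none", "static", "last_value"):
        raise ValueError(f"unknown strategy: {strategy}")
    # Index of the last window that holds real data (-1 if there is none).
    last_idx = max((i for i, v in enumerate(windows) if v is not None), default=-1)
    filled, last = [], None
    for i, value in enumerate(windows):
        if value is not None:
            last = value
        elif i < last_idx and strategy == "static":
            value = static_value
        elif i < last_idx and strategy == "last_value":
            value = last  # stays None if no data point has been seen yet
        filled.append(value)
    return filled

windows = [5.0, None, None, 7.0, None]
unfilled = fill_gaps(windows)                      # gaps stay empty
static = fill_gaps(windows, "static", 0.0)         # gaps become 0.0
carried = fill_gaps(windows, "last_value")         # gaps repeat 5.0
```

Because the strategy is an argument, each condition can choose its own behavior, which mirrors the per-condition setting described above.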

Loss Of Signal Detection
The NR One Streaming Alerts Platform now provides official support for Loss of Signal Detection. While there are workarounds to achieve this in the current platform, they are inconsistent, and the shift to an event-based streaming algorithm disables those workarounds. With configurable Loss of Signal Detection, on any NRQL Alert Condition, you simply specify how many seconds we should wait from the time we saw the last data point before we consider that signal to be lost. Once that time expires, you can choose to be notified of the Loss of Signal, or you can simply close any open violations if you expect the entity or signal to go away.

Faster alerts (Sub-minute time-to-detect)
With the NR One Streaming Alert Platform, all telemetry data can be evaluated in sub-minute timeframes. We will allow you to configure the aggregation duration down to as low as 5 seconds, and increase it to a maximum of 15 minutes. This, combined with the benefits of the event-driven streaming algorithm, will allow you to achieve sub-minute time-to-detect while increasing both accuracy and reliability. Depending on your data configuration and the requirements of your scenario, you can achieve a time-to-detect as low as 10-15 seconds.

--------- Frequently Asked Questions. -----------

Q: When is this available?
A: You can opt in to enable New Relic Streaming Alerts on NRQL conditions now.
We plan to enable the majority of accounts the week of October 5th.
Accounts that have NRQL Conditions that may be monitoring for loss of signal will be enabled on October 28th. These are NRQL Conditions that either use the “Less Than” operator, or have an operator and threshold of “Equals 0”.

Q: How do I request to have our account(s) enabled?
A: Simply complete this form: https://sgnf.typeform.com/to/FkUEMwBP
We will be enabling accounts in batches on Tuesdays, Wednesdays, and Thursdays.
Please specify when you would like for your accounts to be enabled, and let us know if you have questions. You may also discuss this with your account team.

Q: How will I know if my account has been enabled?
A: When we roll this out the week of 10/5, there will be a banner on the Policies page and the NRQL Condition create/edit page. If your account is not enabled, the banner will ask you to enable New Relic One Streaming Alerts, and link you back to this document.

Q: Is there any documentation?
A: Yes. An overview of Loss of Signal and Gap Filling Strategies, along with how to configure them in GraphQL, is documented here: NerdGraph API: Loss of signal and gap filling.
Additional documentation will be published very soon, and this section will be updated.

Q: How do I manage these features?
A: You can configure these features on NRQL Conditions using the UI, GraphQL API for NRQL Conditions, and the REST API for NRQL Conditions.

Q: Can I configure these settings before having the new streaming platform enabled?
A: Yes. If you are opting in before 10/5, we can enable the UI for you before you enable the account. This will allow you to update your NRQL conditions, if needed, before the features are enabled. After the week of 10/5, all accounts will have access to the UI and APIs. If your account is not enabled during that week, you can use the UI and API to update any alert conditions before having these new features enabled.

Q: Will the NR One Streaming Alerts Platform cover all alerting services?
A: Only NRQL Conditions will receive the full set of New Relic One Streaming Alerts functionality. APM, Infrastructure, and Synthetics alerts will be migrated to NRQL Conditions over the course of the year.

Q: Are all of the features mentioned above available?
A: Gap Filling and Loss of Signal Detection are available now. The remaining features, configurable aggregation duration and the event-based streaming algorithm, will be released incrementally throughout the rollout period.

Q: Will this eliminate false positives?
A: No, but this should greatly reduce false positives. Eliminating false positives and false negatives is an audacious goal that all alerting engines pursue, and one we continue to work toward.

Additionally, Loss of Signal Detection is monitoring for the absence of data for a period of time. Whenever clock time is involved, there is a higher chance of false positives when there is significant disruption to the flow of data. If there is known latency within the New Relic platform, we take that into consideration, but that does not address all possible signal disruptions between the data collection source and the New Relic One Streaming Alert Platform.

Q: I have more questions. How can I get answers?
A: Please reach out to your account teams if you have questions or concerns.
Alternatively, you can ask questions in the discussion area below, and a New Relic community leader will answer. For a deeper dive into what is new, and how to best use these new features, sign up for New Relic Nerd Days on October 13, and check out the Alerts session at 2:00 PM PST. I will share the recording here afterwards.

Nice. Is it correct to assume that there is also an agent version dependency to obtain Sub-minute time-to-detect functionality?

Colonel Steve Austin,
Yes, for true, end-to-end sub-minute time-to-detect, you will need an agent, whether a New Relic agent or an open-source agent, that has a short flush cycle.

@bgoleno

If I’m not mistaken, the minimum New Relic agent versions required for this functionality are as follows:

[image: minimum agent versions]

I think it would be helpful to note the dependency as customers interested in making use of the new functionality may also need to upgrade agents.

We released the REST API controls for these new functions on 09/25/2020.
If you are using the REST API for NRQL Conditions to manage your alerts, PLEASE update your scripts with the new functions.

Am I looking at the right page of the API docs for this new implementation?

For the REST documentation, yes.
These are the parameters:
    "signal": {
      "aggregation_window": "string",
      "evaluation_offset": "string",
      "fill_option": "string",
      "fill_value": "string"
    },
    "expiration": {
      "expiration_duration": "string",
      "open_violation_on_expiration": "boolean",
      "close_violations_on_expiration": "boolean"
    }
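As a rough illustration of how a script might fill in these two blocks, here is a Python sketch that builds them and serializes them to JSON. The field names come from the list above; every value (the `static` fill option, the durations, the offset) is an illustrative assumption, so please check the REST documentation for the accepted values and for the rest of the condition payload, which is not shown here.

```python
import json

# Only the two new blocks are built here. The surrounding NRQL condition
# payload is intentionally omitted because it is not shown in this thread.
new_settings = {
    "signal": {
        "aggregation_window": "60",   # assumed: seconds, passed as a string
        "evaluation_offset": "3",     # assumed example value
        "fill_option": "static",      # assumed option name
        "fill_value": "0",            # value used when filling gaps
    },
    "expiration": {
        "expiration_duration": "120",            # assumed: seconds, as a string
        "open_violation_on_expiration": True,    # notify on loss of signal
        "close_violations_on_expiration": False,
    },
}

body = json.dumps(new_settings, indent=2)
print(body)
```

Numeric settings are passed as strings here only because the parameter list above describes them as "string".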

GraphQL docs and explanatory content are here: https://docs.newrelic.com/docs/alerts-applied-intelligence/new-relic-alerts/alerts-nerdgraph/nerdgraph-api-loss-signal-gap-filling

I noticed that evaluationOffset is being moved under the ‘signal’ container. Does this mean there’s a major version bump coming to the NerdGraph APIs? This will definitely be a breaking change for me.

Fear not @shaun.titus. It exists in both places for backwards compatibility. The place that you are using is marked as deprecated, but still supported. When you make future updates, like to support the new loss of signal and gap filling strategies, you may wish to update this location.

Can you please clarify the relationship and differences between expirationDuration and thresholdDuration?

For example, in your loss of signal documentation page, https://docs.newrelic.com/docs/alerts-applied-intelligence/new-relic-alerts/alerts-nerdgraph/nerdgraph-api-loss-signal-gap-filling#h3-create-a-new-condition-with-loss-of-signal-settings, it says an alert will be created if no signal is received for 2 min, which matches the 120 seconds for expirationDuration setting, but what about the thresholdDuration: 600 value on the same example?

Thank you

Hi @Tiago.DeConto

I answered this question in your support ticket as well, but I wanted to post it here so that other community members can benefit.

In that example, 120 seconds is the expiration duration. The action to take is openViolationOnExpiration. This means that, if no signal is evaluated in 120 seconds, a Loss of Signal violation will be opened. This has nothing to do with the threshold duration (600 seconds), since that determines how long the evaluation system will wait to open a violation given data that is breaching the threshold. These are two different types of violation: one opens if no data is sent for 2 minutes, the other opens if violating data is sent for 10 minutes.

As an example, imagine the evaluation system in this case sees 8 consecutive data points below 2 (imagine a string of 0 and 1 values – breaching the threshold). Aggregation windows are 1 minute long by default, so this is 8 minutes’ worth. It’s keeping track, and is waiting to see another 2 data points below 2 before it opens a violation. Imagine then, for minute 9, no data comes in (this is considered a NULL value and can’t be numerically evaluated vs. a threshold). That resets the clock for the threshold timer (it needed 600 seconds consecutively with breaching data), but it starts the clock for the expiration duration timer. If one more minute passes with no data, a Signal Lost violation will open.

This is also explained in these docs, with this passage:

Loss of signal expiration time is independent of the threshold duration, and triggers as soon as the timer expires.
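Here is the same scenario as a small Python sketch, in case it helps to see the two timers side by side. It is a simplification that counts whole 1-minute windows rather than real clock time, and the function name `evaluate` and the event labels are made up for this example.

```python
def evaluate(windows, threshold=2.0, threshold_duration=600,
             expiration_duration=120, window_seconds=60):
    """Walk through 1-minute windows and report which violation opens when.

    `windows` holds one value per minute; None means no data arrived.
    Returns a list of (minute, kind) tuples.
    """
    events = []
    breaching = 0  # consecutive seconds of data below the threshold
    silent = 0     # consecutive seconds without any data
    for minute, value in enumerate(windows, start=1):
        if value is None:
            breaching = 0  # NULL cannot be compared, so the streak resets
            silent += window_seconds
            # Fire once, at the moment the expiration duration is reached.
            if silent >= expiration_duration > silent - window_seconds:
                events.append((minute, "loss_of_signal"))
        else:
            silent = 0
            breaching = breaching + window_seconds if value < threshold else 0
            # Fire once, at the moment the threshold duration is reached.
            if breaching >= threshold_duration > breaching - window_seconds:
                events.append((minute, "threshold"))
    return events

# 8 breaching minutes, then 2 minutes with no data at all.
signal_stops = evaluate([0, 1, 0, 1, 1, 0, 0, 1, None, None])
# 10 breaching minutes in a row.
keeps_breaching = evaluate([0, 1] * 5)
```

In the first run only the Loss of Signal violation opens, at minute 10, because the missing data at minute 9 reset the threshold timer. In the second run only the threshold violation opens, also at minute 10.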

I hope this helps to better understand how the expiration duration and threshold duration work together.

Great info @Fidelicatessen. Thanks for posting your answer. Can you add how filling data gaps fits into this example?

@cgambrell

Absolutely!

If the system gets a data point before LoS kicks in and notices that there was missing data (NULL values) between this new data point and the last data point it got, it will consider that a gap and fill it with whatever you’ve configured for Gap Fill.

If there is no setting for Gap Fill, it will leave those data points empty (or NULL) and will not be able to evaluate them.

If the missing data lasts long enough for the LoS action to kick in, Gap Fill is ignored and LoS takes over.

The only time Gap Fill is activated is when a data point is received after there were some aggregation windows with missing data. The system then retroactively fills those data buckets and does evaluation on them.

An example that I came up with on the fly would be a signal that you know comes in once every 5 minutes. The data points are always separated by 4 minutes of no data, but you want to use a “for at least” threshold. So you set up gap filling to use the last reported value and fill gaps with that, and then a LoS setting of 5 minutes, since you know that in normal operation, you should never have more than 4 minutes of no data. In this way, Gap Fill and LoS work hand-in-hand to fill in data gaps when they’re expected but to treat the gap as a loss of signal if it goes on longer than expected.
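The once-every-5-minutes example can be sketched in Python as follows. This is only an illustration of the behavior described above, with whole minutes standing in for aggregation windows; the function name `simulate` and the sample values are assumptions, and the sketch stops at the first loss of signal rather than modeling what happens when data resumes.

```python
def simulate(arrivals, total_minutes, expiration_minutes):
    """Replay a sparse signal minute by minute.

    `arrivals` maps minute -> value. When a point arrives, the empty
    windows since the previous point are filled retroactively with the
    last reported value. If no point arrives for `expiration_minutes`,
    loss of signal takes over and the trailing gap stays unfilled.

    Returns (windows, minute at which the signal was declared lost or None).
    """
    windows = [None] * total_minutes
    last_minute = None
    for minute in range(total_minutes):
        if minute in arrivals:
            if last_minute is not None:
                # Retroactive gap fill using the last reported value.
                for gap in range(last_minute + 1, minute):
                    windows[gap] = windows[last_minute]
            windows[minute] = arrivals[minute]
            last_minute = minute
        elif last_minute is not None and minute - last_minute >= expiration_minutes:
            return windows, minute  # loss of signal: gap fill is ignored
    return windows, None

# Points at minutes 0, 5, and 10, then the signal stops.
windows, lost_at = simulate({0: 3.0, 5: 4.0, 10: 2.0},
                            total_minutes=20, expiration_minutes=5)
```

With these inputs the 4-minute gaps between points are filled with the previous value, and the signal is declared lost at minute 15, five minutes after the last point.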

Does this help to understand?

@bgoleno

We have an NRQL alert based on a count(*) query on the Transaction table. There are transactions for lots of applications across our company in this table. I believe that, if transactions from the application we are monitoring disappear, this will not count as ‘loss of signal’ (assuming other applications continue to generate transactions), the query will still return zero and the alert will continue to function as it does today. Is that right?

Query is SELECT uniqueCount(host) FROM Transaction WHERE appName='...' AND request.uri = '...' AND httpResponseCode='200'

I am wondering the same thing

Please give a light noogie to whoever’s idea it was to make this breaking change to uptime alerts less than a week before a major US Presidential election. :frowning:

@ggreer I can understand your frustration but must state that “flogging” or the suggestion of it is not something permitted in this community. I will be reaching out to our product team to see if they have anything further to add in response to the additional scenario that has been shared. Thanks :slight_smile:

Tongue in cheek, of course. :slight_smile:

@ggreer , I have arrived for my noogie. :pensive:
I apologize for any inconvenience this may have caused. I hope you do find the new implementation of uptime detection / loss of signal detection more configurable, and more reliable. If your account has yet to be enabled, and you have timing concerns, please talk with your account team immediately, and have them coordinate a call so that we can discuss your specific situation. Your success is our primary concern.

With regard to timing, this is a long-running architectural upgrade, delivering increased performance and reliability, that we had to launch before this holiday season. In the first week of October, we launched this to all accounts that did not have a noticeable level of NRQL Conditions that may have been configured to watch for entities going offline. For the remaining accounts, which may have crafted NRQL Conditions to trigger when the signal stopped, we held their enablement back until the end of the month to give extra time to accommodate. Assuming that you are in the news/media business, we want to be considerate of your holiday-season equivalent. Please reach out to your account team, or if need be, you may contact me directly at bgoleno@newrelic.com.
