Announcing: New Relic One Streaming Alerts for NRQL conditions

  • Alerts
    • Brian Goleno: Product Manager: Alerts

      September 21, 2020 at 3:40 AM

      UPDATE 11/20/2020

      If you are migrating your NRQL alert conditions from the old system to the new Streaming Alerts Platform, please be aware that this open-source tool is available to help ease the toil involved with that!

      -------

      ***** UPDATE 11/5/2020 *****

      As of 10/30, nearly all accounts have been enabled for New Relic One Streaming Alerts. A very small number of accounts have not yet been enabled; those accounts will see a banner that says "opt in". If you see that banner, talk with your New Relic admin about the agreed-upon timeline.

      -------

      New Relic is rolling out a new, unified streaming alerts platform for New Relic One. This new streaming alerts platform will power NRQL Alert Conditions, and over the next year, all alert condition types will be consolidated into NRQL conditions.

      New Relic One Streaming Alerts delivers:

      • More reliable alerting that is far less susceptible to data latency and processing lag
      • Increased accuracy of the data points being evaluated
      • Reduced time-to-detect through improvements in the streaming algorithm and a configurable aggregation duration
      • Greater control over the signals being monitored: you can specify how to evaluate signal gaps, when to consider a signal lost, and what actions should be taken
      • Consistent behavior and configuration of alert conditions regardless of the telemetry type, the source of the signal being monitored, or the specifics of your NRQL query
      • Increased scalability in the number of time series that an alert condition can monitor and in the total number of conditions that can be configured

      Opt-in Migration

      When we roll out this new streaming platform, there is a change in how aggregation time windows that have no data are processed. If you are monitoring for a signal going to "0" in order to determine whether an entity has stopped reporting, that approach will no longer work after moving to the new platform. To maintain this functionality, you must enable Loss of Signal detection on these conditions before your account is moved, in order to prevent false negatives. You may opt in to this new platform now. Read more about the rollout plan in the FAQ section below.
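      To illustrate the migration, here is a minimal sketch using the New Relic Terraform provider's newrelic_nrql_alert_condition resource. The account ID, policy reference, name, and query below are placeholders, and attribute names may vary by provider version; the point is simply that loss of signal detection replaces the old "alert when the count equals 0" pattern.

      resource "newrelic_nrql_alert_condition" "host_not_reporting" {
        # Hypothetical example: IDs, names, and the query are placeholders.
        account_id = 1234567
        policy_id  = newrelic_alert_policy.example.id
        type       = "static"
        name       = "Host stopped reporting"

        nrql {
          query = "SELECT count(*) FROM SystemSample WHERE hostname = 'my-host'"
        }

        # On the streaming platform, empty aggregation windows are no longer evaluated
        # as 0, so these expiration settings are what actually catch a host that
        # stops reporting.
        expiration_duration            = 300   # seconds to wait after the last data point
        open_violation_on_expiration   = true  # open a violation when the signal is lost
        close_violations_on_expiration = false

        critical {
          # The old "Equals 0" threshold can remain for continuity, but it will no
          # longer fire on empty windows by itself.
          operator              = "equals"
          threshold             = 0
          threshold_duration    = 300
          threshold_occurrences = "ALL"
        }
      }

      The same settings are exposed in the UI and in the GraphQL and REST APIs mentioned in the FAQ below.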

      Increased Reliability and Accuracy

      This new streaming platform upgrades the streaming algorithm to an event-based mechanism that uses the incoming data points to move the streaming aggregation windows forward. The current model uses the server clock to trigger aggregation. With the new approach, an aggregation window waits until the related data points arrive, greatly reducing the negative effects of lag in a data stream. This will also greatly reduce alert latency and improve accuracy for cloud integrations that use a polling-based integration.

      Configurable Gap Filling Strategies

      Not all signals or time series that are being monitored have a consistent flow of data points. The streaming alerts platform evaluates time windows of a specified duration. In many cases, the telemetry signals you send to New Relic will have gaps, meaning that some time windows will not have data. With the new streaming platform, you can specify how we should evaluate those gaps. You can also set different gap filling strategies, sometimes called extrapolation strategies, for each alert condition.
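      As a rough sketch of what this looks like in configuration (shown here with the Terraform provider's newrelic_nrql_alert_condition resource; treat the attribute names as approximate and check them against the current provider documentation), the gap filling strategy is a per-condition setting:

      # Inside a newrelic_nrql_alert_condition resource (other required attributes omitted):

      # Treat empty aggregation windows as a fixed value (for example, 0)
      fill_option = "static"
      fill_value  = 0

      # ...or carry the last received value forward into the empty windows
      # fill_option = "last_value"

      # ...or leave gaps unevaluated (the default)
      # fill_option = "none"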

      Loss Of Signal Detection

      The NR One Streaming Alerts Platform now provides official support for Loss of Signal Detection. While there are workarounds to achieve this in the current platform, they are inconsistent, and the shift to an event-based streaming algorithm disables that workaround. With configurable Loss of Signal Detection on any NRQL Alert Condition, you simply specify how many seconds we should wait after the last data point before considering the signal lost. Once that time expires, you can choose to be notified of the loss of signal, or you can simply close any open violations if you expect the entity or signal to go away.

      Faster alerts (Sub-minute time-to-detect)

      With the NR One Streaming Alert Platform, all telemetry data can be evaluated in sub-minute timeframes. We will allow you to configure the aggregation duration down to as low as 5 seconds and up to a maximum of 15 minutes. This, combined with the benefits of the event-driven streaming algorithm, will allow you to achieve sub-minute time-to-detect while increasing both accuracy and reliability. Depending on your data configuration and the requirements of your scenario, you can achieve a time-to-detect as low as 10-15 seconds.
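      In the Terraform provider's newrelic_nrql_alert_condition resource this maps to an aggregation window setting; a hedged sketch (names and accepted limits are approximate, so check the provider documentation for your version):

      # Inside a newrelic_nrql_alert_condition resource (other required attributes omitted):
      aggregation_window = 30   # seconds per window; the post says this can range from 5 up to 900 (15 minutes)

      nrql {
        query = "SELECT count(*) FROM Transaction WHERE appName = 'my-app'"
      }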

      ------- Frequently Asked Questions -------

      Q: When is this available?

      A: THIS IS NOW ENABLED ACROSS ALL ACCOUNTS

      Q: How do I request to have our account(s) enabled?

      A: If your account is not yet enabled, please talk with whoever manages your New Relic account in your organization.

      Q: Is there any documentation?

      A: Yes. An overview of Loss of Signal and Gap Filling Strategies, along with how to configure them in GraphQL, is documented here: https://docs.newrelic.com/docs/alerts-applied-intelligence/new-relic-alerts/alert-conditions/create-nrql-alert-conditions

      Q: How do I manage these features?

      A: You can configure these features on NRQL Conditions using the UI, GraphQL API for NRQL Conditions, and the REST API for NRQL Conditions.

      Q: Can I configure these settings before having the new streaming platform enabled?

      A: Yes. If you are opting in before 10/5, we can enable the UI for you before you enable the account. This will allow you to update your NRQL conditions, if needed, before the features are enabled. After the week of 10/5, all accounts will have access to the UI and APIs. If your account is not enabled during that week, you can use the UI and API to update any alert conditions before having these new features enabled.

      Q: Will the NR One Streaming Alerts Platform cover all alerting services?

      A: Only NRQL Conditions will receive the full set of New Relic One Streaming Alerts functionality. APM, Infrastructure, and Synthetics alerts will be migrated to NRQL Conditions over the course of the year.

      Q: Are all of the features mentioned above available?

      A: Gap Filling, Signal Loss Detection, and configurable aggregation duration are available now. The event-based streaming algorithm will be released later in the year.

      Q: Will this eliminate false positives?

      A: No, but it should greatly reduce false positives. Eliminating false positives and false negatives is an audacious goal that every alerting engine continually works toward, and one we continue to pursue.

      Additionally, Loss of Signal Detection is monitoring for the absence of data for a period of time. Whenever clock time is involved, there is a higher chance of false positives when there is significant disruption to the flow of data. If there is known latency within the New Relic platform, we take that into consideration, but that does not address all possible signal disruptions between the data collection source and the New Relic One Streaming Alert Platform.

      Q: I have more questions. How can I get answers?

      A: Please reach out to your account teams if you have questions or concerns.

      Alternatively, you can ask questions in the discussion area below, and a New Relic community leader will answer. For a deeper dive into what is new, and how to best use these new features, sign up for New Relic Nerd Days on October 13, and check out the Alerts session at 2:00 PM PST. I will share the recording here afterwards.

    • 6MM

      September 21, 2020 at 3:18 PM

      Nice. Is it correct to assume that there is also an agent version dependency to obtain Sub-minute time-to-detect functionality?

    • Brian Goleno: Product Manager: Alerts

      September 21, 2020 at 4:25 PM

      Colonel Steve Austin,

      Yes, for true end-to-end sub-minute time-to-detect, you will need an agent (whether a New Relic agent or an open-source agent) that has a short flush cycle.

    • 6MM

      September 21, 2020 at 4:47 PM

      @bgoleno

      If I’m not mistaken, the minimum New Relic agent versions required for this functionality are as follows:

      [Attachment]

      I think it would be helpful to note the dependency, as customers interested in making use of the new functionality may also need to upgrade agents.

    • Brian Goleno: Product Manager: Alerts

      September 26, 2020 at 6:06 PM

      We released the REST API controls for these new functions on 09/25/2020.

      If you are using the REST API for NRQL Conditions to manage your alerts, PLEASE update your scripts with the new functions.

    • shaun.titus

      October 5, 2020 at 11:21 PM

      I noticed that evaluationOffset is being moved under the ‘signal’ container. Does this mean there’s a major version bump coming to the NerdGraph APIs? This will definitely be a breaking change for me.

    • Brian Goleno: Product Manager: Alerts

      October 5, 2020 at 11:56 PM

      Fear not, @shaun.titus. It exists in both places for backwards compatibility. The field you are using is marked as deprecated but is still supported. When you make future updates, for example to support the new loss of signal and gap filling strategies, you may wish to move to the new location.

    • Fidelicatessen (Associate Product Manager)

      October 12, 2020 at 9:42 PM

      Hi @Tiago.DeConto

      I answered this question in your support ticket as well, but I wanted to post it here so that other community members can benefit.

      In that example, 120 seconds is the expiration duration. The action to take is openViolationOnExpiration. This means that, if no signal is evaluated in 120 seconds, a Loss of Signal violation will be opened. This has nothing to do with the threshold duration (600 seconds), since that determines how long the evaluation system will wait to open a violation given data that is breaching the threshold. These are two different types of violation: one opens if no data is sent for 2 minutes, the other opens if violating data is sent for 10 minutes.

      As an example, imagine the evaluation system in this case sees 8 consecutive data points below 2 (imagine a string of 0 and 1 values, breaching the threshold). Aggregation windows are 1 minute long by default, so this is 8 minutes’ worth. It’s keeping track, and is waiting to see another 2 data points below 2 before it opens a violation. Imagine then, for minute 9, no data comes in (this is considered a NULL value and can’t be numerically evaluated against a threshold). That resets the clock for the threshold timer (it needed 600 consecutive seconds of breaching data), but it starts the clock for the expiration duration timer. If one more minute passes with no data, a Signal Lost violation will open.

      This is also explained in these docs, with this passage:

      Loss of signal expiration time is independent of the threshold duration, and triggers as soon as the timer expires.

      I hope this helps clarify how the expiration duration and threshold duration work together.
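      To make the two independent timers concrete, here is a hedged Terraform sketch using the values from the example above (assuming the newrelic_nrql_alert_condition resource; attribute names may differ slightly by provider version):

      # Inside a newrelic_nrql_alert_condition resource (other required attributes omitted):
      critical {
        operator              = "below"
        threshold             = 2
        threshold_duration    = 600   # breaching data must persist for 10 minutes
        threshold_occurrences = "ALL"
      }

      # Independent loss-of-signal timer:
      expiration_duration          = 120   # 2 minutes with no data at all
      open_violation_on_expiration = true  # opens a "Signal Lost" violation when it expires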

    • cgambrell

      October 14, 2020 at 8:26 PM

      Great info @Fidelicatessen. Thanks for posting your answer. Can you add how filling data gaps fits into this example?

    • Fidelicatessen (Associate Product Manager)

      October 15, 2020 at 2:50 PM

      @cgambrell

      Absolutely!

      If the system gets a data point before LoS kicks in and notices that there was missing data (NULL values) between this new data point and the last data point it got, it will consider that a gap and fill it with whatever you’ve configured for Gap Fill.

      If there is no setting for Gap Fill, it will leave those data points empty (or NULL) and will not be able to evaluate them.

      If the missing data lasts long enough for the LoS action to kick in, Gap Fill is ignored and LoS takes over.

      The only time Gap Fill is activated is when a data point is received after there were some aggregation windows with missing data. The system then retroactively fills those data buckets and evaluates them.

      An example that I came up with on the fly would be a signal that you know comes in once every 5 minutes. The data points are always separated by 4 minutes of no data, but you want to use a "for at least" threshold. So you set up gap filling to use the last reported value and fill gaps with that, and then a LoS setting of 5 minutes, since you know that in normal operation you should never have more than 4 minutes of no data. In this way, Gap Fill and LoS work hand-in-hand to fill in data gaps when they’re expected, but to treat the gap as a loss of signal if it goes on longer than expected.
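      A sketch of that setup, expressed with the Terraform provider’s newrelic_nrql_alert_condition attributes (names approximate, shown only to make the numbers concrete):

      # Inside a newrelic_nrql_alert_condition resource (other required attributes omitted):

      # Signal normally arrives every 5 minutes, so 4-minute gaps are expected.
      fill_option = "last_value"          # reuse the last reported value for expected gaps

      expiration_duration          = 300  # anything longer than 5 minutes is treated as a real outage
      open_violation_on_expiration = true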

      Does this help clarify things?

    • thomas.elder

      October 16, 2020 at 10:43 AM

      @bgoleno

      We have an NRQL alert based on a count(*) query on the Transaction table. There are transactions for lots of applications across our company in this table. I believe that, if transactions from the application we are monitoring disappear, this will not count as ‘loss of signal’ (assuming other applications continue to generate transactions); the query will still return zero and the alert will continue to function as it does today. Is that right?

      Query is

      SELECT uniqueCount(host) FROM Transaction WHERE appName="..." AND request.uri = "..." AND httpResponseCode="200"

    • cedric.l.dana

      October 20, 2020 at 3:07 PM

      I am wondering the same thing

    • ggreer

      October 20, 2020 at 5:07 PM

      Please give a light noogie to whoever’s idea it was to make this breaking change to uptime alerts less than a week before a major US presidential election. 😦

    • Joi C. (Program Manager - Support Forum)

      October 20, 2020 at 6:48 PM

      @ggreer I can understand your frustration but must state that "flogging" or the suggestion of it is not something permitted in this community. I will be reaching out to our product team to see if they have anything further to add in response to the additional scenario that has been shared. Thanks 🙂

    • ggreer

      October 20, 2020 at 7:31 PM

      Tongue in cheek, of course. 🙂

    • Brian Goleno: Product Manager: Alerts

      October 20, 2020 at 10:38 PM

      @ggreer, I have arrived for my noogie. 😔

      I apologize for any inconvenience this may have caused. I hope you find the new implementation of uptime detection / loss of signal detection more configurable and more reliable. If your account has not yet been enabled and you have timing concerns, please talk with your account team immediately and have them coordinate a call so that we can discuss your specific situation. Your success is our primary concern.

      With regard to timing, this is a long-running architectural upgrade, delivering increased performance and reliability, that we had to launch before this holiday season. The first week of October, we launched this to all accounts that did not have a noticeable number of NRQL Conditions that may have been configured to watch for entities going offline. For the remaining accounts, which may have crafted NRQL Conditions to trigger when a signal stops, we held enablement back until the end of the month to give extra time to accommodate. Assuming you are in the news/media business, we want to be considerate of your holiday-season equivalent. Please reach out to your account team, or if need be, you may contact me directly at bgoleno@newrelic.com.

    • thomas.elder

      October 21, 2020 at 9:45 AM

      @bgoleno

      Would you be able to look at the question I asked last week (above)?

    • Brian Goleno: Product Manager: Alerts

      October 21, 2020 at 3:47 PM

      Thomas, given the query you have shared, you are correct: as long as some data matching the WHERE clause is coming in, the loss of signal expiration timer will continue to be reset. However, if you were to add a FACET clause to that query, then N individual time series signals would be created, and each would be separately streamed and evaluated. In that case, if one of the individual time series streams stopped receiving data, that would trigger a loss of signal event.
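      As a hypothetical illustration (Terraform syntax, attribute names approximate), a faceted version of that condition would give each facet its own signal and its own loss of signal timer:

      # Inside a newrelic_nrql_alert_condition resource (other required attributes omitted):
      nrql {
        # Each appName becomes its own time series, evaluated and tracked separately.
        query = "SELECT uniqueCount(host) FROM Transaction WHERE request.uri = '...' AND httpResponseCode = '200' FACET appName"
      }

      expiration_duration          = 600
      open_violation_on_expiration = true   # fires when any individual facet stops reporting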

    • Rishav_old

      October 21, 2020 at 4:13 PM

      Does it have a way to distinguish between "signal loss" and "very low throughput"?

      In our previous ham-fisted attempt at alerting on this type of issue by monitoring the throughput, we still came across too many false positives due to low-throughput apps.

    • Brian Goleno: Product Manager: Alerts

      October 21, 2020 at 4:41 PM

      Rishav,

      Loss of Signal detection works as a "dead man's switch" that gets reset each time a data point arrives. If you have a low-throughput app, you will not want to set the loss of signal duration too low, and you may wish to use the "last value" gap filling strategy to keep the evaluated value the same until the next data point changes it. Additionally, you may try making your aggregation windows longer. This will result in aggregation windows having data more often, so fewer windows are empty in the first place.

    • thomas.elder

      October 22, 2020 at 7:04 AM

      @bgoleno

      Thanks. Can I clarify: if nothing is returned by the WHERE clause, does that count as signal lost?

    • tzafrir.kost

      October 26, 2020 at 4:04 PM

      @bgoleno Can you please clarify regarding thomas.elder’s question:

      If nothing is returned by the WHERE clause, does that count as signal lost?

      Also, if the above happened for 5 consecutive minutes, and I have a gap-filling strategy configured to fill in 0 for each such minute without data, will these still be counted as 5 minutes of lost signal?

    • Brian Goleno: Product Manager: Alerts

      October 26, 2020 at 11:39 PM

      The data points that make it past the WHERE clause are what goes to the streaming platform, and that is the "signal" that streaming alerts sees. Every time streaming alerts sees a data point, it resets the loss of signal timer. If you have a service that sends a true/false status beacon every 15 seconds, and it only sends "false" very sporadically, and your query is count(*) WHERE status = 'false', then that alert may open a violation if the count exceeds the threshold. When the status returns to 'true', there are no incoming data events to trigger the evaluation that would close the violation. You would need to use Loss of Signal to close that violation.

      If you have gap filling set to static: 0, and you have a Loss of Signal expiration duration of 10 minutes, then if there is a 6-minute gap between data points, the empty aggregation windows will be filled with "0" when that second data point comes in, before evaluation is triggered, and the data will be evaluated.

      If the Loss of Signal duration timer expires first, then we consider the signal lost, trigger whatever action you have defined, and clear out the buffers.
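      Putting that answer into configuration terms, a hedged sketch of the sparse count(*) scenario might look like this (Terraform attribute names approximate; StatusBeacon is a made-up event type used purely for illustration):

      # Inside a newrelic_nrql_alert_condition resource (other required attributes omitted):
      nrql {
        # Only "false" beacons pass the WHERE clause, so this signal is sparse by design.
        query = "SELECT count(*) FROM StatusBeacon WHERE status = 'false'"
      }

      fill_option = "static"
      fill_value  = 0      # empty windows between "false" beacons evaluate as 0

      expiration_duration            = 600    # 10 minutes without a "false" beacon
      open_violation_on_expiration   = false
      close_violations_on_expiration = true   # let loss of signal close the open violation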

    • khastwell1

      October 28, 2020 at 3:35 PM

      At what time today are the New Relic One alerts being enabled? Did it already happen?

    • Brian Goleno: Product Manager: Alerts

      October 28, 2020 at 3:59 PM

      We started the incremental rollout at 6:00 AM Pacific this morning. The complete rollout process will take about a week. When your account is enabled, you will see a banner across the top that no longer says "opt-in".

    • Rishav_old

      October 29, 2020 at 12:59 PM

      Have to say, I’m surprised by the lack of mention of the dedicated Loss of Signal Alerts Migrator app in this thread; the tool seems very pertinent to this chat and I only just came across it due to a single mention in an old email.

    • ggreer

      October 29, 2020 at 6:47 PM

      This causes a lot of headache for us, because it is incredibly confusing that a count(*) with a threshold of 1 on "edge-triggered failure events" never gets a value of 0 to reset. 😦

    • ggreer

      October 29, 2020 at 6:50 PM

      We’re having a similar issue where trying to alert on "edge-triggered failure events" (like a Kubernetes container restarting) may not work anymore. Ex:

      SELECT max(restartCount)-min(restartCount) FROM K8sContainerSample WHERE containerName = "${var.deployment}" and deploymentName = "${var.deployment}" and namespace = "${var.k8s_namespace}" facet podName

      That used to alert whenever a pod’s container changed its restartCount, but it now seems to just cause "loss of signal" constantly. We disabled the loss of signal warning on the alert, but we’re not sure whether that made the alert work again or just made it do nothing.

    • khastwell1

      October 30, 2020 at 8:16 PM

      It looks like most of my alerts were fixed by using the different gap-filling methodologies. However, all of my alert conditions are creating false positives when the NRQL condition is no longer within the time range that I have set up. For example, any alert that uses a condition like:

      hourOf(timestamp) in ("7:00","8:00","9:00","10:00","11:00","12:00")

      ends up firing an alert right at the end of the time range. In this case, I would get an alert at around 12:01. This did not happen before. Any thoughts on how to prevent this behavior? I have already experimented with different aggregation window and offset values with no success.

    • ggreer

      October 30, 2020 at 8:57 PM

      We’ve had to disable the "loss of signal" setting on some alerts, or change whether they open or close on loss of signal, to get them back to functional. In your case that could work as well, with the downside being that you wouldn’t be able to tell if you lost signal inside the window you do care about...

    • madhura.d

      November 2, 2020 at 8:22 AM

      Accounts that have NRQL Conditions that may be monitoring for loss of signal will be enabled on October 28th. These are NRQL Conditions that either use the "Less Than" operator, or have an operator and threshold of "Equals 0".

      So this says that the affected NRQL conditions are those that either use the "<" operator, or use the "=" operator with a threshold value of 0.

      What about NRQL conditions that use the ">" operator and a non-zero threshold value? Is there any difference?

    • Rishav_old

      February 21, 2021 at 6:41 PM

      NRQL alert conditions appear to be first-class citizens as far as automation goes, per the Terraform documentation:

      The newrelic_nrql_alert_condition resource is preferred for configuring alerts conditions. In most cases feature parity can be achieved with a NRQL query. Other condition types may be deprecated in the future and receive fewer product updates.

      —Terraform New Relic, Resource: newrelic_alert_condition

      I wanted to focus on the "feature parity" piece. How can I alert on baseline deviations across applications?

      For example, I can create an APM > application metric baseline alert condition for response time and select a number of entities across my account. Easy. But how to replicate that in NRQL alert condition? Closest I’ve got is:

      FROM Transaction SELECT average(duration) FACET appName
      

      However, the following error message is returned when the "Baseline" threshold type is selected:

      Enter a valid NRQL query above to see a threshold preview.

      Baseline threshold type is not applicable for faceted queries.

      That’s frustrating and inconsistent. The NRQL is valid, as proven by its output in the query builder, and the behavior is inconsistent with the experience for both the "Static" and "Outlier" threshold types, both of which function as expected.

      Any suggestions would be appreciated, or let me know if this should be addressed in a separate thread. Thanks.

    • cindyforcia

      April 17, 2021 at 8:27 AM

      @bgoleno Thanks for sharing this useful info.