Big News For Alerts and Applied Intelligence

All of the folks behind the scenes are hard at work on a slew of new features for Alerts and Applied Intelligence, and carefully determining which old features we will be sunsetting. Below, I’ll go through what exciting things to expect, what we’ll be end-of-lifing (EOL), and what actions you can take to prepare for the changes.

Note that this is an early announcement to allow you to plan any work you need to complete over the next 12 months. Not all the dates mentioned are locked in, but the rough timeline should help you with your planning.

Why are we doing this?

In short, we’re providing a whole new alerting lifecycle experience. The new experience provides faster time to resolution, reduced noise, and increased reliability.

Based on your feedback, we have designed an entirely new workflow for responding to incidents that will result in faster time-to-resolution with decreased noise. It involves correlating other, related anomalies with the core incident, to raise visibility of possible problem sources across your estate. Some of these changes are directly related to this new workflow.

In addition, we’re simplifying alert configuration and standardizing on one single type of condition so that we can provide deeper features and improve reliability. Some of the new changes relate to streamlining alert conditions and the process you use to create them.

Finally, we’re rolling out a brand new method of evaluation (sliding windows aggregation) that will offer more flexibility and resiliency to thresholds that have been using the “Sum of query results” option.
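
To illustrate the general idea behind sliding windows, here is a minimal Python sketch. It is not New Relic’s implementation; the function names, the per-minute data, and the threshold of 10 are assumptions made up for this example. It compares fixed, non-overlapping windows with overlapping windows that advance one data point at a time.

```python
from typing import List


def fixed_window_sums(values: List[int], window: int) -> List[int]:
    """Sum each non-overlapping block of `window` consecutive points."""
    return [
        sum(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


def sliding_window_sums(values: List[int], window: int, slide_by: int = 1) -> List[int]:
    """Sum overlapping windows, advancing `slide_by` points each time."""
    return [
        sum(values[i:i + window])
        for i in range(0, len(values) - window + 1, slide_by)
    ]


# Hypothetical data: one point per minute, with a short burst that
# straddles the boundary between two fixed 5-minute windows.
per_minute_errors = [0, 0, 0, 3, 3, 3, 3, 0, 0, 0]
threshold = 10  # hypothetical threshold

fixed = fixed_window_sums(per_minute_errors, window=5)
sliding = sliding_window_sums(per_minute_errors, window=5, slide_by=1)

print(fixed)    # [6, 6]
print(sliding)  # [6, 9, 12, 12, 9, 6]
print(any(s > threshold for s in fixed))    # False
print(any(s > threshold for s in sliding))  # True
```

In this made-up series the fixed windows split the burst in half, so neither sum crosses the threshold, while the overlapping windows see the whole burst at once.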

Each of these initiatives involves introducing new features and EOLing old ones.

What’s coming?

  • A new incident response experience (November 2021)
    • The introduction of Issues and Incidents, which will replace Incidents and Violations, respectively
    • New Issues page, which will enrich your Issues with correlated details
    • New NerdGraph API functionality
  • A new way to manage notifications (November 2021)
    • New name: Destinations
    • Increased flexibility and configurability
    • Ability to notify on Warning thresholds
    • Configurable notification content
  • Streamlining and simplifying alert types (January 2022)
    • All alert conditions will be configurable using NRQL, so there will be only one type to learn how to use
    • Integrating alert creation throughout the platform
  • Sliding window aggregation will allow for flexible averages to be evaluated (December 2021)
    • This will replace “Sum of query results” thresholds
    • This continues the movement toward giving you more control over the signals you’re monitoring with Alerts

What’s going away?

  • Current incident response workflow (replaced by the new incident response experience) (May 2022)
    • Incident page
    • Violation details page
    • Incident and Violation APIs
  • Notification channels and their relationship to policies (replaced by Destinations, a new way to manage notifications) (May 2022)
  • All non-NRQL alert condition types (replaced by NRQL alert conditions with increased functionality) (July 2022)
    • Condition type that will still be available, in addition to NRQL conditions: Synthetics multi-location condition types (single-location will be going away)
  • “Sum of query results” thresholds (replaced by sliding window aggregation) (May 2022)
  • Outlier NRQL alert condition types (not being replaced by any new feature) (January 2022)
    • Due to minimal adoption, this feature will be EOL’d
    • More details in the Outliers EOL announcement here

What must you do to prepare?

This section looks, at a high level, at the changes that need to happen before the features mentioned above are EOL’d. If you are using the UI to make these changes, more information (and potentially tools) on how to carry them out will follow in the coming weeks.

  • If you are using APIs related to Incident or Violation details, or performing actions such as acknowledging or closing incidents, you will need to adjust your scripts to use the new NerdGraph endpoints for Incidents (formerly known as Violations) and Issues (formerly known as Incidents).
    • If you’re not using APIs for these functions, nothing needs to be done to prepare for the new incident workflow
  • Create new Destinations to replace your existing notification channels, and create Workflows to define when and how the new Issues will handle notifications
  • Migrate all non-NRQL conditions (exceptions mentioned above) to NRQL alert conditions
    • We will be releasing tooling in the UI to help with this migration
  • Adjust your “Sum of query results” alert conditions to use sliding aggregation instead
    • Specific instructions on how to adjust these will be documented once the feature is released

How can you keep track of these changes?

Follow this thread – updates with further information and instructions will be posted here.

Where can you go for help?

You can reach out to your account team for any assistance. You’re also welcome to ask questions in this thread.


UPDATE!

I just changed the post up above to reflect our new plans:

Host Not Reporting (HNR) and Process Running conditions will also be going away, to be replaced by NRQL conditions. Multi-location Synthetics conditions will remain, although we are looking for potential solutions to turn them into NRQL conditions as well. Note that existing HNR and Process Running conditions will be updated to appear to users as NRQL conditions and continue working.

I will keep making updates as details change, or as we have more information about any facet of this project.


I am bummed that outlier conditions are going away. 🙁 That was a cool feature.


Thanks for the heads-up, @Fidelicatessen, appreciate the transparency with the roadmap.

  • Under “what must you do to prepare”, there are recommendations to adjust scripts to the new endpoints. Could you confirm this is heard/followed by NR’s Terraform dev team, given that large-scale enterprise-level configuration of alerts is usually managed via Terraform?

  • Regarding alerts, an easy question I’m often asked but unable to answer is “across all of our sub-accounts, how many alerts are triggered?” Are any of these developments made with multi-/sub-account functionality in mind?

  • Have to agree with @philweber, we find the Outlier type useful for identifying “stray” servers in our load-balanced datacentres. I suppose there’s an alternative NRQL condition to fill this gap?

  • Great to hear about the implementation of sliding window aggregations. It’s become a standard in several APM solutions (e.g., Splunk) and should fit in well as a more reliable baseline monitor.


Thanks, @rishav.dhar!
You covered all of my doubts.

@Fidelicatessen: please post updates as soon as possible; the deadline is very near.


@rishav.dhar

Hey thanks for your questions! I’ll answer them below.



  • Under “what must you do to prepare”, there are recommendations to adjust scripts to the new endpoints. Could you confirm this is heard/followed by NR’s Terraform dev team, given that large-scale enterprise-level configuration of alerts is usually managed via Terraform?

We’re committed to keeping Terraform as a “first class citizen,” and we don’t consider our API work done until the Terraform work is also done.


  • Regarding alerts, an easy question I’m often asked but unable to answer is “across all of our sub-accounts, how many alerts are triggered?” Are any of these developments made with multi-/sub-account functionality in mind?

Great callout! All of the new features we’re working on have multi-account functionality in mind. Incident Intelligence can already merge incidents from all of the accounts in your org, so the incident response function is designed to be multi-account from the get-go. We also plan to eventually include multi-account functionality in our incident detection design.


  • Have to agree with @philweber, we find the Outlier type useful for identifying “stray” servers in our load-balanced datacentres. I suppose there’s an alternative NRQL condition to fill this gap?

With regard to Outlier conditions, we do intend to bring on more functionality that allows you to monitor the health of the cluster as a whole, but we feel that the implementation of Outlier conditions was not sufficient to help you know which entity is behaving differently from its peer group.

Currently, the new Incident Response Workflow is designed to give you a macro-level view of your systems while also warning you when individual entities start to misbehave.



Let me know if you have more questions!


Hi @Fidelicatessen

We’ve been using NRQL alerts for our more complicated clients, mixed with Terraform and Terragrunt to create the many different alerts they need, and for the most part things are going well. However, the teams have all started asking about using Workloads to help visualise their systems, and a few have started noticing the grey squares next to all the entities.

Given the push towards NRQL alerts, have there been any thoughts on how to tie a NRQL alert to a specific entity, so the visual indicators and downstream associated UI elements work? Are there any workarounds we can do now? If not, I presume this is in your roadmap?

Eventually I found the New Relic documentation page “View entity health status and find entities without alert conditions”, which shows that NRQL isn’t generally supported in this area yet. However, the NRQL alerts for Infra hosts do sometimes go green, so I wondered what the official answer currently is.

Thanks


Hi @darren.smith2

Believe me, this is something that we here at New Relic want as a top priority. We have been actively working on this, and although we have been stymied by technical issues we did not foresee, this should be available very soon!

This is especially critical, since we are going to be converting entirely over to NRQL conditions (you can find the relevant announcement at this link). We really need those NRQL conditions to recognize which entities they are covering!


Hi @darren.smith2

This was rolled out at the end of last week. You should have this functionality now, so long as your alert condition’s NRQL query scopes to specific, discrete entities (for example, using FACET hostname or FACET appId would scope to a specific, discrete entity).

Hi @Fidelicatessen, I’ve been looking forward to this implementation for a long time, so I would really appreciate it if you could clarify its scope:

A vast number of our hundreds of APM apps are still reporting status unknown (grey), despite being covered by a NRQL alert condition, much like before. Looking closer at their APM → Alert Conditions pages revealed that there are no NRQL conditions listed. As such, could you confirm that FACET appName is included in the entity health indicator scope? Are there potential limitations involved, such as a minimum agent version?

We use FACET appName instead of appId so that the incident responder has a better idea of what the alert is about, rather than having to decipher the appId. Our main account ID is 121622, to which the ~50-odd sub-accounts report their activity.

Thanks for your time.


Hi @rishav.dhar

As far as I know, appName should work fine to narrow the scope down to a particular entity.

This really sounds like a malfunction. I would recommend opening a support ticket on this, since that is the quickest path to get it in front of the engineering team who owns this functionality.

Until then, I would suggest trying out a combo facet: FACET appName, appId. In case there are duplicate application names, that explicitly narrows the scope down to a single application, since appId values are unique. This may not work any better than using only FACET appName, but it would be a valuable detail to include in the support ticket.


We are using NRQL queries to retrieve IssuesActivated and IssuesClosed across the accounts, and we also run NRQL queries on “Incidents” within a sub-account. We do this in our self-healing scripts. Will we be impacted?
Thanks,
Sagar

@sagar.thirumala

Since almost all of Alerts is affected, if you’re using NRQL alerts you will likely be affected by at least one part of these changes. However, only one of these changes involves changing the queries in your alert conditions, and that’s only if you’re using “Sum of query results” thresholds (we’re introducing Sliding Windows Aggregation to replace those).

Beyond that, the changes mostly involve how we present incidents, how we manage notification channels, and the removal of all non-NRQL alert conditions.


Regarding the Process Running condition, would you please share any info on how to use a NRQL condition to replace it?


Hi @mtsou

Regarding the Process Running condition, would you please share any info on how to use a NRQL condition to replace it?

Sure!

Imagine I have a process (with a display name of processA) and I want to make sure it’s running on my server named my-favorite-server. I would use a query like this in my NRQL alert condition:

SELECT count(*) FROM ProcessSample WHERE processDisplayName = 'processA' AND hostname = 'my-favorite-server'

If the count drops to 0 for some period of time, it indicates that the process has stopped on the host.

In practice, though, the count(*) function usually won’t return 0 values (take a look at this article, where I explain why that is). Therefore, to finish, I would set up a Loss of Signal setting to open a new violation at some point (that’s up to you), maybe after 10 minutes of no signal.

This is a very simple example, but you can expand on this to cover whatever you need.
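
The Loss of Signal idea above can be sketched in Python. This is only an illustration of the logic, not how New Relic evaluates conditions; the function name, the per-minute dictionary, and the 10-minute expiration are assumptions for the example. The point it tries to show is that once the process stops there are no ProcessSample rows left to count, so minutes go missing instead of reporting 0.

```python
from typing import Dict, Optional


def first_loss_of_signal(
    counts_by_minute: Dict[int, int],
    start: int,
    end: int,
    expiration_minutes: int,
) -> Optional[int]:
    """Return the first minute at which no data point has arrived for
    `expiration_minutes` consecutive minutes, or None if that never happens.

    Assumes (for illustration) that a signal was present just before `start`.
    """
    last_seen = start - 1
    for minute in range(start, end + 1):
        if minute in counts_by_minute:
            last_seen = minute  # a data point arrived, so reset the timer
        elif minute - last_seen >= expiration_minutes:
            return minute  # signal has been missing for long enough
    return None


# Hypothetical per-minute count(*) results. The process stops after
# minute 4, so later minutes are absent from the data rather than 0.
counts = {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}

print(first_loss_of_signal(counts, start=0, end=30, expiration_minutes=10))  # 14
```

A threshold such as "count below 1" would never fire here, because no data point ever carries the value 0; only the absence of data reveals the problem.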


Hey everyone!

I just posted a more focused announcement specifically about Outlier alert conditions. Please take a look at this link!


@Fidelicatessen Thanks for the additional details. I have a specific question about AIOps queries. We have queries like “SELECT title, issueId from IssueActivated …” on the master account. Will these queries be impacted?


Awesome @Fidelicatessen - just got back from vacation to see lots of shiny green boxes across our many client accounts - please pass our thanks to your teams, awesome job!


@sagar.thirumala

Note that this list of new features and EOLs pertains only to Alerts and Applied Intelligence. The query you shared is not one that would work in a NRQL alert condition, since it has no aggregator (e.g. average, min, max).

Since this is not a query that is used in a NRQL alert condition, none of the changes listed in the original post should apply to it.
