Relic Solution: Patterns for Implementing Alerts Workflows

In this post, we review implementation patterns for configuring alert notifications using Workflows and Destinations. Workflows are flexible and highly configurable, so they can address the needs of a broad range of operational models and account sizes. Which pattern fits best depends on your requirements, and this post discusses some of the possible options.

First, let’s do a quick review of the Alerts & AI Ops reference model.

  • The initial configuration object is the Alert Condition. Alert conditions evaluate streaming telemetry signals to identify anomalous behavior.
  • When such behavior is detected, an Incident is opened. An incident captures the detail and state of individual symptoms of a larger problem.
  • When an incident is opened, an Issue is created. An issue is a container for a set of related incidents. The Issue collects and organizes all of the symptoms of a larger problem whose root cause needs to be determined.
  • When an Issue is activated, Workflow configurations define which Issues get routed through which Destinations in order to notify you or your downstream systems.
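The chain above can be sketched in a few lines of Python. This is only an illustrative model of the concepts, not New Relic's implementation: the class names, the `evaluate` helper, the threshold check, and the state strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Incident:
    """One symptom detected by an alert condition (hypothetical shape)."""
    condition_name: str
    entity: str
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Issue:
    """Container for related incidents that share a larger problem."""
    incidents: List[Incident] = field(default_factory=list)
    state: str = "CREATED"  # illustrative state name

    def activate(self) -> None:
        self.state = "ACTIVATED"  # illustrative state name


def evaluate(condition_name: str, entity: str,
             value: float, threshold: float) -> Optional[Incident]:
    """Hypothetical condition: open an incident when the signal exceeds the threshold."""
    if value > threshold:
        return Incident(condition_name, entity)
    return None


# A signal breaches the condition, an incident opens, and an issue collects it.
incident = evaluate("High CPU", "checkout-service", value=97.0, threshold=90.0)
issue = Issue()
if incident is not None:
    issue.incidents.append(incident)
    issue.activate()  # from here, workflows decide where the issue is routed
```

Once the issue is activated, the workflow patterns below decide which destination receives it.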

Implementation Patterns

Pattern : One workflow per account
Scenario : All Issues get routed to one downstream system, with a common schema.

If your organization sends all issues to a single downstream system, such as ServiceNow or a third-party AI Ops platform, then you may be able to implement a very simple configuration. One workflow with no filter defined will match all Issues and route them to the destination you configured to connect to ServiceNow or another system.
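As a rough illustration of why a single unfiltered workflow is enough here, the sketch below models a workflow whose filter is optional. The `Workflow` class, the dict-based issues, and the `"servicenow"` destination label are hypothetical stand-ins, not the Workflows API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Issue = Dict[str, str]  # hypothetical: an issue as a plain dict of attributes


@dataclass
class Workflow:
    name: str
    destination: str
    # None means "no filter defined" in this sketch.
    issue_filter: Optional[Callable[[Issue], bool]] = None

    def matches(self, issue: Issue) -> bool:
        # A workflow with no filter matches every issue.
        return self.issue_filter is None or self.issue_filter(issue)


catch_all = Workflow(name="all-issues", destination="servicenow")

issues = [
    {"title": "High CPU", "priority": "CRITICAL"},
    {"title": "Slow checkout", "priority": "HIGH"},
]

# Every issue is routed to the one configured destination.
routed = [(i["title"], catch_all.destination) for i in issues if catch_all.matches(i)]
```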

Pattern : Workflows filtering on a “team” tag to notify teams
Scenario : DevOps teams manage sets of services and want notifications for all of their services to go to the owning team.

For organizations operating in a DevOps model, a team builds and operates multiple services and owns many entities. The team will often have a shared rotation for who receives alerts and responds to incidents. For this approach, we recommend first ensuring that a “team” tag is added to all Incidents. With that in place, add a filter to a workflow to match on the “team” tag, and route the issue to the desired destination, such as PagerDuty or Slack.

This approach can be expanded in various ways, such as adding a check for an “environment” tag, or checking for a “priority” to route issues to the most appropriate destination. This is a flexible and efficient pattern that does not bind your notification options to your choice of how to organize your alert conditions into policies.
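To make the tag-matching idea concrete, here is a minimal sketch in Python. The `Workflow` class, the tag values, and the destination labels are hypothetical, and the sketch assumes every matching workflow sends a notification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Workflow:
    name: str
    destination: str
    required: Dict[str, str] = field(default_factory=dict)  # tag -> expected value

    def matches(self, issue_tags: Dict[str, str]) -> bool:
        # Match only when every required tag is present with the expected value.
        return all(issue_tags.get(k) == v for k, v in self.required.items())


workflows = [
    # Page the on-call rotation only for critical production issues.
    Workflow("payments-prod-critical", "pagerduty:payments",
             {"team": "payments", "environment": "production", "priority": "CRITICAL"}),
    # Everything owned by the team also goes to its chat channel.
    Workflow("payments-all", "slack:#payments-alerts", {"team": "payments"}),
    Workflow("search-all", "slack:#search-alerts", {"team": "search"}),
]


def route(issue_tags: Dict[str, str]) -> List[str]:
    """Return the destinations of all workflows whose filter matches the issue."""
    return [w.destination for w in workflows if w.matches(issue_tags)]
```

In this sketch, a critical production issue tagged `team: payments` reaches both the paging destination and the chat channel, while a staging issue for the same team reaches only the chat channel.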

Tags are added to Incidents in one of three ways: from facet attributes on a NRQL condition (aka telemetry tags), from tags on the entity that triggered the condition, or from tags associated with the Alert Condition. To ensure consistency, we recommend tagging all of your conditions with a “team” tag, or similar tag name to fit your model.
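The sketch below illustrates the three tag sources and a simple consistency check for the “team” tag. It is a hypothetical model: the function names and the dict shapes are invented for the example, and the precedence used when the same tag appears in more than one source (later sources win) is an assumption, not documented behavior.

```python
from typing import Dict, List


def incident_tags(facet_tags: Dict[str, str],
                  entity_tags: Dict[str, str],
                  condition_tags: Dict[str, str]) -> Dict[str, str]:
    """Combine the three tag sources into one set of incident tags.

    Assumption for this sketch: on a name clash, the later source overrides.
    """
    merged: Dict[str, str] = {}
    for source in (facet_tags, entity_tags, condition_tags):
        merged.update(source)
    return merged


def conditions_missing_tag(conditions: List[dict], tag: str = "team") -> List[str]:
    """List the names of conditions that do not carry the given tag."""
    return [c["name"] for c in conditions if tag not in c.get("tags", {})]


tags = incident_tags(
    facet_tags={"host": "web-01"},               # from a NRQL FACET
    entity_tags={"environment": "production"},   # from the entity
    condition_tags={"team": "payments"},         # from the alert condition
)

conditions = [
    {"name": "High CPU", "tags": {"team": "payments"}},
    {"name": "Error rate", "tags": {}},
]
untagged = conditions_missing_tag(conditions)
```

Running a check like `conditions_missing_tag` over your own condition inventory is one way to find conditions that would never match a team-based workflow filter.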

Pattern : Workflows filtering on an Alert Policy tag to notify service operators
Scenario : All alert conditions for a service are already grouped into Alert Policies, with notification destinations varied per service / policy.

For organizations that organize the alert conditions for a given service into a policy, and route their notifications based on that policy, you may choose to use a Policy tag as a filter in your workflows. This is most similar to the legacy alert model, which restricted your notification settings based on how you organized your Alert Conditions. If this model is working well for your organization, and you do not have a very large number of policies, you can continue to use it.

This is the pattern New Relic uses when we automatically upgrade legacy alert channels to workflows, because it is a mapping model common to all New Relic accounts. However, if you have a very large number of policies, and each policy’s notifications are not routed to a unique destination, then we recommend you use a tag-based pattern, similar to the one above. Policy-based patterns can run into scaling limits if you have hundreds of policies in an individual account.
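A small sketch can show why the two patterns scale differently. The policy names, team names, and destination labels below are made up, and the dict-based "workflows" are only a stand-in for real workflow configurations.

```python
# Hypothetical inventory: each policy has an owning team and a destination.
policies = {
    "Checkout - Latency": {"team": "payments", "destination": "slack:#payments-alerts"},
    "Checkout - Errors": {"team": "payments", "destination": "slack:#payments-alerts"},
    "Search - Latency": {"team": "search", "destination": "slack:#search-alerts"},
    "Search - Indexing": {"team": "search", "destination": "slack:#search-alerts"},
}

# Policy-based pattern: one workflow per policy, filtering on the policy name.
policy_workflows = [
    {"filter": {"policy": name}, "destination": p["destination"]}
    for name, p in policies.items()
]

# Tag-based pattern: one workflow per team, filtering on the "team" tag.
team_destinations = {p["team"]: p["destination"] for p in policies.values()}
tag_workflows = [
    {"filter": {"team": team}, "destination": destination}
    for team, destination in team_destinations.items()
]

# The policy-based count grows with the number of policies;
# the tag-based count grows only with the number of teams.
workflow_counts = (len(policy_workflows), len(tag_workflows))
```

With this sample inventory, four policies need four policy-based workflows but only two tag-based ones, and the gap widens as policies are added for existing teams.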

While these are recommended or common patterns, you are by no means limited to them. Workflow filters are very flexible and can be adapted to any number of implementation models.