Up and Down Notifications

Hi,
I’m coming over from StatusCake, where if a site goes down we receive a notification that it’s down, and when it’s back up we get another notification. (We don’t want a notification for every check once we’ve already been told that it’s down.)

We’re tracking around 50 sites right now – what is the policy setting we should use so that for each site, if it goes down we get one notice, and then another notice when it’s back up and running?

Thank you for your help!


Hi, @davidw1: In New Relic Alerts, notifications are associated with an alert policy. You can think of an alert policy as a container for one or more alert conditions, and zero or more notification channels.

In the simplest configuration, you might have a single alert condition (“Synthetic check has failed”) and a single notification channel (an email address). In that configuration, New Relic will automatically send a notification when the condition is violated (the site is down), and another when the violation is closed (the site comes back up). If someone clicks a button to acknowledge the alert incident, New Relic will send an additional notification to let you know that someone has acknowledged the incident.

Things get more complex when your alert policy contains multiple conditions: Do you want to be notified when each condition is violated, or only the first one? You may use the incident preference setting to choose the desired behavior.

A condition may also contain two thresholds: warning and critical. Only violations of the critical threshold will create an alert incident and potentially send a notification.

Please let us know if this helps, or if you have additional questions.


To build on what @pweber has said, I’d like to highlight that Alerts sends notifications for exactly three events:

  • When an incident is created
  • When an incident is acknowledged
  • When an incident is closed.

For your use case, I would start with creating a policy to contain all of my Synthetics conditions. You can have up to 500 conditions per policy so this should more than meet your needs. If you set your incident preference to be by condition you’ll get an incident created for each condition that violates. Since Synthetics conditions target exactly one monitor, this will create an incident whenever your monitor enters into a FAILURE state and send a notification. Once the monitor recovers, the incident will close and send a notification to that effect.

Some customers have found that Synthetics conditions can be noisy. Because of the nature of network traffic, a site may be temporarily unavailable from a subset of our Synthetics monitor locations while still being up to the rest of the world. To help alleviate this noise, you are able to create conditions that will only violate when a configurable number of locations are failing simultaneously. I would highly recommend this condition type if you are using multiple locations to keep track of your uptime!
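As a rough illustration of the multi-location idea above, here is a minimal Python sketch. It is not New Relic's implementation; the function name, the `min_failing_locations` parameter, and the location names are all hypothetical, chosen only to show the "violate only when N locations fail at once" logic.

```python
from typing import Dict


def should_violate(location_results: Dict[str, bool], min_failing_locations: int) -> bool:
    """Return True when at least `min_failing_locations` locations failed their latest check.

    `location_results` maps a (hypothetical) location name to True if the
    check passed from that location and False if it failed.
    """
    failing = sum(1 for passed in location_results.values() if not passed)
    return failing >= min_failing_locations


# Latest check result per location: two of the four locations are failing.
latest = {"us-east": False, "us-west": True, "eu-west": False, "ap-south": True}

print(should_violate(latest, min_failing_locations=2))  # True
print(should_violate(latest, min_failing_locations=3))  # False
```

With a threshold of 1, any single location failing would count as a violation, which is where the noise tends to come from.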

If you’re using something other than a Synthetics monitor to determine a site is down, then you will need to create conditions for those checks. Examples include Host Not Reporting conditions that are part of our Infrastructure product.


Thank you both for your help!

Let us know if there’s anything else we can do to help or clarify for you :smiley:

OK I think I got it:

I have an alert policy that I called “Website Down for More than 10 Minutes” (because I have Synthetics set to check every 10 min).

Under that policy heading I have 40 sites (and 40 conditions) – and each condition is “SYNTHETICS MONITOR FAILURE” (and I made the name on them all “check failure”).

I have the policy set to “by condition and entity” because I want to see every time an individual entity fails the condition (site 1, site 2, site 3), etc. Is this the correct way to do it?

What I really want to say is, “Hey, Synthetics – here’s a group of 40 sites that are all set with 10 minute checks. If one of them fails the check, please fire an alert, but only fire one alert while the site is down. If it’s down for 40 minutes, don’t fire 4 alerts. Just alert me when the next check is up.”

Thank you!!!

That should work out just fine!

You’ll get an incident for each condition that violates, and each of those incidents will have its own set of notifications.

So potential for 40 notifications at once (in an absolute worst case scenario).

Each condition will stay in violation until the Synthetics check is passing again. Once it is passing, the violation should auto-resolve, which closes the incident and sends you an Incident Closed notification.
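To make that lifecycle concrete, here is a small Python sketch of the behavior described above for a single condition: one notification when the incident opens, none for repeated failures, and one when it closes. This is a simplified model under stated assumptions, not how Alerts is actually implemented; the function name and the notification strings are made up for illustration, and acknowledgement notifications are left out.

```python
from typing import Iterable, List


def notifications_for(check_results: Iterable[bool]) -> List[str]:
    """Map a sequence of check results (True = passed) to the notifications sent.

    Only state changes produce a notification: the first failure opens an
    incident, later failures are silent, and the first pass closes it.
    """
    sent: List[str] = []
    incident_open = False
    for passed in check_results:
        if not passed and not incident_open:
            incident_open = True
            sent.append("incident opened")
        elif passed and incident_open:
            incident_open = False
            sent.append("incident closed")
    return sent


# Checks every 10 minutes: the site fails four checks in a row, then recovers.
checks = [True, False, False, False, False, True]
print(notifications_for(checks))  # ['incident opened', 'incident closed']
```

So a site that is down for 40 minutes produces two notifications in this model, not four.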

“If one of them fails the check, please fire an alert, but only fire one alert while the site is down. If it’s down for 40 minutes, don’t fire 4 alerts. Just alert me when the next check is up.”

With this setup you may get 4 notifications, but only if 4 sites are down, so 4 conditions failing. But I think that’s what you’re aiming for, right?

Thank you. So here’s a live incident.
I had a check fail from two out of four locations.
And I received a total of 4 emails – one for each failure, and one for each resolution. :slight_smile: – which is what I wanted to do.

Does this mean that the site was down for 3 hours? That is, that further checks over the course of time also failed until the one finally passed?

Thx

And does the ID shown correlate to any other process? That is, I want to see if something happened on the server that caused the problem (out of memory, etc.) or if it was something local to the site.

Thx

Without looking at your monitor, it looks like the site was down for three hours. Or, more explicitly, it was unreachable from at least one of the locations your checks originate from. It’s a bit nuanced, and we can go deeper there if it will be helpful, but let’s make sure that’s correct first. I would be happy to take a look if you paste a link to the monitor and the condition we have been discussing (don’t worry, only you and NR admins can use it).

The ID shown here doesn’t correspond to anything else. It is unique to your account (there is only one incident with that ID for your account, ever) but beyond helping you refer to the incident in question, it is not used anywhere outside of Alerts.

One approach you could take, and one that I have used to great success myself when monitoring New Relic services with New Relic, is to shift your thinking from organizing your conditions by type and towards organizing them by application.

When I do this, I like to create a policy for the application in question and start with the end user experience, thinking of ways it could degrade and choosing what to monitor by working my way back to bare metal. Your Synthetics check is a great example of this! Another could be using the Browser agent and monitoring page load timing. However you want to measure a degraded user experience is up to you, and we have a few options.

So then I ask “what would cause that?” Well maybe a slow transaction or a high error rate. So we can then turn to APM and create a condition that monitors that. And of course we continue backwards through the stack. A high error rate could be the result of running out of memory and entering GC death spirals or having too much CPU used. So we have things like JVM health metric conditions or the Infrastructure agent.

By putting these conditions all in one policy, and having the incident preference be “by policy” you would get one incident that all of the violations for every condition in that policy would roll up into. It can really help keep the context grouped together, if what you’re wanting is something to tell you, in one place, everything that violated for that service during the same time period.
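One way to picture the difference between the incident preferences discussed in this thread is as a grouping key for violations. The Python sketch below is only a mental model, not New Relic code: the function name is hypothetical, the three preference labels are illustrative strings standing in for “by policy”, “by condition”, and “by condition and entity”, and the condition and site names are made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Violation = Tuple[str, str]  # (condition name, entity name)


def group_into_incidents(
    violations: List[Violation], preference: str
) -> Dict[tuple, List[Violation]]:
    """Group the violations of one policy into incidents by incident preference."""
    incidents: Dict[tuple, List[Violation]] = defaultdict(list)
    for condition, entity in violations:
        if preference == "PER_POLICY":
            key = ("policy",)  # everything rolls up into one incident
        elif preference == "PER_CONDITION":
            key = (condition,)  # one incident per violating condition
        elif preference == "PER_CONDITION_AND_TARGET":
            key = (condition, entity)  # one incident per condition/entity pair
        else:
            raise ValueError(f"unknown preference: {preference}")
        incidents[key].append((condition, entity))
    return dict(incidents)


violations = [
    ("check failure", "site-1"),
    ("high error rate", "site-1"),
    ("high error rate", "site-2"),
]

for pref in ("PER_POLICY", "PER_CONDITION", "PER_CONDITION_AND_TARGET"):
    print(pref, len(group_into_incidents(violations, pref)))
# PER_POLICY 1
# PER_CONDITION 2
# PER_CONDITION_AND_TARGET 3
```

Since notifications go out per incident, fewer incidents means fewer notifications but more context in each one.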

Both of these approaches are really valid and different use cases benefit from one or the other (or a hybrid of them!) in different ways. It might be useful to experiment with them, and also tweak it in a way that feels like it works for what you care about. If you do make some changes to your approach I would love to hear back from you, with questions or tales of success.

@parrott thank you so much for the detailed reply! In truth, my monitoring is really just to make sure that there isn’t a huge outage (if the whole box goes down) – and/or to see if there’s a site as in this case that isn’t responding.

And yes, I actually started to work backwards from the timeframe in question for the individual website/application and I was planning on posting about that tomorrow as I try to connect the dots to see what happened.

I’ve watched many of the APM videos, but I still need practice putting the data together to make a diagnosis.


Looks like you got that posted on diagnosing a problem in NR.

I’m sure we’ll have plenty of other explorers interested in adding their thoughts on diagnosing problems in NR over there. And we’ll see what we can add too. :smiley:

You’re very welcome. I love helping people use the things I build :smiley: I appreciate you asking some great questions here so that others can also benefit from the conversation.
