Your data. Anywhere you go.

New Relic for iOS or Android


Download on the App Store    Android App on Google play


New Relic Insights App for iOS


Download on the App Store


Learn more

Close icon

Relic Solution: Understanding the timeline of an Alert incident

levelup

#1

Introduction

When I first started training on Alerts, I had a hard time interpreting the various entries and timestamps on the incident logs. Creating a timeline for myself to understand the incident flow helped a lot - and I thought some of our customers might find it helpful too!

The scenario

Let’s say that you have one cat, his name is Florin. But you also have a cat door, and there’s a neighborhood cat, Guilder, who keeps sneaking in when you forget to lock it at night. He’s harassing Florin and stealing his treats, and this unwanted behavior is costing you a fortune in cat treats. You want to be notified if the little sneak gets in, and to do this you’re going to set up an alert condition that monitors the HouseCat count and notifies you if the count exceeds the expected value of 1.

To do this, you create an Alert policy, and you add a NRQL alert condition using this query:

SELECT count(*) from HouseCat

You don’t care too much if Guilder gets in as long as he leaves again right away and doesn’t hang around starting fights and raiding the food supply. So you decide that you only need to create a violation if he’s in the house for at least 10 minutes. So you set your threshold to when the query returns a value above 1 for at least 10 minutes:

(Note that I’m not using an evaluation offset for this scenario to keep the example simple. If you’re interested in looking more at that, check out this other post by a fellow support Relic: Demystifying the timestamp difference between violations and incidents)

The timeline

Your new alert policy gets tested out the very first night when you forget to lock the cat door and Guilder breaks in at 2:05 a.m. Let’s step through what that timeline is going to look like:

  • 2:05 a.m. The HouseCat count is now at 2, meaning that your threshold has been exceeded. No violation has been created yet though, because your threshold specifies that the HouseCat count needs to remain above 1 for at least 10 minutes in order to create a violation.

  • 2:10 a.m. Guilder is still in the house, HouseCat count still at 2.

  • 2:15 a.m. Guilder’s still here and has started trying to get the lid off the cat treats. More importantly, because we now have 10 minutes of HouseCat count > 1, we have a violation and an incident is opened.

VIOLATION OPEN 2:15 A.M.

INCIDENT OPEN 2:15 A.M.

It takes a few minutes for you to fully wake up and deal with this nonsense. You throw Guilder out, lock the cat door, and go back to bed.

  • 2:25 a.m. The HouseCat count = 1 again for the first time since the incident began.

So, the incident is over now… right? Not quite.

Just as we waited for 10 minutes of HouseCat count > 1 in order to open a violation, we’re need to see the exact reverse in order to consider this incident over. In this case, that means we need to see the HouseCat count back at 1 for at least for 10 minutes. This is called the recovery period and it begins as soon as the count returned to the acceptable level.

This is by design. Remember, we can’t see what happened - we don’t know if Guilder just left for a minute to invite in a few friends or if you actually addressed the root cause (the unlocked cat door). So, just to be safe, we’re waiting it out.

  • 2:25 a.m.: Recovery period begins

Assuming that you did your job on that cat door, we keep waiting…

  • 2:30 a.m. HouseCatcount = 1

  • 2:35 a.m. HouseCat count = 1

  • 2:35 a.m. Recovery period is over

When you wake up the next morning and are reviewing the incident notifications and logs over your morning coffee, you’ll see these entries in the incident log:

VIOLATION CLOSE 2:25 A.M.

INCIDENT CLOSED 2:35 A.M.

So why are there two different time stamps? Shouldn’t the violation and incident have closed at the same time? Yes and no. When the incident closed, so did the violation. But the timestamp you see in the log for the violation’s end has been backdated. It represents the beginning of the recovery period, when the activity first dropped back down below the threshold.

Why this matters

For a situation like the one above, with a fairly short time window (and recovery period) and no third-party tools in use, you might not even have noticed that backdating. But let’s say you’re using a longer time period in your threshold, 45 minutes. In that case, your incident timeline would look like this:

  • 2:05 a.m. HouseCat count = 2

  • 2:50 a.m. HouseCat count = 2

VIOLATION OPEN 2:50 A.M.

INCIDENT OPEN 2:50 A.M.

  • 3:00 a.m. HouseCat count = 1

  • 3:00 a.m. Recovery period has begun

  • 3:45 a.m. HouseCat count = 1

  • 3:45 a.m. Recovery period has ended

And your incident log will look like this:

VIOLATION CLOSED 3:00 A.M.

INCIDENT CLOSED 3:45 A.M.

And furthermore, let’s say you’re using a third party tool like PagerDuty, and when it generated a list of open violations at 3:30 a.m., this one was on that list. What the heck? The violation closed at 3:00 a.m.!

But remember, the time shown in that violation’s end time was backdated. The violation will still show on a list of open violations until 3:45 a.m., when the incident closed.

Alternative: closing the incident manually

The above timelines all assume that you’re letting the violation auto-close. If you want to immediately mark the violation (and by association, the incident) as closed, you can do that manually, in which case the above timeline would look like this instead:

  • 2:05 a.m. HouseCat count = 2

  • 2:50 a.m. HouseCat count = 2

VIOLATION OPEN 2:50 A.M.

INCIDENT OPEN 2:50 A.M.

  • 3:00 a.m. HouseCat count = 1

  • 3:02 a.m. You log into New Relic and manually close the violation, effectively telling us that you’re aware of the Guilder problem and have taken care of it. This ends both the violation and incident immediately and now your log will look like this:

VIOLATION CLOSED 3:02 A.M.

INCIDENT CLOSED 3:02 A.M.

And now when PagerDuty pulls that list of open violations at 3:30 a.m., you won’t see this incident on it because both the incident and the violation are over.

Resources

I hope this helps you better understand your Alert incident timeline and the expected behavior! But before you go, I encourage you to also check out some of our other excellent resources that can help you become an Alerting pro!