
August 2017 Coffee Chat: Let's Talk about Alerts


#1

Some of you may not know: the New Relic Alerts platform has gone through an exciting evolution since it was first created.

See for yourself:

https://discuss.newrelic.com/c/alerts

On August 25th, Nate Heinrich @NateHeinrich, the man/myth/legend who led New Relic Alerts to what it is today, is joining us for a special Coffee Chat!

During the month of August we are featuring Alerts content in the Online Technical Community so that all of you can add :white_check_mark: Alerts Expert to your New Relic resume.

So grab an iced coffee and be sure to sign up today for the chance to talk features, changes, possibilities, and solutions with Nate!

Friday, August 25th at 9:30 am Pacific / 12:30 pm Eastern

Register Today

Hope to see you on the 25th!


#2

Alerts Coffee Chat with @NateHeinrich - Transcript

@hross [9:26 AM]
Hey @here - we’re going to get started in about 5 minutes. Everyone have their :coffee: or :tea:?

@Anup.Jishnu [9:30 AM]
no :coffee: it’s :hamburger: time :grinning:

@hross [9:30 AM]
Fair enough @aj! I’d eat a :hamburger: any time tho…

@hross [9:31 AM]
Hi folks! I’m Holly, and I help run the Online Technical Community with the amazing @lculver. I’m moderating the chat today.

Our Alerts PM @nheinrich was gracious enough to take an hour of his day to answer your questions. There have been so many great things happening in Alerts that I am certain there is lots to talk about.

@NateHeinrich - tell us what you’re up to and what you’re most excited about!

@NateHeinrich [9:33 AM]
Hey everyone! Happy to be here. Looking forward to some fun discussion.

I’m super excited about our NRQL Baseline alerts feature, which we released recently.

@NateHeinrich [9:34 AM]
It basically combines all the filtering power of NRQL with the awesome tech behind baselines, letting customers operationalize key metrics they couldn’t alert on before.
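
(For context: a baseline condition is driven by an ordinary NRQL query, so anything you can filter in NRQL can get a learned threshold. A minimal sketch of the kind of signal Nate is describing; the app name and URI filter below are invented for illustration:)

```python
# The signal a hypothetical NRQL baseline condition would watch.
# 'checkout-service' and the URI filter are made-up examples.
nrql = """
SELECT count(*)
FROM Transaction
WHERE appName = 'checkout-service'
  AND request.uri LIKE '/api/payments%'
"""
# Instead of a fixed threshold, a baseline condition learns the normal
# shape of this signal over time and violates when it deviates.
```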

@hross [9:35 AM]
@Anup.Jishnu - did you come with any questions at the ready?

@Anup.Jishnu [9:35 AM]
yes. My main question is with regards to notifications

Notifications on Less Critical events

It is a bit complex at present to manage Critical and Less Critical notifications.
Any plans to ease this?

@NateHeinrich [9:38 AM]
Ah, yes. I hear about that fairly frequently. Though I have a strong opinion that all notifications should require immediate action, I know there are workflows out there that require things like warning events to be sent to “low priority” channels.

@Anup.Jishnu [9:40 AM]
Yes, all notifications need immediate attention. Critical notifications would trigger our “Major Incident Process” and would involve management; Less Critical would be a heads-up to the DevOps or dev team to jump in and fix stuff before it becomes Critical

@NateHeinrich [9:40 AM]
We’ve been talking internally about a few approaches that can give customers some of this flexibility as an option. However, it isn’t on the immediate roadmap.

@Anup.Jishnu [9:40 AM]
ok

@NateHeinrich [9:41 AM]
Totally makes sense. You have two different workflows in your company that need to be kicked off based on the type of violation that occurred.
@Anup.Jishnu curious, does one process escalate into another if severity increases during the incident?

@Anup.Jishnu [9:42 AM]
Sure, e.g.: CPU is spiking, and once it crosses 70% it’s a Less Critical event. But if it hits 90% it is Critical. Due to the way we have set up our servers, it should not normally cross 50%.

So yes Less Critical escalates to Critical

The Warning threshold that NR presently has is what we term Less Critical

But we need notifications for those too

Based on the term “Warning” or “Less Critical”, our paging service will notify via email/phone and escalate it accordingly

Same for Apdex score
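
(For readers mapping this onto code: the two-tier scheme boils down to something like the sketch below. The 70%/90% thresholds come from Anup’s example; the function and severity names are illustrative.)

```python
from typing import Optional

WARNING_PCT = 70.0   # "Less Critical" in Anup's terms (NR's Warning)
CRITICAL_PCT = 90.0  # "Critical" -- triggers the Major Incident Process

def severity(cpu_percent: float) -> Optional[str]:
    if cpu_percent >= CRITICAL_PCT:
        return "critical"  # page management, start the major-incident flow
    if cpu_percent >= WARNING_PCT:
        return "warning"   # heads-up to DevOps/dev before it goes critical
    return None            # normal; these hosts should stay under ~50%
```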

@NateHeinrich [9:45 AM]
I see. What service manages your on-call/escalation rules?

@Anup.Jishnu [9:45 AM]
if it falls below 0.5 or below 0.2
Not sure if I understood your question

@NateHeinrich [9:46 AM]
PagerDuty, Opsgenie, VictorOps, xMatters?

@Anup.Jishnu [9:46 AM]
Are you asking for the tool used? - PagerDuty & VictorOps

@NateHeinrich [9:46 AM]
cool, great folks at both of those places

@hross [9:47 AM]
We had another question that a community member sent in via email that I would love to see answered too - do you have any tips for managing alarm noise?

@NateHeinrich [9:48 AM]
One of my biggest tips here is to consider leveraging NR Alerts’ ability to aggregate violations into incidents within a policy.

Alert policies have a preference that customers can set, ranging from no aggregation to high aggregation.

And since notifications are sent on incident open/ack/close this can save you from getting dozens of pages within a short amount of time for the same issue.
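
(A rough sketch of why this matters for noise. The three preference names match the Alerts UI; the grouping logic below is an illustration, not New Relic’s actual engine.)

```python
from collections import defaultdict

# Each open violation is (policy, condition, entity).
violations = [
    ("prod-alerts", "high-cpu", "host-a"),
    ("prod-alerts", "high-cpu", "host-b"),
    ("prod-alerts", "low-apdex", "checkout"),
]

def group_into_incidents(violations, preference):
    key_fn = {
        "by_policy": lambda p, c, e: (p,),                     # most aggregation
        "by_condition": lambda p, c, e: (p, c),
        "by_condition_and_entity": lambda p, c, e: (p, c, e),  # least
    }[preference]
    incidents = defaultdict(list)
    for p, c, e in violations:
        incidents[key_fn(p, c, e)].append((c, e))
    return incidents

# "By policy" opens 1 incident (one notification thread) for all three
# violations; "by condition and entity" opens 3. Scaled up, that is the
# difference between one page and dozens.
for pref in ("by_policy", "by_condition", "by_condition_and_entity"):
    print(pref, len(group_into_incidents(violations, pref)))
```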

@Anup.Jishnu [9:50 AM]
How do we set aggregation?

@NateHeinrich [9:51 AM]
Additionally, if you have metrics that are trending up/down or have weekly seasons/cycles that you haven’t been able to alert on in the past with static thresholds, baselines can help here.

@Anup.Jishnu [9:51 AM]
I am looking at NR Alerts screen and it is not clear where it can be set from

@NateHeinrich [9:51 AM]
@aj go into a policy, then click on ‘incident preference’ at the top right (edited)

@Anup.Jishnu [9:52 AM]
ah … that one … I assumed something similar to “wait for 10 events” before creating an incident.

@NateHeinrich [9:53 AM]
[uploaded an image and commented:] Consider this throughput metric

With baselines you can get notified if traffic spikes at night (when it normally doesn’t) or if traffic dips during the day (which would also be an anomaly)
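
(As a toy illustration of spike-and-dip detection: learn the typical traffic level for each hour of the day, then flag deviations in either direction. This is a generic seasonal-band sketch, not New Relic’s actual baseline algorithm.)

```python
import statistics

def fit_baseline(samples):
    """samples: (hour_of_day, requests_per_minute) pairs from history."""
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    # Typical level and spread of traffic for each hour of the day.
    return {h: (statistics.mean(v), statistics.stdev(v))
            for h, v in by_hour.items() if len(v) > 1}

def violates(baseline, hour, value, k=3.0):
    mean, std = baseline[hour]
    # Either direction is anomalous: a 3 a.m. spike or a midday dip.
    return abs(value - mean) > k * std
```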

@NateHeinrich [9:54 AM]
Incident preference can be a bit tricky to understand; we’re looking into ways to explain it better

@Anup.Jishnu [9:51 AM]
Baseline alerts - I want to study that more

@NateHeinrich [9:55 AM]

Baselines are awesome. I’m doing a talk on them at Futurestack New York if you’re able to attend (will probably be recorded too).

@David.Morris [9:57 AM]
are there any plans to give non-baseline alerts more flexibility in the time dimension?

@NateHeinrich [9:58 AM]
like rolling aggregates?

@David.Morris [9:59 AM]
that kind of thing, yeah

@NateHeinrich [10:00 AM]
we’ve talked about this, yes, especially for NRQL conditions, where you’re more likely to heavily filter your signal, creating a sparse time series (more zeros)

@Anup.Jishnu [10:00 AM]
Thanks @nheinrich & @hross, this was a good session for me and would have loved to stay on, but duty calls so dropping off.

@NateHeinrich [10:01 AM]
one approach could be to add 5min rolling aggregates to keep the signal off the floor

Do you have an example of your own you can share?

@hross [10:04 AM]
While @David.Morris works on that response, I’ll share that we have great tips and tricks for Alerts in the Level Up section of the Community:
https://discuss.newrelic.com/c/level-up-relic-solutions/level-up-alerts

@David.Morris [10:05 AM]
not from our own New Relic data, but in an ideal world I’d like to look at the confidence interval on the fraction of requests of a particular type, over multiple time windows, say 1/10/60 minutes

obviously that would also require confidence intervals :slightly_smiling_face:

but basically using longer time windows to trade responsiveness for accuracy

but that has to happen upstream: currently, you essentially calculate a sample, Alerts applies a threshold to that sample, and then it does some conditional logic on the last N threshold decisions

so the averaging of samples happens later than where I want the flexibility
put simply: adding support for a SINCE/UNTIL combination other than a 1-minute window in NRQL alerts

@NateHeinrich [10:08 AM]
i see. and being able to make that trade off for higher accuracy would reduce false positives.

@jsprague [10:09 AM]
@NateHeinrich Hi! I have a question from the webinar yesterday

@David.Morris [10:09 AM]
yeah, particularly at lower-traffic periods (when false positives are most painful)

@NateHeinrich [10:09 AM]
ah yes, today NRQL conditions register a streaming query with NRDB and it publishes “tumbling 1min windows” to our alerts eval engine which holds state and does the duration and threshold computation.

if you have control over since/until and set it to last 5mins, would you expect that to publish every 5mins or every min?
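
(A minimal sketch of the tumbling-versus-rolling distinction under discussion; toy numbers, not New Relic internals.)

```python
# A heavily filtered NRQL signal: events per 1-minute tumbling window.
events_per_minute = [0, 3, 0, 0, 2, 0, 0, 0, 1, 0]

# Tumbling 1-min windows (today's behavior): lots of zeros -> flappy thresholds.
tumbling = events_per_minute

# 5-min rolling window published every minute: the same events, but the
# signal stays "off the floor" and thresholds flap far less.
rolling = [sum(events_per_minute[max(0, i - 4): i + 1])
           for i in range(len(events_per_minute))]

print(tumbling)  # [0, 3, 0, 0, 2, 0, 0, 0, 1, 0]
print(rolling)   # [0, 3, 3, 3, 5, 5, 2, 2, 3, 1]
```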

@hross [10:10 AM]
@jsprague - go for it!

@jsprague [10:10 AM]
I noticed when you were showing the infrastructure alerts, you were creating an alert under the settings section of infrastructure

If you create an alert there, will it also show up in the global Alerts section in the top nav?

@David.Morris [10:11 AM]
I’d expect a 5min window published every 1min

although there would be some cases in which publishing less frequently would probably be acceptable

@NateHeinrich [10:12 AM]
@jsprague yes, today Infra alert condition management is done in the Infra UI only, but the conditions will show up in the Alerts UI within the policy you associated them with. Managing those conditions in Alerts is definitely a priority for us.

@David.Morris [10:13 AM]
I can obviously see performance implications with that, but I’m hoping that the work on baselines might have brought some performance improvements for rolling calculations

@NateHeinrich [10:13 AM]
@David.Morris that’s what I’m thinking as well. At some window width TBD, having 1min updates isn’t helpful and is expensive to run.

agreed, baselines on sparse metrics could potentially benefit from rolling windows

prediction algorithms generally work better when values stay above zero :slightly_smiling_face:

@David.Morris [10:15 AM]
although the set of calculations you’d need to perform downstream of the aggregation is much smaller than all of NRQL—so having the NRQL publish something aggregatable and having a limited amount of aggregation in alerts would probably cover most of the use cases I’m thinking of

I think there’s also an interaction with NRQL alerts on metrics from Infrastructure integrations, because of the different sampling frequencies

@hross [10:15 AM]
We have just a few minutes left with @NateHeinrich. I love sharing how we use New Relic at New Relic, so @NateHeinrich - what’s the best way we use Alerts here?

@David.Morris [10:18 AM]
I haven’t tried it recently because it went so spectacularly wrong when I first attempted it, but NRQL alerts on, e.g., an AWS integration metric flap madly because that metric has a 5-minute sample interval

@jsprague [10:18 AM]
@NateHeinrich any update on pulling alerts data into insights?

@NateHeinrich [10:18 AM]
My favorite example of New Relic engineering teams using New Relic Alerts is NRQL alerts on custom events!

We have teams sending data from our super-high-throughput edge tier into Insights custom events using the Insert API to do some crazy things like security-based alerts, spotting abnormal traffic from certain places around the world, etc.
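
(For reference, inserting a custom event looks roughly like this. The event type, fields, account ID, and key below are placeholders; check the current Insights docs for endpoint and auth details.)

```python
import requests

ACCOUNT_ID = "12345"            # placeholder
INSERT_KEY = "YOUR_INSERT_KEY"  # placeholder

# A made-up security event of the kind Nate describes.
events = [{
    "eventType": "EdgeSecuritySample",  # invented custom event type
    "sourceCountry": "XX",
    "requestsPerMinute": 9001,
}]

resp = requests.post(
    f"https://insights-collector.newrelic.com/v1/accounts/{ACCOUNT_ID}/events",
    json=events,
    headers={"X-Insert-Key": INSERT_KEY},
)
resp.raise_for_status()

# A NRQL condition can then alert on it, e.g.:
#   SELECT count(*) FROM EdgeSecuritySample WHERE requestsPerMinute > 5000
```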

We’re also using custom attributes on Transactions to add metadata like account IDs or user IDs, so we know when performance for certain customers changes.

@jsprague Yes! I can’t wait to put incident lifecycle data into Insights so customers can do their own analytics or even meta-alerting.
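
(And the custom-attribute pattern Nate just described, sketched with the Python agent. The attribute names and handler are illustrative; add_custom_parameter was the agent call in this era, and newer agents rename it add_custom_attribute.)

```python
import newrelic.agent

@newrelic.agent.background_task()
def handle_request(account_id, user_id):
    # Tag the transaction with business metadata so NRQL conditions can
    # alert when performance changes for a specific customer.
    newrelic.agent.add_custom_parameter("accountId", account_id)
    newrelic.agent.add_custom_parameter("userId", user_id)
    ...  # normal request handling

# e.g.: SELECT average(duration) FROM Transaction WHERE accountId = '12345'
```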

@David.Morris [10:23 AM]
a more mundane question—any idea where scheduled maintenance/downtime windows are on the roadmap?

@NateHeinrich [10:24 AM]
@jsprague we have a prototype of this running internally. We’re hoping to get to this early next year if we can.

@David.Morris We’re doing some planning around maintenance windows now actually. Would you (or anyone in here) be interested in talking to a product designer about this?

@David.Morris [10:26 AM]
absolutely

@hross [10:26 AM]
@David.Morris- I’ll get your contact info to @NateHeinrich so he can follow up!

@NateHeinrich [10:26 AM]
Cool. I’ll reach out soon.

@hross [10:26 AM]
Down to the wire with just 4 minutes left! Anyone have anything last-minute they want to ask?

Going once…

Going twice…

And we’ll call it a day! Huge thanks to you for asking great questions and to @nheinrich for bringing all the answers.

@jsprague [10:29 AM]
Thanks @hross and @NateHeinrich!

@David.Morris [10:29 AM]
thanks, very interesting

@hross [10:30 AM]
Have more questions? Don’t forget to join us in the community.

https://discuss.newrelic.com/c/alerts

@NateHeinrich is great about jumping in there all the time!