@Anup.Jishnu [9:35 AM]
yes. My main question is with regard to notifications
Notifications on Less Critical events
It is a bit complex at present to manage Critical and less critical notifications
any plans to ease it?
@NateHeinrich [9:38 AM]
Ah, yes. I hear about that fairly frequently. Though I have a strong opinion that all notifications should require immediate action, I know there are workflows out there that require things like warning events to be sent to “low priority” channels.
@Anup.Jishnu [9:40 AM]
Yes, all notifications need immediate attention. Critical notifications would trigger the “Major Incident Process” and would involve management; Less Critical would be a heads-up to the DevOps or Dev team to jump in and fix stuff before it becomes Critical
@NateHeinrich [9:40 AM]
We’ve been talking internally about a few approaches that can give customers some of this flexibility as an option. However, it isn’t on the immediate roadmap.
@NateHeinrich [9:41 AM]
Totally makes sense. You have two different workflows in your company that need to be kicked off based on the type of violation that occurred. @Anup.Jishnu curious, does one process escalate into another if severity increases during the incident?
@Anup.Jishnu [9:42 AM]
Sure, e.g.: CPU is spiking, and once it crosses 70% it's a Less Critical event. But if it hits 90% it is Critical. Due to the way we have set up servers, it should not normally cross 50%.
So yes Less Critical escalates to Critical
The Warning threshold that NR presently has is what we term Less Critical
But we need notifications for those too
Based on the term “Warning” or “Less Critical”, our paging service will notify via email/phone and escalate it accordingly
Same for Apdex score
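A minimal sketch of the two-tier CPU example above, assuming the 70% and 90% thresholds from the conversation. The channel names and function names are hypothetical, and reaching a threshold is treated as crossing it; this is not New Relic's or any paging service's API.

```python
from typing import Optional

# Thresholds taken from the CPU example in the conversation.
WARNING_CPU = 70.0   # "Less Critical" starts here
CRITICAL_CPU = 90.0  # "Critical" starts here

# Hypothetical channel names; substitute whatever the paging service expects.
ROUTES = {
    "Critical": "major-incident-process",
    "Less Critical": "devops-heads-up",
}


def classify_cpu(cpu_percent: float) -> Optional[str]:
    """Return the severity label for a CPU reading, or None if no alert is due."""
    if cpu_percent >= CRITICAL_CPU:
        return "Critical"
    if cpu_percent >= WARNING_CPU:
        return "Less Critical"
    return None


def route(cpu_percent: float) -> Optional[str]:
    """Map a CPU reading to the channel that should be notified, if any."""
    severity = classify_cpu(cpu_percent)
    return ROUTES[severity] if severity else None
```

Because the critical check runs first, a reading that climbs from 75% to 92% moves from the heads-up channel to the major-incident channel, which is the escalation path described above.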
@NateHeinrich [9:45 AM]
I see. What service manages your on-call/escalation rules?
@Anup.Jishnu [9:45 AM]
if it falls below 0.5 or below 0.2
Not sure if I understood your question
@NateHeinrich [9:51 AM]
Additionally, if you have metrics that are trending up/down or have weekly seasons/cycles that you haven’t been able to alert on in the past with static thresholds, baselines can help here.
@Anup.Jishnu [9:51 AM]
I am looking at NR Alerts screen and it is not clear where it can be set from
@NateHeinrich [9:51 AM]
@aj go into a policy, then click on ‘incident preference’ at the top right
@Anup.Jishnu [9:52 AM]
ah … that one … I assumed something similar to “wait for 10 events” before creating an incident.
@NateHeinrich [9:53 AM]
uploaded and commented on this image: Consider this throughput metric
With baselines you can get notified if traffic spikes at night (when it normally doesn’t) or if traffic dips during the day (which would also be an anomaly)
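To illustrate the idea of alerting against a time-of-day baseline rather than a static threshold, here is a rough sketch. It is not New Relic's baseline algorithm; it assumes a simple per-hour mean and standard deviation, and the function names and the `tolerance` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_hourly_baseline(history):
    """Build {hour: (mean, std dev)} from (hour_of_day, throughput) pairs."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_anomalous(baseline, hour, value, tolerance=3.0):
    """True if value is more than `tolerance` std devs from that hour's norm."""
    expected, spread = baseline[hour]
    spread = max(spread, 1e-9)  # avoid a zero-width band on flat history
    return abs(value - expected) > tolerance * spread


# Illustrative history: quiet at 03:00, busy at 14:00.
history = [(3, 100), (3, 110), (3, 90), (14, 1000), (14, 1050), (14, 950)]
baseline = build_hourly_baseline(history)
```

With this made-up history, a throughput of 600 is flagged both at 03:00 (a spike at night) and at 14:00 (a dip during the day), which a single static threshold could not express.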
@NateHeinrich [9:54 AM]
Incident preference can be a bit tricky to understand, we’re looking into ways to explain it better
@Anup.Jishnu [9:51 AM]
Baseline alerts - I want to study that more
@David.Morris [10:05 AM]
not from our New Relic, but in an ideal world I’d like to look at the confidence interval on the fraction of requests of a particular type, over multiple time windows—say 1/10/60 minutes
obviously that would also require confidence intervals
but basically using longer time windows to trade responsiveness for accuracy
but, that has to be upstream—currently, you essentially calculate a sample, Alerts thresholds that sample, and then does some conditional logic on the last N threshold decisions
so averaging the samples happens later in the pipeline than where I want the flexibility
put simply, adding support for a SINCE/UNTIL combination other than a 1-minute window in NRQL alerts
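As a rough illustration of why longer windows help here, the sketch below computes a Wilson score interval for the fraction of requests of one type. The choice of the Wilson interval and the function names are assumptions made for illustration; nothing in the conversation says which interval would be used.

```python
import math


def wilson_interval(matching: int, total: int, z: float = 1.96):
    """Wilson score interval for the proportion matching / total.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if total == 0:
        return (0.0, 1.0)  # no data: the fraction could be anything
    p = matching / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)


def interval_width(matching: int, total: int) -> float:
    """Width of the interval; smaller means a more trustworthy estimate."""
    low, high = wilson_interval(matching, total)
    return high - low
```

For the same observed fraction, say 20%, a 1-minute window with 10 requests gives a much wider interval than a 60-minute window with 600 requests, so a threshold on the interval bound rather than the raw fraction fires less often on sparse traffic.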
@NateHeinrich [10:08 AM]
i see. and being able to make that trade off for higher accuracy would reduce false positives.
@David.Morris [10:09 AM]
yeah, particularly at lower-traffic periods (when false positives are most painful)
@NateHeinrich [10:09 AM]
ah yes, today NRQL conditions register a streaming query with NRDB and it publishes “tumbling 1min windows” to our alerts eval engine which holds state and does the duration and threshold computation.
if you have control over since/until and set it to last 5mins, would you expect that to publish every 5mins or every min?
@jsprague [10:10 AM]
I noticed when you were showing the infrastructure alerts, you were creating an alert under the settings section of infrastructure
If you create an alert there, will it also show up in the global Alerts section in the top nav?
@David.Morris [10:11 AM]
I’d expect a 5min window published every 1min
although there would be some cases in which publishing less frequently would probably be acceptable
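A small sketch of "a 5min window published every 1min", built on top of tumbling 1-minute buckets. It assumes each bucket arrives as a (count, sum) pair so buckets can be merged; the class and method names are hypothetical and do not reflect how NRDB or the alerts evaluation engine is implemented.

```python
from collections import deque


class RollingWindow:
    """Keeps the last `width` one-minute buckets and averages across them."""

    def __init__(self, width: int = 5):
        # deque with maxlen drops the oldest bucket automatically
        self._buckets = deque(maxlen=width)

    def publish(self, count: int, total: float):
        """Add one 1-minute (count, sum) bucket; return the rolling average.

        Returns None when the window contains no samples at all.
        """
        self._buckets.append((count, total))
        n = sum(c for c, _ in self._buckets)
        s = sum(t for _, t in self._buckets)
        return s / n if n else None
```

Calling `publish` once per minute yields one value per minute, each covering up to the last five minutes, so a single sparse minute shifts the result less than it would with an isolated 1-minute window.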
@NateHeinrich [10:12 AM]
@jsprague yes, today Infra alert condition management is done in the Infra UI only, but will show up in the Alerts UI within the policy you associated it with. Managing those conditions in Alerts is definitely a priority for us.
@David.Morris [10:13 AM]
I can obviously see performance implications with that, but I’m hoping that the work on baselines might have brought some performance improvements for rolling calculations
@NateHeinrich [10:13 AM]
@David.Morris that’s what I’m thinking as well. at some window width TBD having 1min updates isn’t helpful and is expensive to run.
agreed, baselines on sparse metrics could potentially benefit from rolling windows
prediction algorithms generally work better when values stay above zero
@David.Morris [10:15 AM]
although the set of calculations you’d need to perform downstream of the aggregation is much smaller than all of NRQL—so having the NRQL publish something aggregatable and having a limited amount of aggregation in alerts would probably cover most of the use cases I’m thinking of
I think there’s also an interaction with NRQL alerts on metrics from Infrastructure integrations, because of the different sampling frequencies
@hross [10:15 AM]
We have just a few minutes left with @NateHeinrich, I love sharing how we use New Relic at New Relic, so @NateHeinrich - what’s the best way we use Alerts here?
@David.Morris [10:18 AM]
I haven’t tried it recently because it went so spectacularly wrong when I first attempted it, but NRQL alerts on e.g., an AWS integration metric flap madly because that has a 5min sample interval
@NateHeinrich [10:18 AM]
My favorite example of New Relic engineering teams using New Relic Alerts is NRQL alerts on custom events!
We have teams sending data from our super high throughput edge tier into Insights custom events using the Insert API to do some crazy things like security-based alerts, abnormal traffic from certain places around the world, etc.
Also using custom attributes on Transactions to add metadata like account ids or user ids to know when performance for certain customers changes.
@jsprague Yes! I can’t wait to put incident lifecycle data into Insights so customers can do their own analytics or even meta-alerting.
@David.Morris [10:23 AM]
a more mundane question—any idea where scheduled maintenance/downtime windows are on the roadmap?
@NateHeinrich [10:24 AM]
@jsprague we have a prototype of this running internally. We’re hoping to get to this early next year if we can.
@David.Morris We’re doing some planning around maintenance windows now actually. Would you (or anyone in here) be interested in talking to a product designer about this?