Thanks for the different take on this feature idea @tim.davis - I’ll get that filed internally.
I was surprised to see that warning alerts aren’t a possibility. Nobody I know has the spare time to watch a screen in addition to their production work. FWIW, Sensu has a really great working model when it comes to different handlers for different thresholds, if you’re looking for some design ideas. Hope you get this enabled soon!
Thanks for the feedback @inger.klekacz - We’ll get that added here with your +1 for this feature.
We would really like this too, so that warnings could be sent to our OpsGenie integration, which then creates a JIRA ticket; that way our team can prevent issues before they ever become an alert.
I’m actually really surprised that a monitoring solution doesn’t already provide this feature.
Hey @steve.townsend - Your +1 has been added.
For now, would a workaround be worth trying: a separate condition with a lower critical threshold, acting as a warning?
I don’t see any other option. I’m still pretty shocked this isn’t something customers are blowing up about.
Thanks for the reply.
You’re right - it’s an old feature request, and a highly requested one. I can’t speak to the roadmap, as I’m not aware of what’s on it.
But I can get these messages sent over to the Alerts product development team for you.
Have a lovely day and stay safe
We are using conditions set up as critical everywhere and putting them in different alert policies. What changes is the alert policy name and the notification method.
An ugly workaround which duplicates our alert policies… sad but true
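For anyone wanting to automate this workaround, a hypothetical sketch using the New Relic Terraform provider is below. It generates the duplicated policy/condition pairs from one definition, so the two copies can’t drift apart. All names, the query, and the thresholds are placeholders, not taken from this thread:

```hcl
# Sketch only: one metric definition, stamped out once per severity.
locals {
  severities = {
    warning  = { threshold = 70 } # notified via e.g. a Slack channel
    critical = { threshold = 90 } # notified via e.g. PagerDuty
  }
}

resource "newrelic_alert_policy" "cpu" {
  for_each = local.severities
  name     = "CPU ${each.key}" # policy name encodes the severity
}

resource "newrelic_nrql_alert_condition" "cpu" {
  for_each  = local.severities
  policy_id = newrelic_alert_policy.cpu[each.key].id
  name      = "${title(each.key)}: CPU utilization"

  nrql {
    query = "SELECT average(cpuPercent) FROM SystemSample"
  }

  # Both copies use a *critical* threshold; only the value differs,
  # which is the crux (and the ugliness) of the workaround.
  critical {
    operator              = "above"
    threshold             = each.value.threshold
    threshold_duration    = 300
    threshold_occurrences = "ALL"
  }
}
```

Notification channels would still need to be attached to each policy separately, and the graphs/event-list drawbacks described below remain.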
That’s the best available workaround right now. Understanding it’s not ideal, I’m glad it is at least working for you.
Yes, the workaround works for now, but it is not a viable long-term solution. We want to notify developers when the warning threshold is reached, to keep the problem from ever reaching a critical point. Today that can only be achieved by creating a duplicated alert policy (the suggested workaround).
Problems with the workaround:
- We need to maintain a duplicated set of all alert policies (high chance of the copies drifting out of sync).
- Warning thresholds become completely useless.
- The event list fills up with “critical” events that are not actually critical.
- We need to add a “Critical”/“Warning” prefix to all the alert conditions in order to know the level of a notification.
- All the graphs show a red area instead of yellow, since the workaround requires us to use critical thresholds…
I hope it gets integrated soon. It’s a pain for both users and admins.
Thanks and keep up the good work!
Thanks for the very detailed and helpful input @fpoulin - We’ll get those thoughts logged for you.
But seriously? This makes me want to about-face back to Datadog.
@matt.butalla - Can you outline your use case for this? I’ll be able to get that logged internally for the PM to see.
The advent of Workloads particularly emphasizes the need to better convey warning threshold violations beyond New Relic.
The traffic-light system is really great to give a quick overview of systems/alerts at a glance… but it’s not always possible to monitor the page, especially with several teams split over different accounts.
As such, it’d be ideal to trigger notifications at both thresholds with relevant severities set in the alert payload to allow appropriate events routing within integrations, such as PagerDuty.
Yes, absolutely. Currently we have warning- and critical-type alerts coming into our team’s Slack channel with Datadog. Various standard things like queue depth and system utilization (disk space, memory, etc.) have warning and critical levels associated with them. Warnings and criticals both deliver Slack messages, but criticals page us via PagerDuty. Obviously we strive to set our systems up so they can adapt (e.g. autoscaling) and we don’t have to get paged just to push a button and scale something, but things don’t always work that way.

On a few occasions a warning message from Datadog was delivered to Slack, and we were able to address the issue during normal business hours, before it became a critical problem. Conversely, if systems degrade badly and quickly, we want a page in the middle of the night so that failures do not cascade. To me this is the benefit of having an alert set up with warnings and criticals both delivered to something like Slack, and criticals delivered only to PagerDuty. Hopefully that summarizes our current state with DD.

What I don’t see in NR is a way to make this work with warning and critical thresholds. A single alert seems to require a single set of notification channels (via an alert policy), where critical always delivers to all channels and warning doesn’t go anywhere but the NR dashboard. I understand that engineers could make it a habit to visit the NR dashboard in the morning when checking email and such, but I know that won’t happen (myself included). We’re in Slack every day, and having warnings delivered to Slack (and only to Slack) would put the alerts in our face so we could address issues before they become critical. Right now the only way I see to do this in NR is to create two alerts that are duplicates aside from different thresholds and alert policies. That seems cumbersome and, while doable, less than ideal for maintenance.
After all of this typing I figured I should find a link to the documentation we use for our monitors. This is probably the best one: https://docs.datadoghq.com/monitors/notifications/?tab=is_alert#conditional-variables

It outlines the notification variables that DD offers. We use #is_warning and #is_alert extensively across monitors. Adding these to a single monitor allows us to tailor different <@-NOTIFICATION> endpoints for one monitor: we can set up our target metric and all of its parameters and thresholds once, and alert different channels based on the threshold crossed. Hopefully this helps; thank you for reaching out.
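For readers unfamiliar with Datadog, the conditional variables mentioned above look roughly like this inside a monitor’s notification message (the channel and service handles here are placeholders, not real integrations):

```
{{#is_warning}}
Queue depth is elevated. Heads up. @slack-team-alerts
{{/is_warning}}
{{#is_alert}}
Queue depth is critical. Paging on-call. @slack-team-alerts @pagerduty-queue-service
{{/is_alert}}
```

One monitor, one set of thresholds, two routing outcomes: warnings go only to Slack, while criticals also page via PagerDuty. This is the behavior the thread is asking New Relic to support natively.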
Thank you for such extensive detail! This is great, and it helps show the product teams the level of demand for this feature. I’ve got all of that logged internally for both of you now.
It is well past due for some product management attention to this thread.
We are starting the process of redesigning the notification services and related user experience. When doing this, we will allow all messages and titles to be fully configurable, and we will support sending notifications on warning messages to configurable notification destinations. If you have upgraded to our Incident Intelligence service, some of this is available now.