Alerts not sending notifications

I’ve got an alert policy set up to send a Slack notification when the NRQL query returns a value below a threshold. There have been several cases in the past week that should have resulted in a notification going out, but nothing happened.

https://onenr.io/0qQapxlzVj1

I’ve read through the alert setup docs and a few posts (one, two) and have checked that:

  • No prior incidents are open
  • Changed incident preference from “By policy” to “By condition and signal”
  • Verified test notifications can be sent to Slack

This alert policy has worked in the past (April 8th and April 11th), resulting in Slack notifications (both incidents are now closed).

I did create a new app after those prior incidents, since the old app was shared with a larger app. The query driving the alerts was updated to reference the new app ID, with no other logical changes.

I’ve tried a few other things (deleting and recreating the alert policy, and adjusting thresholds and aggregation windows), but no luck.

So I’m wondering if there’s some other setting or configuration that’s missing, if there are open incidents that I can’t find, or if something about the newly created app ID is causing this unexpected behavior.

Hi @jbatorski, thanks for all the info on this. I dove into it a bit and I see that the current condition in question looks for values below 90% for at least 5 minutes.

The two incidents in question, April 8th and April 11th, seem to have been looking for below 98% for at least 5 minutes, and the query attached to those two incidents is different from the one in the current condition, which looks for below 90% for at least 5 minutes.
Can you give us a specific time period this week in which you thought the condition should have violated? A day, a time range, and your time zone would help. Then we can check whether the condition should indeed have violated based on the data we ingested during that period.

Best,

@sarnce Thanks for jumping in. Yeah the query and thresholds have been modified a few times since the initial events happened.

Here are the events from this week which I would have expected to trigger alerts.
https://onenr.io/01wZK771ER6

  • There was an initial issue around 8:50am local time that lasted for ~15 minutes
  • The second one started around noon local time and lasted for over 3 hours

It seems short spikes back above the threshold can break the continuous critical condition (rough sketch of what I mean after the settings below), so maybe that’s what happened in the first case, but the second one was several hours long. I’ve also tried changing the aggregation window duration in the alert from 1 to 2 minutes to smooth out the data a bit, but that had no effect.

Another bit of context: the alert editor shows multiple critical violations in the past 6 hours, including one happening as I write this, yet no notifications have been sent from this alert condition today.

Current settings are

  • Open Violation when query returns below 90 for at least 5 min
  • Data Aggregation Window Duration is 1 minute
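
To make sure I’m reasoning about the “for at least 5 min” part correctly, here’s a rough sketch of my mental model (consecutive aggregation windows all below the threshold). This is just an illustration with made-up signal values, not New Relic’s actual evaluation engine:

    # Mental model of a "below 90 for at least 5 minutes" condition evaluated
    # over consecutive 1-minute aggregation windows. Illustration only, not
    # New Relic's evaluation engine.
    THRESHOLD = 90        # open a violation when the signal is below this
    REQUIRED_WINDOWS = 5  # ...for this many consecutive 1-minute windows

    def violation_opens_at(windows):
        """Return the window indexes at which a violation would open.

        A single window at or above the threshold resets the streak, which is
        why a short spike back above 90 can keep a violation from opening.
        """
        opens, streak = [], 0
        for i, value in enumerate(windows):
            if value < THRESHOLD:
                streak += 1
                if streak == REQUIRED_WINDOWS:
                    opens.append(i)
            else:
                streak = 0  # one healthy window breaks the continuous condition
        return opens

    # The ~15-minute event, but with a brief recovery spike at minute 4:
    signal = [85, 84, 86, 83, 92, 85, 84, 83, 82, 81, 80, 84, 86, 85, 87]
    print(violation_opens_at(signal))  # -> [9]; without the spike it would be [4]

If that model is roughly right, a spike could explain the first ~15-minute event getting reset, but it doesn’t explain the 3-hour one.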

Hi @jbatorski

Apologies for the delay here; we see you and we are here to get this sorted.

Thank you for the link and the extra context here; it is super helpful. I have gone ahead and looped in the Alerts Engineer team for extra support, as this is a little out of my scope.

Please note the engineer will reach out via this post with their findings.

Feel free to share any updates, fixes or questions you may have!

Looks like we have an internal ticket for this, where we are working with @jbatorski.

My support ticket just got resolved, so I’m sharing the findings here for anyone else running into the same problem in the future.

Under the "Fine-tune advanced signal settings section of the alert configuration, there’s an option called “Streaming method”. The streaming method option selections determine if client side or server side time is used when aggregating events used to trigger an incident. Docs explaining it are here.

It looks like the issue my setup was hitting was events coming through with timestamps in the future, from users whose system clocks are wrong. These future timestamps were throwing off the logic used to aggregate events and determine whether an incident should be triggered.

Cadence is the only method that uses server time. I think the recommendation of Event flow only makes sense if you’re dealing with server-generated observation data, and I’d venture to say that front-end apps should only use Cadence, since we can’t trust client-side system clocks.
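
For anyone who wants to picture the failure mode, here’s a toy sketch of the difference as I understand it (my own simplified illustration with made-up window sizes, delays, and events, not New Relic’s implementation): one bogus future timestamp drags an event-timestamp watermark forward, so later valid points look too late, while a server-clock watermark is unaffected.

    # Toy model of why future client timestamps break event-timestamp-based
    # aggregation. Simplified illustration, not New Relic's implementation.
    from collections import defaultdict

    WINDOW = 60   # aggregation window size in seconds (assumed)
    DELAY = 120   # how long to wait before closing a window (assumed)

    def aggregate(events, use_server_time):
        """Bucket events into windows, dropping anything behind the watermark.

        `events` is a list of (server_arrival_time, client_timestamp, value).
        use_server_time=True  (Cadence-like): the watermark follows the server
            clock, so a bogus future client timestamp can't push it forward.
        use_server_time=False (Event flow-like): the watermark follows the
            newest event timestamp seen, so one future timestamp makes later,
            valid data look "too late" and it is silently discarded.
        """
        windows, watermark = defaultdict(list), 0
        for arrival, client_ts, value in events:
            clock = arrival if use_server_time else client_ts
            watermark = max(watermark, clock - DELAY)
            if client_ts < watermark:
                continue  # its window already closed -> data point dropped
            windows[client_ts // WINDOW * WINDOW].append(value)
        return dict(windows)

    events = [
        (0,   0,    85),  # normal traffic
        (60,  3660, 88),  # one user whose clock is an hour fast
        (120, 120,  84),  # on-time data that should count toward the signal
        (180, 180,  83),
    ]
    print(aggregate(events, use_server_time=False))  # points at 120 and 180 dropped
    print(aggregate(events, use_server_time=True))   # every point lands in a window

The numbers and the dropping rule are simplified, but the asymmetry between the two clocks is the part that was biting our setup.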


Hi @jbatorski,

Thank you so much for sharing your answer here. It’s great to see the community helping each other in this way.

I’m sure this will be of benefit to others facing this issue.

Wishing you a great day!