Below, I’ll refer to “violations” and “incidents.” Keep in mind that these terms are for the legacy alerting experience. The newer systems that are now available use different terminology, which I go through in this post.
Until now, incidents have had no limit on how long they can stay open. They will now have a 30-day time limit (“time to live,” or TTL). Any open incident will automatically close once it reaches 30 days old, and all violations within that incident will be force-closed at the same time. If your systems are still showing symptoms, your alert conditions will open new violations, and these will roll up into new incidents, resulting in new notifications being sent out.
IMPORTANT NOTE: The “incident TTL” can override your violation TTL if a new(er) violation rolls up into an old(er) incident. The result is a violation that seems to close early, because its parent incident was closed. If the conditions that caused the violation still hold, a new violation will open as soon as the threshold requirements are met.
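To make the “closes early” case concrete, here is a minimal sketch of the behavior described above. It is only an illustrative model, not the actual alerting implementation: the `Violation` and `Incident` classes, the `expire_if_due` method, and the dates are all hypothetical names and values chosen for the example. The only detail taken from this post is the 30-day incident TTL.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional

INCIDENT_TTL = timedelta(days=30)  # the 30-day incident TTL from this post


@dataclass
class Violation:  # hypothetical model, not a real API object
    opened_at: datetime
    closed_at: Optional[datetime] = None


@dataclass
class Incident:  # hypothetical model, not a real API object
    opened_at: datetime
    violations: List[Violation] = field(default_factory=list)
    closed_at: Optional[datetime] = None

    def expire_if_due(self, now: datetime) -> bool:
        """Close the incident and force-close its open violations at 30 days old."""
        if self.closed_at is not None or now - self.opened_at < INCIDENT_TTL:
            return False
        self.closed_at = now
        for violation in self.violations:
            if violation.closed_at is None:
                violation.closed_at = now
        return True


start = datetime(2024, 1, 1)  # arbitrary example date
incident = Incident(opened_at=start)
incident.violations.append(Violation(opened_at=start))

# A newer violation rolls up into the old incident on day 28.
late = Violation(opened_at=start + timedelta(days=28))
incident.violations.append(late)

# On day 30 the incident hits its TTL and takes every open violation with it.
expired = incident.expire_if_due(start + timedelta(days=30))
late_violation_age = late.closed_at - late.opened_at
print(expired, late_violation_age.days)  # True 2
```

In this sketch the violation that opened on day 28 is closed after only two days, regardless of its own TTL, because its parent incident expired.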
There are two main reasons for this change. First, some of our users are not being notified about new problems, because new violations roll up into old incidents and do not send a notification. Second, it aligns the behavior of legacy incidents with new issues, which already have a time-to-live of 30 days.