Imagine that you’ve created a Host Not Reporting (HNR) alert condition in Infrastructure (as detailed in this document), then one of your hosts goes down. You get a notification that your host is down, and at some point the next day you bring that host back online. It starts showing up in your Infrastructure dashboard again (yay!) but the HNR alert incident doesn’t close (boo!). Why is this happening?
When a host with the New Relic Infrastructure Agent (NRIA) installed first connects to our data collectors, it is assigned an arbitrary entityId. The alerts evaluation system uses this identifier to track the host. When a host’s NRIA stops reporting to our data collectors, the alerts evaluation system considers that host to be offline, and after 5 minutes (the default setting) it opens an HNR violation.
The NRIA keeps the original entityId cached for 24 hours. However, if more than 24 hours pass before the NRIA starts reporting again (regardless of whether the host itself comes back online before that time), the original entityId is purged and a new one is assigned. The same thing happens if the host is reprovisioned: even though the same hostname is used, a new entityId is assigned. In either case, any open HNR violation will remain open indefinitely, because the alerts evaluation system is still waiting for the original entityId to come back online.
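To make the mechanics concrete, here is a minimal sketch of the behavior described above. It is a toy model, not New Relic's implementation: the Host class, the reconnect and violation_can_close functions, and the ID generator are all hypothetical names invented for illustration. Only the 24-hour cache window and the reprovisioning rule come from the explanation above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import count

CACHE_TTL = timedelta(hours=24)   # cache window described above
_next_id = count(1001)            # hypothetical entityId generator


@dataclass
class Host:
    hostname: str
    entity_id: int
    last_report: datetime


def reconnect(host: Host, now: datetime, reprovisioned: bool = False) -> int:
    """Return the entityId the host reports under after reconnecting."""
    # A new ID is assigned if the cache expired or the host was rebuilt,
    # even though the hostname stays the same.
    if reprovisioned or now - host.last_report > CACHE_TTL:
        host.entity_id = next(_next_id)
    host.last_report = now
    return host.entity_id


def violation_can_close(violation_entity_id: int, host: Host) -> bool:
    """A violation only recovers if its original entityId reports again."""
    return violation_entity_id == host.entity_id


start = datetime(2024, 1, 1, 12, 0)
host = Host("web-01", next(_next_id), start)
violation_entity_id = host.entity_id  # HNR violation opened against this ID

reconnect(host, start + timedelta(hours=3))    # back within 24 hours
print(violation_can_close(violation_entity_id, host))  # True

reconnect(host, start + timedelta(hours=30))   # 27 hours of silence
print(violation_can_close(violation_entity_id, host))  # False
```

In the second case the hostname is unchanged, but the violation is still tied to an entityId that will never report again, which is why it stays open.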
OK, that’s good to know. But how do I close the open HNR violation?
The best thing to do in this case is to disable, then re-enable, the HNR condition. This not only closes any open violations, it also re-syncs the condition with the hosts that are actually online and scopes it to the appropriate entityIds of those hosts.
But what about my CPU%/LoadAvg/Memory etc. alert condition? Why isn’t that closing?
Good question! Often, just before a host dies, one of the indicators of host health will go haywire and cause one or more other alert conditions to open violations. CPU% is a typical example. Perhaps you have an alert condition set up with a threshold of CPU% > 85. If the host’s CPU% climbed to 95% and opened a violation on this condition right before the host went offline, and the host was then assigned a new entityId (for either of the reasons above), this violation will never close.
While a host is offline, our data collectors are receiving null values from the NRIA, and null values are not equal to any numeric value. So, although it appears that your CPU% has dropped to 0 and the alert violation should close, from our alerts evaluation system’s point of view the host hasn’t reported any numeric value since it was up around 95%.
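The null-versus-zero distinction can be illustrated with a small sketch. This is a simplified stand-in for the alerts evaluation system, not its actual logic: the violation_open_after function is hypothetical, None stands in for a null sample, and real conditions also take a duration ("for at least N minutes") into account, which is left out here.

```python
from typing import Iterable, Optional


def violation_open_after(
    samples: Iterable[Optional[float]],
    threshold: float = 85.0,
    initially_open: bool = False,
) -> bool:
    """Return whether a 'CPU% > threshold' violation is open after the samples."""
    is_open = initially_open
    for value in samples:
        if value is None:
            # A null sample is not a numeric value, so it can neither
            # breach the threshold nor satisfy the recovery check.
            continue
        is_open = value > threshold
    return is_open


# Host spikes to 95%, then goes offline: only nulls follow.
print(violation_open_after([40.0, 95.0, None, None, None]))  # True

# Host spikes, goes quiet, then reports healthy data again.
print(violation_open_after([40.0, 95.0, None, 30.0]))  # False
```

The first series never produces a numeric value below the threshold, so the violation has nothing to recover on, even though a dashboard would show CPU% as flat.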
If the host comes back online soon enough that it keeps the original entityId, then these sorts of violations will close automatically after it comes back online and the NRIA reports good data for a few minutes (depending on how you have your alert thresholds set). However, if a new entityId gets assigned, you will need to disable and re-enable this type of condition in order to get the violation to close.