
Relic Solution: How do Host Not Reporting Alert Conditions Work?

alerts
hostname
levelup
infrastructure

#1

Imagine that you’ve created a Host Not Reporting (HNR) alert condition in Infrastructure (as detailed in this document), then one of your hosts goes down. You get a notification that your host is down, and at some point the next day you bring that host back online. It starts showing up in your Infrastructure dashboard again (yay!) but the HNR alert incident doesn’t close (boo!). Why is this happening?

When a host has the New Relic Infrastructure Agent (NRIA) installed and first connects to our data collectors, it is assigned an arbitrary entityId. The alerts evaluation system uses this identifier to track the host. When any given host’s NRIA stops reporting to our data collectors, the alerts evaluation system considers that host to be offline, and after 5 minutes (the default setting) it will open an HNR violation.
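To make the timing concrete, here is a minimal sketch of the "no data for 5 minutes" check. It is only a toy model of the behavior described above, not how the alerts evaluation system is actually implemented; the function name `hnr_violation_due` and the constant `HNR_THRESHOLD` are made up for illustration.

```python
from datetime import datetime, timedelta

# Default from the post: a violation opens after 5 minutes without data.
HNR_THRESHOLD = timedelta(minutes=5)


def hnr_violation_due(last_report: datetime, now: datetime,
                      threshold: timedelta = HNR_THRESHOLD) -> bool:
    """Return True once the host has been silent for at least `threshold`.

    Hypothetical helper: the real evaluation happens inside New Relic.
    """
    return now - last_report >= threshold
```

For example, a host last seen at 12:00 would not be in violation at 12:03, but would be at 12:05 with the default threshold.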

The NRIA will keep the original entityId cached for 24 hours. However, if more than 24 hours pass before the NRIA starts reporting again (regardless of whether the host comes back online before that time), the original entityId is purged and a new one is assigned. This will also happen if the host is reprovisioned – even though the same hostname is used, a new entityId will be assigned. In either of the above cases, any open HNR violation will remain open indefinitely, as the alerts evaluation system is waiting for that original entityId to come back online.
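The 24-hour rule can be sketched as a small cache. This is a simplified model under stated assumptions: the class `EntityRegistry`, its `report` method, and the idea of keying on a "fingerprint" string are all hypothetical names chosen for the example, and the real assignment logic lives in the agent and collectors.

```python
from datetime import datetime, timedelta
from itertools import count
from typing import Dict, Tuple

# From the post: the original entityId is kept for 24 hours.
CACHE_TTL = timedelta(hours=24)


class EntityRegistry:
    """Toy model mapping a host fingerprint to an entityId."""

    def __init__(self) -> None:
        self._next_id = count(1)
        # fingerprint -> (entity_id, time of last report)
        self._cache: Dict[str, Tuple[int, datetime]] = {}

    def report(self, fingerprint: str, now: datetime) -> int:
        """Record a report and return the entityId in effect."""
        cached = self._cache.get(fingerprint)
        if cached is not None and now - cached[1] <= CACHE_TTL:
            entity_id = cached[0]           # still cached: same entityId
        else:
            entity_id = next(self._next_id)  # unknown or expired: new entityId
        self._cache[fingerprint] = (entity_id, now)
        return entity_id
```

In this model, a host that reports again after 23 hours keeps its entityId, while a gap of more than 24 hours, or a changed fingerprint (the reprovisioning case), yields a new one. An HNR violation tied to the old entityId would then have nothing left to close it.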

OK, that’s good to know. But how do I close the open HNR violation?

The best thing to do in this case is to simply disable, then re-enable, the HNR condition. This will not only close any open violations, it will re-sync the condition with hosts that are actually online and scope to the appropriate entityIds of those hosts.

But what about my CPU%/LoadAvg/Memory etc. alert condition? Why isn’t that closing?

Good question! Often, just before a host dies, one of the indicators of host health will go haywire and cause one or several other alert conditions to open violations. CPU% is a good example. Perhaps you have an alert condition set up with a threshold of CPU% > 85. If the host’s CPU% climbed to 95% and opened a violation right before the host went offline, and a new entityId was then assigned (for either of the reasons above), this violation will never close.

While a host is offline, our data collectors are receiving null values from the NRIA, and null values are not equal to any numeric value. So, although it appears that your CPU% has dropped to 0 and the alert violation should close, from our alerts evaluation system’s point of view the host hasn’t reported any numeric values since it was up around 95%.
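Here is a minimal sketch of why nulls don't close the violation. It assumes a deliberately simplified evaluator that ignores duration settings and only looks at whether each sample is above the threshold; `violation_open` is a hypothetical function, not part of any New Relic API.

```python
from typing import Iterable, Optional


def violation_open(samples: Iterable[Optional[float]],
                   threshold: float = 85.0) -> bool:
    """Return whether a violation is open after processing all samples.

    None stands in for a null value: it is neither above nor below the
    threshold, so it leaves the current state untouched.
    """
    is_open = False
    for value in samples:
        if value is None:
            continue  # no numeric data: state does not change
        is_open = value > threshold
    return is_open
```

With samples `[50.0, 95.0, None, None]` the violation stays open, because the last numeric value seen was 95. It only closes once a numeric value at or below the threshold arrives, as in `[50.0, 95.0, None, 40.0]`.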

If the host comes back online soon enough that it keeps the original entityId, then these sorts of violations will close automatically after it comes back online and the NRIA reports good data for a few minutes (depending on how you have your alerts thresholds set). However, if a new entityId gets assigned, you will need to disable and re-enable this type of condition in order to get the violation to close.


#2

Hello, can you define what “host is reprovisioned” means?

Can New Relic set a longer time-out window than 24 hours? Sometimes we have to take a server offline for longer than that or maybe a repair takes longer than that.

Can you include an Insights query example to verify when this issue occurs?

Can this article also be linked to the alerts?

Thanks


#3

Hi @bgaudreault,

I’m going to split out your questions so that I can address them one at a time.



Hello, can you define what “host is reprovisioned” means?

In most cases, it simply means that the fully qualified domain name (FQDN) for the host changed; however, it can refer to any time the overall fingerprint of the host changes. The New Relic Infrastructure Agent’s (NRIA) hostname lookup behavior will be improved in the newest (yet to be released) version of the agent. That version also adds a new config setting, dns_hostname_resolution, which will let users force the agent to do only on-host hostname lookups, if they prefer that behavior, and avoid DNS lookups altogether.


Can you include an Insights query example to verify when this issue occurs?

SELECT average(cpuPercent) FROM SystemSample WHERE entityName = '[HOST_NAME]' FACET entityId TIMESERIES

This query will show two different facets if entityId changes over the time window you specify in the query.
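If you export the results and want to find exactly when the entityId switched, something like the sketch below could work. It assumes you have already flattened the results into a list of dicts with `timestamp` and `entityId` keys; that row shape and the function name `entity_id_changes` are assumptions for the example, not the actual Insights response format.

```python
from typing import Any, Dict, List, Tuple


def entity_id_changes(rows: List[Dict[str, Any]]) -> List[Tuple[Any, Any, Any]]:
    """Return (timestamp, old_id, new_id) for each entityId change.

    `rows` is a hypothetical, pre-flattened export: one dict per data
    point with 'timestamp' and 'entityId' keys.
    """
    ordered = sorted(rows, key=lambda row: row["timestamp"])
    changes = []
    previous = None
    for row in ordered:
        current = row["entityId"]
        if previous is not None and current != previous:
            changes.append((row["timestamp"], previous, current))
        previous = current
    return changes
```

An empty result means the host kept one entityId for the whole window; any entries mark the points where open violations would have been orphaned.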


Can New Relic set a longer time-out window than 24 hours? Sometimes we have to take a server offline for longer than that or maybe a repair takes longer than that.

Although I can’t comment on when or even if this feature will be released, our engineering team is looking at ways to allow entityId values to persist if a host returns with the same “fingerprint.” This would make it so that there is no more arbitrary 24-hour cutoff for entityId.


Can this article also be linked to the alerts?

I would encourage you to start a thread in our Feature Ideas section for both Infrastructure and Alerts. This will allow other customers to provide workaround suggestions or even vote on your idea. Product Managers are active in these forums, often judging which features to include next by looking to see which ones have the most votes.



I hope these answers help. Let us know if more questions crop up.