How to alert on multiple hosts not reporting?


#1

There are 19 nodes in a cluster. I'd like to create an alert that triggers if any 3 of the hosts in a group fail to report. Is this possible?

Following this guide: https://docs.newrelic.com/docs/infrastructure/new-relic-infrastructure/infrastructure-alert-conditions/create-infrastructure-host-not-reporting-condition, I can create an alert condition on any host in a group, but it alerts as soon as any single host fails to report for x minutes.

I'd like to alert only when the number of hosts that fail to report exceeds some threshold over a given period of time.

Thank you


#2

Hi @Jerrold.Simbulan, you could use a NRQL alert condition in this case, with a query like:

SELECT uniqueCount(hostname) FROM StorageSample WHERE totalUtilizationPercent = 0

Then set the threshold so the condition opens a violation when the query returns a value above 2 (that is, 3 or more hosts).


#3

@Jerrold.Simbulan, just an update on this: that exact query won't work, because totalUtilizationPercent will actually be null, not 0, if the agent isn't reporting. I've done some testing and come up with a workaround.

If you instead query InfrastructureEvent for hosts whose summary contains "Agent disconnected", this works:

SELECT uniqueCount(hostname) FROM InfrastructureEvent WHERE summary LIKE '%Agent disconnected%'

uniqueCount(hostname) counts each host only once, even if it logged several disconnect events in the time window.

Set the threshold to "query returns a value above 2 at least once in…" with a time window that makes sense for you; I used 5 minutes. I also set the Evaluation Offset to 5 minutes, just to be safe.
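To make the counting and the threshold concrete, here is a minimal Python sketch of what the condition evaluates over one window. It assumes the disconnect events can be represented as simple (timestamp, hostname, summary) records; the `Event` type, the `unique_disconnected_hosts` and `violates` helpers, and the sample hostnames are all hypothetical stand-ins, not New Relic APIs. Like `uniqueCount(hostname)`, it counts each host once no matter how many "Agent disconnected" events it logged, and it reports a violation only when that count is above 2.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical stand-in for one InfrastructureEvent row.
    timestamp: int  # seconds since some reference point
    hostname: str
    summary: str


THRESHOLD = 2  # "query returns a value above 2"


def unique_disconnected_hosts(events, start, end):
    """Approximate uniqueCount(hostname) ... WHERE summary LIKE '%Agent disconnected%'
    over the half-open window [start, end), using a plain substring match."""
    return len({
        e.hostname
        for e in events
        if start <= e.timestamp < end and "Agent disconnected" in e.summary
    })


def violates(events, start, end, threshold=THRESHOLD):
    # A violation opens only when strictly more than `threshold` hosts disconnected.
    return unique_disconnected_hosts(events, start, end) > threshold


events = [
    Event(10, "node-01", "Agent disconnected"),
    Event(20, "node-01", "Agent disconnected"),  # same host twice: counted once
    Event(30, "node-02", "Agent disconnected"),
    Event(40, "node-03", "Agent connected"),     # not a disconnect, ignored
    Event(50, "node-04", "Agent disconnected"),
]

print(unique_disconnected_hosts(events, 0, 300))  # 3
print(violates(events, 0, 300))                   # True
```

A shorter window such as `[0, 25)` only sees `node-01`, so the count is 1 and no violation opens.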

This won't tell you exactly which hosts are down, but you can follow the Insights link in the incident email and change the NRQL to:

SELECT hostname FROM InfrastructureEvent WHERE summary LIKE '%Agent disconnected%' SINCE ... UNTIL ...

This gives you a list of the hosts that most recently reported "Agent disconnected".

I hope this helps! Let me know how you get on with this 🙂


#4

It may actually be better to use the "sum of query results" threshold over a longer period of time, to make sure it captures all disconnects.
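The reasoning behind this suggestion can be sketched in Python. Assume, for illustration only, that the condition runs the query once per minute over a one-minute window, producing one value per minute. If three hosts disconnect in three different minutes, no single minute exceeds 2, so an "at least once in" threshold never fires, while summing the per-minute values over the longer window reaches 3. The per-minute values are made-up sample data and the helper names are hypothetical.

```python
# Hypothetical per-minute results of
#   SELECT uniqueCount(hostname) FROM InfrastructureEvent
#   WHERE summary LIKE '%Agent disconnected%'
# over a 5-minute window: three hosts disconnected, each in a different minute.
per_minute_counts = [1, 0, 1, 0, 1]

THRESHOLD = 2  # alert when the value is above 2


def at_least_once_above(values, threshold):
    """'At least once in' style: fire if any single minute exceeds the threshold."""
    return any(v > threshold for v in values)


def sum_above(values, threshold):
    """'Sum of query results' style: fire if the total over the window exceeds it."""
    return sum(values) > threshold


print(at_least_once_above(per_minute_counts, THRESHOLD))  # False
print(sum_above(per_minute_counts, THRESHOLD))            # True
```

One trade-off to keep in mind: summing per-minute counts can count the same host twice if it disconnects in more than one minute of the window, so the sum is an upper bound on the number of distinct hosts rather than an exact count.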


#5

@rdouglas OK, I'll give this a shot and report back!


#6

Thanks @Jerrold.Simbulan, looking forward to hearing how that goes.