Infrastructure alerts not autoclosing

Hi all,

In the past couple of weeks I’ve noticed quite a few infrastructure agent alerts (i.e. host not reporting) failing to close even though the panels are back online and reporting data.

Has anyone else experienced this? The whole experience has felt very buggy over the past few months, and now I keep having to go over a number of nodes and manually close the violations before they crop up again.

Please see the example below. Nothing has changed in our setup, and these units have been reporting fine for the last few months.

Hi @lee9 - Thanks for reporting this. We currently have some open issues with Infrastructure Host Not Reporting alert conditions. There has also been a change to how these conditions work. Previously, a host not reporting condition looked for the presence of an actual agent disconnect event. This is no longer the case. We now watch signals from a faceted NRQL query, and when those signals disappear, violations open as appropriate.
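To make the loss-of-signal behavior described above concrete, here is a minimal Python sketch (hypothetical names, not New Relic's actual implementation): each facet of the NRQL query, here a hostname, is treated as a signal, and a signal that stops producing data for longer than the condition's window opens a "host not reporting" violation.

```python
from datetime import datetime, timedelta

# Illustrative sketch: each facet (hostname) is a "signal". A signal absent
# for longer than the threshold opens a host-not-reporting violation.

def lost_signals(last_seen, now, threshold):
    """Return hosts whose signal has been absent longer than the threshold.

    last_seen: dict mapping hostname -> datetime of the last data point
    now:       current datetime
    threshold: timedelta (e.g. the condition's 10-minute window)
    """
    return sorted(
        host for host, seen in last_seen.items()
        if now - seen > threshold
    )

now = datetime(2021, 6, 1, 12, 0)
last_seen = {
    "web-01": now - timedelta(minutes=2),   # still reporting
    "web-02": now - timedelta(minutes=15),  # signal lost -> violation opens
}
print(lost_signals(last_seen, now, timedelta(minutes=10)))  # ['web-02']
```

The key point is that the condition no longer reacts to an explicit disconnect event; it reacts to the *absence* of expected data points, which is why a changed or vanished facet can open a violation even when the host itself is fine.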

This means a significant change in behavior with regard to tags. A host and its tags now constitute a signal, and a change to the signal can result in an unexpected violation. For example, say you have a condition set up to include hosts where the tag name equals xxx. If you change the condition so that it instead targets hosts where name equals yyy, the previous signals will disappear, and host not reporting violations will open for the hosts that were previously targeted. To work around this behavior, you can disable the alert condition before updating the tags, allowing it to start fresh without the baggage of prior signals.
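The tag-change scenario above can be sketched as a small simulation (hypothetical names, purely illustrative): the condition tracks the set of hosts matched by its filter, and when the filter changes, previously matched hosts drop out of the signal set and open violations, unless the tracked state is cleared first (the disable/re-enable workaround).

```python
# Illustrative sketch: hosts matched before but not after a filter change
# "lose signal", opening host-not-reporting violations for them -- unless
# the tracked signals are cleared first by disabling the condition.

def signals(hosts, tag_value):
    """Hosts currently matched by the condition's filter (name == tag_value)."""
    return {h for h, tags in hosts.items() if tags.get("name") == tag_value}

def update_filter(tracked, hosts, new_value, disabled_first=False):
    """Return (new_tracked, violations_opened) after changing the filter."""
    if disabled_first:
        tracked = set()           # workaround: start fresh, no prior signals
    new = signals(hosts, new_value)
    violations = tracked - new    # previously tracked signals that vanished
    return new, violations

hosts = {"host-a": {"name": "xxx"}, "host-b": {"name": "yyy"}}
tracked = signals(hosts, "xxx")   # condition currently targets host-a

# Changing the filter in place: host-a's signal disappears -> violation.
_, opened = update_filter(tracked, hosts, "yyy")
print(sorted(opened))             # ['host-a']

# Disabling the condition first clears prior signals -> no violation.
_, opened = update_filter(tracked, hosts, "yyy", disabled_first=True)
print(sorted(opened))             # []
```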

There is also an open issue that we are working to address where a host may start up without having collected all of its metadata. If a host is missing a tag on startup, but eventually gains that tag after a period of time, an alert condition with an exclusionary filter on that tag will pick up the host as a signal and subsequently open a violation when the tag is added and the signal is lost.
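The startup-metadata race described above can be replayed as a tiny timeline simulation (hypothetical tag names, not New Relic internals): a condition with an exclusionary filter initially matches a host that booted before its tag was collected, tracks it as a signal, and then opens a violation the moment the tag arrives and the host stops matching.

```python
# Sketch of the startup-metadata race: a condition excludes hosts where
# env == 'staging'. A staging host that starts up before its 'env' tag is
# collected initially matches the filter (no tag yet), becomes a tracked
# signal, then loses signal once the tag arrives -> violation opens.

def matches(tags):
    """Exclusionary filter: include any host whose env tag is not 'staging'."""
    return tags.get("env") != "staging"

def run(timeline):
    """Replay tag snapshots; return the step at which a violation would open."""
    tracked = False
    for step, tags in enumerate(timeline):
        m = matches(tags)
        if tracked and not m:
            return step           # signal lost -> violation opens here
        tracked = tracked or m
    return None                   # no violation opened

# Step 0: host boots with no tags yet; step 1: 'env' tag finally collected.
print(run([{}, {"env": "staging"}]))       # 1 -> violation when tag arrives
print(run([{"env": "staging"}]))           # None -> never matched, no violation
```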

Thanks dokyano,

It’s causing me issues: I’ve had a site reported as offline more than 20 times in the last few weeks, which is affecting my uptime reporting, as the panel itself hasn’t been offline at all.

It’s also very strange that some violations auto-close very quickly, but then a few minutes later the same violation opens again and doesn’t close. I have no historic open violations for these sites, yet it’s still opening unnecessary violations and ignoring the fact that it’s receiving data back consistently. During business hours it isn’t an issue, as these can be closed off manually - but over a weekend I’ll suddenly have 20 hours of downtime when no such issue exists.

Just to add: while on the one hand the IA is reporting issues that don’t exist, on the other I’m not receiving reports of actual issues where processes are not running, despite having the necessary policies in place and these having reported in previously.

I have to say over the last 2-3 months New Relic has really regressed and I find it infuriating as I can’t trust this product any more.

Whatever has been done in the background in the recent updates has been to the absolute detriment of the product.

I have hundreds of hours of ‘Infrastructure Agent not reporting’ when no downtime was experienced, while processes I’m actively monitoring are not running for hours and nothing is notified to me.

One more point just to highlight the futility of this …

3 violations opened in the last hour; the first two were closed manually. As you can see from the data at the bottom, this agent has been reporting constantly throughout. This should not be an alert.

Hello - is there an update on this?

The way the IA is behaving has rendered New Relic useless for our purposes and I’m going to have to look at alternative solutions in its place.

Hello again - New Relic is continuing to deteriorate: a host of screens are now appearing with a grey dot in the list view in Explorer, even though they’re continuing to send data.

This means that these don’t appear in the Navigator view, and I assume they will not report any alerts I’ve set up for them.

I’ve attached two screenshots showing this; the second image is the information being fed back from the site highlighted in red in the first image.

As you can see when I access the node there are no violations open and the information is being fed back correctly - what is going on please? Why is my New Relic experience so bad?

How has it continued to degrade so much?

Hi @lee9 - Sorry to hear you are running into so many troubles here. Can you provide links to the alert conditions you are having trouble with? From what you are describing, it sounds like you have a Host Not Reporting condition and a Process Running alert condition.

Hi dkoyano and thank you for your reply.

All the issues I’m experiencing currently are down to the infrastructure agent - link to the policy is here

To me it seems that the information is being collected correctly by the IA, but the way it is interpreted in the New Relic UI is completely wrong.

In the policy above there are two conditions, though the top one is disabled and only the bottom one is active. As you’ll see, this is just a basic ‘Host not reporting’ alert set to 10 minutes.

The issue is that the UI is returning grey dots on a bunch of sites, which then removes them from Navigator view. When accessing these sites in Explorer you can see data being returned fine.

In addition we have a lot of sites opening up as critical violations of the ‘host not reporting’ condition even though data is continuing to be fed in without interruption.

Hi all - is there anything you can suggest to rectify this issue, please?

I would be absolutely delighted to do a screen share, as these problems keep persisting and I can’t use New Relic to accurately report on units with false alerts and false downtime.

Your help would be most appreciated on this matter as this has been persisting for weeks now and shows no signs of improvement.