A few active hosts turned gray with status “not reporting”

We have 10 Linux hosts reporting metrics to the NR dashboard via the NR monitoring agent/service running on each host. Suddenly, on Friday 03/25, the status of a few of them changed from GREEN to GRAY.

I have checked the service status and it is running properly. I also tried restarting the service, but that didn’t work. To rule out connectivity problems, I validated the network/firewall settings these servers need for the NR monitoring agent to work (outbound access to NR domains, networks, and ports), and they are all good. No recent changes were made, and everything is set up as it was while the hosts were GREEN.
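For anyone repeating the outbound check described above, here is a minimal sketch. It assumes `curl` is installed on the host. `check_outbound_https` is a hypothetical helper name, and the hostname is left as an argument because the thread does not list the exact New Relic endpoints; take those from New Relic's networking documentation. Any HTTP response, even an error status, is treated as proof that the firewall allows the connection.

```shell
# Hypothetical helper: verify that this host can open an HTTPS connection
# to a given endpoint. Any HTTP status code counts as "reachable", because
# the goal is only to confirm that outbound traffic is not blocked.
check_outbound_https() {
  host="$1"
  if [ -z "$host" ]; then
    echo "usage: check_outbound_https <hostname>" >&2
    return 2
  fi
  # curl exits non-zero when DNS, TCP, or TLS fails (or on timeout).
  if code=$(curl -sS -o /dev/null --connect-timeout 10 --max-time 20 \
      -w '%{http_code}' "https://${host}/"); then
    echo "reachable: ${host} (HTTP ${code})"
  else
    echo "unreachable: ${host}" >&2
    return 1
  fi
}

# Example (replace the placeholder with an endpoint from the New Relic docs):
#   check_outbound_https "<new-relic-endpoint>"
```

If the hosts reach the internet through a proxy, the same check would need curl's proxy settings as well.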

Interestingly, I can see that the servers are still reporting data to NR: the numbers for CPU utilization, memory/disk used, network traffic, etc. keep changing.

Now, the question is why they turned GRAY with “no data reporting” when they actually are reporting. Is this an NR platform issue, or did something happen on our side that needs to be fixed? Also, where can I find log entries for the NR monitoring or Infrastructure agent on the server?

$ rpm -qa | grep newrelic-infra
newrelic-infra-1.14.2-1.el7.x86_64

$ sudo systemctl status newrelic-infra

newrelic-infra.service - New Relic Infrastructure Agent
Loaded: loaded (/etc/systemd/system/newrelic-infra.service; enabled; vendor preset: disabled)
Active: active (running) since Sat 2022-03-12 11:21:16 EST; 1 weeks 6 days ago
Main PID: 1640 (newrelic-infra-)
Tasks: 21
Memory: 83.8M (limit: 1.0G)
CGroup: /system.slice/newrelic-infra.service
|-1640 /usr/bin/newrelic-infra-service
`-1653 /usr/bin/newrelic-infra
Mar 23 16:14:15 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-23T16:14:15-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 24 05:27:39 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-24T05:27:39-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 24 07:25:39 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-24T07:25:39-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 24 12:18:59 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-24T12:18:59-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 24 12:23:28 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-24T12:23:28-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 24 21:41:44 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-24T21:41:44-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 24 23:07:49 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-24T23:07:49-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 25 06:14:15 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-25T06:14:15-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 25 06:20:10 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-25T06:20:10-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Mar 25 12:42:34 atlxxxxxxxxx newrelic-infra-service[1640]: time=“2022-03-25T12:42:34-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
Hint: Some lines were ellipsized, use -l to show in full.
[harsharm@atlxxxxxxxxx ~]$ sudo systemctl restart newrelic-infra

$ sudo systemctl status newrelic-infra

newrelic-infra.service - New Relic Infrastructure Agent
Loaded: loaded (/etc/systemd/system/newrelic-infra.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-03-25 16:06:08 EDT; 2min 48s ago
Main PID: 5952 (newrelic-infra-)
Tasks: 20
Memory: 42.0M (limit: 1.0G)
CGroup: /system.slice/newrelic-infra.service
|-5952 /usr/bin/newrelic-infra-service
`-5959 /usr/bin/newrelic-infra
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=services/supervisord
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=system/network_interfaces
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=kernel/sysctl
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=kernel/modules
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=services/pidfile
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=config/sshd
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=packages/rpm
Mar 25 16:06:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=config/selinux
Mar 25 16:06:15 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:06:15-04:00” level=info msg=“connect got id” agent-guid=Mjk4NzAzOXxJTkZSQXxOQXw4OTg2OTk2NDE1NDQ1NjU1NjU4 agent-id=898699641544565565…ConnectService
Mar 25 16:07:14 atlxxxxxxxxx newrelic-infra-service[5952]: time=“2022-03-25T16:07:14-04:00” level=info msg=“Integration health check finished with success” component=integrations.runner.Runner integration_name=nri-docker
Hint: Some lines were ellipsized, use -l to show in full.
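
The closing hint means `systemctl status` truncated the long log lines, which is why the error URL above is cut off. The entries shown by `systemctl status` come from the systemd journal, so on a host like this one `journalctl -u newrelic-infra` is one place to read the agent's messages in full. Below is a small sketch for summarizing them, assuming the `time=... level=...` layout seen in the output above. `count_agent_errors_by_day` is a hypothetical helper name, not part of the agent.

```shell
# Hypothetical helper: read agent log lines on stdin and print
# "<date> <count>" for every day that has level=error entries.
# Assumes each line carries a time=<YYYY-MM-DD...> field as shown above.
count_agent_errors_by_day() {
  grep 'level=error' \
    | sed -n 's/.*time=[^0-9]*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' \
    | sort | uniq -c | awk '{print $2, $1}'
}

# Usage on the host (full, untruncated lines from the journal):
#   journalctl -u newrelic-infra --since "2022-03-23" --no-pager | count_agent_errors_by_day
# Same status view as above, but without ellipsized lines:
#   systemctl status -l newrelic-infra
```

A per-day count makes it easier to see whether the “metric sender can’t process” errors cluster around the time the hosts turned gray or occur at the same low rate throughout.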

Hi @dl.raceday-support

The gray health status can also mean that there are no alert conditions related to the entity, as you can check here.

Probably the alert conditions related to the entities were disabled or deleted.

I’ll be happy to check them for you. To do this, please just send me the link.

thanks

Rodrigo

Hello Rodrigo @rcorgozinho , Thanks for your response.

I am using team meetings, if that is fine with you. Please share the best time to connect. I am available from 8 am to 6 pm Central Time.

Where exactly in NR can I find the associated alert conditions for the hosts, or create new ones?

Regards,
Harsh

Hi, @dl.raceday-support: Are you the same person as @harsh.sharma1? It looks like you have this same question in another topic:

If you are the same person, I would like to combine these topics, so we do not have the same discussion happening in two places. Thank you!

@philweber Oh yeah sure, please feel free to combine these two under this account # @dl.raceday-support

Thanks !

Hi all, we have a few hosts added in NR, and all of a sudden a few of them are greyed out while the rest still show as green.

If I click on a grey host, it shows NOT REPORTING. I have checked “systemctl status newrelic-infra” on the host server. It’s up and running. I restarted it as well.

Any suggestions, please?

Hello @harsh.sharma1

Thank you for getting in touch.

Have you clicked on the greyed-out hosts to check whether metrics are being sent?

If you go to New Relic One >> Infrastructure >> Hosts and click on the greyed-out hosts you mentioned, are they reporting metrics?

I hope it helps.

Hi @vhenrique, thanks for your message.

The question you asked: if you go to New Relic One >> Infrastructure >> Hosts and click on the greyed-out hosts, are they reporting metrics?

Answer: Yes, when I open any of the greyed-out hosts, I can see variation in metrics such as CPU/memory/storage usage and network Tx/Rx traffic.

The metric values differ from the ones I noted previously, meaning they are changing continuously, just as they did when the hosts were GREEN. But these hosts are still grey now.

Does that mean the NR Infra monitoring agent is working fine and the greyed-out host is still reporting to NR?

Also, where are the NR infra monitoring logs located on the host server (Linux)? I would like to cat the logs to find the exact error message, if any.

Sorry for all such questions, I am new to NR and this is my first time encountering this issue.

FYI - The health status transitioned from GREEN to GRAY

I tried restarting the “newrelic-infra” service on the impacted host, but that didn’t change anything.

Before:
sudo systemctl status newrelic-infra

  • newrelic-infra.service - New Relic Infrastructure Agent
    Loaded: loaded (/etc/systemd/system/newrelic-infra.service; enabled; vendor preset: disabled)
    Active: active (running) since Sat 2022-03-12 11:31:38 EST; 1 weeks 6 days ago
    Main PID: 1661 (newrelic-infra-)
    Tasks: 22
    Memory: 88.9M (limit: 1.0G)
    CGroup: /system.slice/newrelic-infra.service
    |-1661 /usr/bin/newrelic-infra-service
    `-1675 /usr/bin/newrelic-infra
    Mar 23 14:19:08 newrelic-infra-service[1661]: time=“2022-03-23T14:19:08-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
    Mar 23 15:53:04 newrelic-infra-service[1661]: time=“2022-03-23T15:53:04-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new
    Mar 24 05:24:49 newrelic-infra-service[1661]: time=“2022-03-24T05:24:49-04:00” level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post "https://infra-api.new

AFTER RESTART:

sudo systemctl status newrelic-infra

  • newrelic-infra.service - New Relic Infrastructure Agent
    Loaded: loaded (/etc/systemd/system/newrelic-infra.service; enabled; vendor preset: disabled)
    Active: active (running) since Fri 2022-03-25 16:06:08 EDT; 2min 48s ago
    Main PID: 5952 (newrelic-infra-)
    Tasks: 20
    Memory: 42.0M (limit: 1.0G)
    CGroup: /system.slice/newrelic-infra.service
    |-5952 /usr/bin/newrelic-infra-service
    `-5959 /usr/bin/newrelic-infra

Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=services/supervisord
Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=system/network_interfaces
Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=kernel/sysctl
Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=kernel/modules
Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=services/pidfile
Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=config/sshd
Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=packages/rpm
Mar 25 16:06:14 newrelic-infra-service[5952]: time=“2022-03-25T16:06:14-04:00” level=info msg=“Agent plugin” plugin=config/selinux
Mar 25 16:06:15 newrelic-infra-service[5952]: time=“2022-03-25T16:06:15-04:00” level=info msg=“connect got id” agent-guid=Mjk4NzAzOXxJTkZSQXxOQXw4OTg2LYUOaNDE1NDQ1NjU1NjU4 agent-id=898699641544565565…ConnectService
Mar 25 16:07:14 newrelic-infra-service[5952]: time=“2022-03-25T16:07:14-04:00” level=info msg=“Integration health check finished with success” component=integrations.runner.Runner integration_name=nri-docker
Hint: Some lines were ellipsized, use -l to show in full.

Hello @harsh.sharma1 ,

Thank you for pasting your logs and query!

The infra agent is working as expected: metrics are available, and the logs look fine here.

The error level=error msg=“metric sender can’t process” component=MetricsIngestSender error="error sending events: Post " is common behavior of the infra agent as well.

The grey icon is shown because that entity or host might not be linked to any alert, as mentioned in the reply above.

Best Regards
yashaswi verma

Thanks Yashaswi @yverma1 for confirming that. Could you please help me understand where we can check, create, or link the host to an alert in NR, to validate whether it turns GREEN again?
Also, could you please help me understand where I can explore the New Relic logs on the host server?

Thanks,
Harsh

Hello @harsh.sharma1, you can see how to create an alert here in the ALERTS DOC.

Please set up an alert for that entity; if the status then shows green or red, it is working.

Best Regards
Yashaswi verma

Thanks for the reply Yashaswi @yverma1
We do have existing policies that have been running for a year for CPU, memory, and disk usage, and they still exist. Now, only 3 out of 10 hosts show green.

Attached is a screenshot of an existing policy.

I also tried creating a new one and have a question: can’t I add all hosts to one policy? Under “FILTER HOSTS” it only lets me either filter for a specific host or leave the filter as it is. To cover all hosts I left it as it is, and nothing changed; the other hosts are still gray. FYI, the old policies have the same settings, and still only 3 out of 10 hosts are green.

FYI, I tried adding an extra alert condition for one of the hosts and that didn’t work. It is still gray.

Regards
HS

@vhenrique @rcorgozinho @yverma1

Is there anything else we can try here? I tried creating a new alert condition for one of the hosts, and that didn’t work. I tried the same for one of the staging hosts under a different account, and it worked.

But this is not working with the hosts that turned gray recently.

Update: I tried disabling all the existing alert conditions and creating a new one with all hosts added again, but the issue is still the same. The green hosts are still green after disabling all the alert conditions, and the gray ones are still gray.

Do you think it’s something on the New Relic side? I think the status is not refreshing, because even with no alert condition in place the hosts are still green, whereas (as in the staging env) they should turn GRAY when you delete/disable the alert conditions.

FYI, the newrelic-infra service is up and running on each host.

Hi @rcorgozinho

Here are the links for my alert conditions (FYI: they were working before, and 3 out of 10 hosts are still green):

I tried creating/enabling a new test alert condition, but that didn’t work. Disabling and re-enabling didn’t work either. NR is not even triggering any new condition for these hosts.

@dl.raceday-support - Thanks for the links. I created a support ticket so we can investigate this further. I will be reaching out to you for any additional information needed for our Engineering team.

Thanks Derrik @dkoyano
Looking forward to your response. According to the New Relic documentation, the main reason for the Gray icon is that a host is not linked to any alert condition, or that the condition is disabled. Below are the troubleshooting actions performed so far:

  1. First, to test the alert condition functionality from scratch, I used our staging account, where I created a test alert condition to monitor the CPU threshold on all staging hosts.

  2. The moment it was created, the Gray icon for all included hosts changed to Green, and it changed back to Gray once the condition was deleted.

  3. The Prod account, where this issue is happening, has 3 existing alert conditions. To start with, I disabled those, created new conditions with alert policies (CPU/Mem/Disk), and attached them to all production hosts.

  4. The issue remained the same; the Gray hosts stayed Gray.

  5. Interestingly, even with ZERO alert conditions in place, the Green nodes were still Green, although they should also have turned Gray once their monitoring was off.

  6. It could be a platform or UI issue, although the same steps worked in the staging environment.

  7. FYI: the infra agent is working as expected, as metrics are available and are continuously reported to NR.

Hey there @dl.raceday-support and @dkoyano - Don’t forget to update us here when you get this sorted so we can share the solution with the community!

NR refreshed the host entities for my account so that they show as online again.

There was an issue on the New Relic backend where the health status was stuck on grey despite metrics being received from the agent. According to NR, this happens very rarely, and their engineering/development team is working on a permanent fix.

So glad to hear this got resolved @dl.raceday-support and I am very grateful to you for sharing the outcome!

Hello, @dl.raceday-support

I have the same problem, what should I do to solve it?