NR infra agent spamming "can't get sample from ProcessSampler"

Agent version: 1.0.292
Linux Distribution: red hat enterprise linux server release 6.8 (santiago)

Since starting the NR Infrastructure trial, our Linux servers' logs have been spammed with the error message “can’t get sample from ProcessSampler”. Some example rows from the log:

Nov 27 03:23:06 newrelic-infra: time=“2016-11-27T03:23:06+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/27660: no such file or directory”
Nov 27 04:06:46 newrelic-infra: time=“2016-11-27T04:06:46+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/13394: no such file or directory”
… 58 similar lines …
Nov 29 08:38:36 newrelic-infra: time=“2016-11-29T08:38:36+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/23028: no such file or directory”
Nov 29 09:31:16 newrelic-infra: time=“2016-11-29T09:31:16+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/16892: no such file or directory”

Our 24/7 infra operator asked what to do with these.

Is the agent misbehaving?

Is something more fundamental broken in the environment?

Or can we work around this (e.g. turn off this particular error reporting)?

Hi @markus.pelkonen ,

This error message would indicate that the process the Infrastructure agent is trying to collect metrics for is not available, or the agent does not have permission to access the stats on the process in /proc. If this happens many times in a short period of time, this could indicate permissions problems or an error with the agent.

However, it will also occur occasionally when the agent is functioning normally. In that case it indicates that, between the time the agent gathered process IDs and the time it gathered stats for those processes, this particular process ceased to exist. Perhaps it finished, stopped, failed, etc. and is no longer there, so there are no stats to collect on it.
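To make that race concrete, here is a minimal sketch of a two-phase sampler. It is not the agent's actual code: the function name `sample_processes` and the choice to read each process's `stat` file are assumptions for illustration only.

```python
import os


def sample_processes(proc_root="/proc"):
    """Hypothetical two-phase sampler: list PIDs, then read each stat file."""
    # Phase 1: snapshot the PIDs that exist right now.
    pids = [name for name in os.listdir(proc_root) if name.isdigit()]

    samples, vanished = {}, []
    for pid in pids:
        # Phase 2: any of these processes may have exited since phase 1.
        try:
            with open(os.path.join(proc_root, pid, "stat")) as f:
                samples[int(pid)] = f.read()
        except (FileNotFoundError, ProcessLookupError):
            # The process ended between the two phases. This is expected
            # for short-lived processes and is not a sampler failure.
            vanished.append(int(pid))
    return samples, vanished
```

On a host with many short-lived processes, `vanished` will regularly be non-empty even though nothing is wrong.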

In the logs that you sent, there are 30 minutes to 1 hour between most of the occurrences of this error. However, it looks like you collapsed many of them in the middle. If there are only a few milliseconds between occurrences, this could mean the agent is failing en masse to collect stats on processes. In that case, please confirm that the newrelic-infra process has read permissions to the /proc directory and that any process managers or security settings (AppArmor, SELinux, CageFS, etc.) you are using permit this process to run. A lot of very fast processes could also explain many of these messages in the logs, because they could be completing before the agent has a chance to collect their stats.
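One way to check the spacing between occurrences is to compute the gaps from the `time=` field of each line. This is a rough sketch, not an official tool. The helper name `gaps_seconds` is made up, and the regex accepts both straight and typographic quotes because this forum renders the quotes typographically.

```python
import re
from datetime import datetime

# Accept straight or typographic double quotes around the timestamp.
TIME_RE = re.compile(r'time=["“]([^"”]+)["”]')
SAMPLER_MSG = "get sample from ProcessSampler"


def gaps_seconds(lines):
    """Return the seconds between consecutive ProcessSampler errors."""
    stamps = []
    for line in lines:
        if SAMPLER_MSG not in line:
            continue  # ignore unrelated log entries
        match = TIME_RE.search(line)
        if match:
            stamps.append(datetime.fromisoformat(match.group(1)))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Gaps of minutes point to the benign race described above, while long runs of near-zero gaps would be worth a closer look at permissions.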

Great question! Let us know how it goes with resolving these error messages.

@Sarsaparilla, I’m replying on behalf of @markus.pelkonen as we work together.

Regarding permissions, the user running the newrelic-infra process has read access to /proc. So no problem there.

Regarding security settings, SELinux is installed but not used (getenforce shows Disabled). So no problem there.

Regarding how often the error occurs, we are seeing it around 3-10 times an hour. Our environment has processes starting and stopping frequently. We are receiving meaningful data in the Infrastructure web interface from all the relevant processes. So no problem here either.

Based on the above, I would concur that the agent is functioning normally and that the error level of this message is a bit exaggerated. I would suggest that New Relic consider lowering the log level of this message.

@mikko.niemi

Thanks for checking through those settings! Is this message occurring with the same process id? Or is it occurring with different processes?

This happens with different processes. There are probably many short-lived ones, maybe living less than 5 seconds, which I think is the sampling rate in the agent?

Two hours of /var/logs/messages for reference (server name changed), which may help to show the problem:
Dec 12 09:01:21 …server… newrelic-infra: time=“2016-12-12T09:01:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/28166: no such file or directory”
Dec 12 09:03:51 …server… newrelic-infra: time=“2016-12-12T09:03:51+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/30431: no such file or directory”
Dec 12 09:04:21 …server… newrelic-infra: time=“2016-12-12T09:04:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/30871: no such file or directory”
Dec 12 09:06:51 …server… newrelic-infra: time=“2016-12-12T09:06:51+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/1085: no such file or directory”
Dec 12 09:07:21 …server… newrelic-infra: time=“2016-12-12T09:07:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/1711: no such file or directory”
Dec 12 09:08:51 …server… newrelic-infra: time=“2016-12-12T09:08:51+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/3218: no such file or directory”
Dec 12 09:09:21 …server… newrelic-infra: time=“2016-12-12T09:09:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/4174: no such file or directory”
Dec 12 09:10:21 …server… newrelic-infra: time=“2016-12-12T09:10:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/5920: no such file or directory”
Dec 12 09:11:31 …server… newrelic-infra: time=“2016-12-12T09:11:31+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/7320: no such file or directory”
Dec 12 09:12:21 …server… newrelic-infra: time=“2016-12-12T09:12:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/8244: no such file or directory”
Dec 12 09:14:21 …server… newrelic-infra: time=“2016-12-12T09:14:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/10668: no such file or directory”
Dec 12 09:15:51 …server… newrelic-infra: time=“2016-12-12T09:15:51+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/12175: no such file or directory”
Dec 12 09:16:21 …server… newrelic-infra: time=“2016-12-12T09:16:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/12745: no such file or directory”
Dec 12 09:16:42 …server… newrelic-infra: time=“2016-12-12T09:16:42+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/13098: no such file or directory”
Dec 12 09:16:51 …server… newrelic-infra: time=“2016-12-12T09:16:51+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/13313: no such file or directory”
Dec 12 09:17:21 …server… newrelic-infra: time=“2016-12-12T09:17:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/13944: no such file or directory”
Dec 12 09:17:51 …server… newrelic-infra: time=“2016-12-12T09:17:51+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/14459: no such file or directory”
Dec 12 09:28:32 …server… newrelic-infra: time=“2016-12-12T09:28:32+02:00” level=error msg=“metric sender can’t process 0 times” error="InventoryIngest: events were not accepted: 500 500 Internal Server Error "
Dec 12 10:22:21 …server… newrelic-infra: time=“2016-12-12T10:22:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/23177: no such file or directory”
Dec 12 10:25:41 …server… newrelic-infra: time=“2016-12-12T10:25:41+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/27872: no such file or directory”
Dec 12 10:26:51 …server… newrelic-infra: time=“2016-12-12T10:26:51+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/29715: no such file or directory”
Dec 12 10:27:31 …server… newrelic-infra: time=“2016-12-12T10:27:31+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/31301: no such file or directory”
Dec 12 10:29:21 …server… newrelic-infra: time=“2016-12-12T10:29:21+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/3760: no such file or directory”
Dec 12 10:45:11 …server… newrelic-infra: time=“2016-12-12T10:45:11+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/19562: no such file or directory”
Dec 12 10:55:31 …server… newrelic-infra: time=“2016-12-12T10:55:31+02:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/30813: no such file or directory”

@markus.pelkonen

Thanks for attaching the log. Since these are different processes that are timing out, it appears that the Infrastructure agent is reporting as normal. If this were happening consistently for a single process, it would indicate a possible issue with metrics not reporting correctly.
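To check whether the errors cluster on one process ID, the PIDs can be tallied from the `error=` field. This is only an illustrative sketch; the helper name `missing_pid_counts` is made up here.

```python
import re
from collections import Counter

PID_RE = re.compile(r"open /proc/(\d+): no such file or directory")


def missing_pid_counts(lines):
    """Count how often each PID appears in 'no such file' errors."""
    counts = Counter()
    for line in lines:
        match = PID_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Calling `missing_pid_counts(lines).most_common(5)` on a log excerpt shows whether any single PID keeps recurring or whether every occurrence is a different, already-exited process.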

As @Sarsaparilla mentioned, this usually happens when a process finishes, stops, or fails after the agent has collected its process ID but before the agent gathers its stats, leaving no stats to collect.

Is there any way to alter the logging level for this particular scenario? Typically /var/logs/messages is quite silent, and when something does appear, our operators react to it. Or should we ask them to adapt their processes, e.g. to filter out all newrelic-infra errors with this message?

Thanks for the feedback; we just asked our operator to ignore these entries.
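For anyone who would rather filter these entries out before an operator sees them, a minimal sketch follows. It works purely on the log text shown in this thread and does not use any agent setting. The function name `drop_benign` is hypothetical.

```python
import re

# Matches only newrelic-infra ProcessSampler errors caused by a /proc
# entry that no longer exists. Other newrelic-infra errors are kept.
BENIGN_RE = re.compile(
    r"newrelic-infra: .*get sample from ProcessSampler.*"
    r"open /proc/\d+: no such file or directory"
)


def drop_benign(lines):
    """Return the log lines with the benign ProcessSampler noise removed."""
    return [line for line in lines if not BENIGN_RE.search(line)]
```

The pattern is deliberately narrow so that a ProcessSampler error with a different cause, such as a permissions failure, still reaches the operator.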

The log says there was an error, but because it isn’t actionable and is unavoidable, it really seems to be a design issue. We aggregate thousands of these from our servers every day. A configuration option to silence these messages would be much appreciated. Ideally, logs don’t contain errors that don’t need to be investigated.

time=“2017-08-01T09:53:24-04:00” level=error msg=“can’t get sample from ProcessSampler” error=“open /proc/44959: no such file or directory”