Newrelic-infra agent using 20% of CPU


#1

Permalink: https://infrastructure.newrelic.com/accounts/1446325/processes?timeStart=1554908106000&timeEnd=1554911706281

  • Which version of Windows or which distribution of Linux are you using? Debian 8 (Jessie)

  • What language version of the APM agent are you using? PHP (We have applications on PHP7.0 and PHP5.6)

  • What is your Infrastructure Subscription level? Essentials or PRO? Essentials

  • Describe what you are seeing. How does this differ from what you expected to see? I am seeing the newrelic-infra agent using most of the CPU on one of our hosts. This is reported in New Relic and is consistent with reporting tools on the OS (top). The agent stays at 20-25% overhead. Is this expected? Is it the result of the high number of applications on this server (60)? I don’t expect to pay 20% overhead to monitor my processes…


#2

Hi @trey, this is not expected behaviour, though I notice you are using v1.0.898 of the Infrastructure agent, which is quite old. I’d recommend updating to the newest version first and then seeing whether CPU usage improves.


#3

Thank you for your response. At first glance updating appears to have solved the problem. Will keep an eye on it and check back in if it reappears.


#4

Thanks @trey, let us know how it goes.


#5

@rdouglas Checking back in, the issue is popping up again. newrelic-infra is also generating several dozen GB of logs per day. Excerpt:

time="2019-04-16T15:49:20-04:00" level=error msg="could not queue event" error="Could not queue event: Queue is full."

That is the only message. Here is my agent config:

license_key: ****************
log_file: /var/log/newrelic/newrelic-infra.log
log_to_stdout: false
event_queue_depth: 5000

Obviously something is wrong. Would love some more insight. Thank you!
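
One way to check that this really is the only message in the log is to tally the lines by level and message. The following is a minimal sketch, not an official New Relic tool: it assumes every line uses the key=value layout of the excerpt above, and the helper names `parse_line` and `tally_messages` are hypothetical.

```python
import shlex
from collections import Counter


def parse_line(line):
    """Split one key=value log line into a dict.

    Quoted values may contain spaces; shlex removes the quotes.
    Lines that cannot be tokenized (e.g. unbalanced quotes) yield {}.
    """
    try:
        tokens = shlex.split(line)
    except ValueError:
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:  # keep only tokens that actually contain '='
            fields[key] = value
    return fields


def tally_messages(lines):
    """Count how often each (level, msg) pair occurs in an iterable of lines."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields:
            counts[(fields.get("level"), fields.get("msg"))] += 1
    return counts


# The excerpt from this post, used as sample input.
sample = (
    'time="2019-04-16T15:49:20-04:00" level=error '
    'msg="could not queue event" '
    'error="Could not queue event: Queue is full."'
)

for (level, msg), count in tally_messages([sample, sample]).items():
    print(level, msg, count)
```

To run it against a real log, pass an open file object (for example the path set in `log_file`) to `tally_messages` instead of the sample list.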


#6

Hello Trey,

Jumping in here for @rdouglas, who is out. We have a configuration option available for the "Could not queue event: Queue is full." messages.

In the newrelic-infra.yml file, please add the following line:

event_queue_depth: 2000

You can increase it in steps of 1000 until the messages go away. This increases the size of the on-disk cache file.

It’s also important to remember that CPU usage on the Processes tab is shown on a per-core basis. So it’s a percentage of a single core, not of all the CPUs on the host.

Please let me know if you have any questions.

Regards,

Paul


#7

Hi Paul, and thanks for your reply. Note that I’m already at 5000 on queue depth. Is it a good idea to continue to increase it?

Perhaps I could have been clearer, but when I say 20% I’m talking about total system load. The agent consistently sits between 40% and 80% of a single core, which again seems like way too much overhead for monitoring.

I will continue to monitor and check back in.
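
The per-core versus whole-host distinction discussed above comes down to a division by the number of logical cores. The sketch below illustrates that arithmetic only; the thread does not state how many cores the host has, so the core counts used here are hypothetical, as is the function name `host_cpu_percent`.

```python
def host_cpu_percent(per_core_percent, logical_cores):
    """Convert a 'percent of one core' figure into 'percent of the whole host'."""
    if logical_cores < 1:
        raise ValueError("logical_cores must be at least 1")
    return per_core_percent / logical_cores


# 40-80% of a single core is the range reported in this thread.
# The core counts below are assumptions for illustration only.
for cores in (2, 4):
    low = host_cpu_percent(40, cores)
    high = host_cpu_percent(80, cores)
    print(f"{cores} cores: {low:.0f}% to {high:.0f}% of the whole host")
```

On a real host, `os.cpu_count()` from the Python standard library returns the number of logical cores to use as the divisor.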


#8

Hi Trey,

There isn’t a max value for queue depth, so there is no problem with increasing it. I would increase it until the log messages stop.

We have another config option that may also help. You could try setting:

max_procs: -1

This will allow the agent to use more than 1 thread or distribute the load to more cores.

Hope this helps!

Paul


#9

I’ve changed several values today trying to get this working better and eyeballed the results. Here are some observations:

  • Changing event_queue_depth doesn’t help the full queue issue. The log accumulates at the same rate with a value of 1000 or 10000 or 50000. It does increase start-up time and memory usage of the agent.
  • Setting max_procs: -1 seems to result in less frequent, larger spikes in CPU usage. Average CPU usage by newrelic-infra seems more or less the same.

For now I have redirected the newrelic-infra log to /dev/null to avoid filling my disk (seriously, ~200 GB per day; this is not how to log). I would love to see a solution for this and I would definitely love to see lower CPU usage on the agent. As I’ve said many times already this seems like an awful lot of overhead to pay for monitoring, and I will be experimenting with other monitoring solutions.

Thanks again.
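
To put the ~200 GB per day figure in perspective, it can be turned into an approximate number of log lines per second. This is a back-of-the-envelope sketch under two assumptions that the thread does not confirm: that every line is about as long as the excerpt posted earlier, and that "GB" means 10^9 bytes.

```python
# The log excerpt posted earlier in this thread, used to estimate line length.
SAMPLE_LINE = (
    'time="2019-04-16T15:49:20-04:00" level=error '
    'msg="could not queue event" '
    'error="Could not queue event: Queue is full."'
)

# Bytes per line, including the trailing newline.
BYTES_PER_LINE = len(SAMPLE_LINE.encode("utf-8")) + 1

SECONDS_PER_DAY = 24 * 60 * 60


def lines_per_second(bytes_per_day, bytes_per_line):
    """Average number of log lines written per second for a daily log volume."""
    return bytes_per_day / SECONDS_PER_DAY / bytes_per_line


# Assumption: 200 GB taken as 200 * 10**9 bytes.
rate = lines_per_second(200 * 10**9, BYTES_PER_LINE)
print(f"about {rate:,.0f} lines per second")
```

Under those assumptions the result is on the order of tens of thousands of error lines per second, so the logging itself may account for part of the agent's CPU usage.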


#10

Hey @trey, I’m sorry to hear that you’ve had such poor performance from our Infrastructure agent. We certainly don’t expect that kind of overhead. It sounds like there may be some missing pieces to this puzzle, and I’d love to open a support ticket for you on this topic so we can speak more explicitly about your environment.

If we find anything of interest to the community, we’ll post it back here.