Which version of Windows or which distribution of Linux are you using? Debian 8 (Jessie)
What language version of the APM agent are you using? PHP (We have applications on PHP7.0 and PHP5.6)
What is your Infrastructure Subscription level? Essentials or PRO? Essentials
Describe what you are seeing. How does this differ from what you expected to see? I am seeing the newrelic-infra agent using most of the CPU on one of our hosts. This is reported in New Relic and is also consistent with reporting tools on the OS (top). The agent remains at 20-25% overhead. Is this expected? Is it the result of the high number of applications on this server (60)? I don’t expect to pay 20% overhead just to monitor my processes…
Hi @trey, this is not expected behaviour, though I notice you are using v1.0.898 of the Infrastructure agent, which is quite old. I’d recommend updating to the newest version first and then seeing whether usage improves.
Thank you for your response. At first glance updating appears to have solved the problem. Will keep an eye on it and check back in if it reappears.
@rdouglas Checking back in, the issue is popping up again. newrelic-infra is also generating several dozen GB of logs per day. Excerpt:
time="2019-04-16T15:49:20-04:00" level=error msg="could not queue event" error="Could not queue event: Queue is full."
That is the only message. Here is my agent config:
license_key: ****************
log_file: /var/log/newrelic/newrelic-infra.log
log_to_stdout: false
event_queue_depth: 5000
Obviously something is wrong. Would love some more insight. Thank you!
Jumping in here for @rdouglas, who is out. We have a configuration option available for the Could not queue event: Queue is full messages. In your newrelic-infra.yml file, please add (or raise) the event_queue_depth setting. You can increase it by 1000 at a time until the messages go away. This increases the size of the on-disk cache file.
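A minimal sketch of what that looks like in the config file (the value 6000 is illustrative only; start from your current setting and raise it in increments of 1000):

```yaml
# newrelic-infra.yml
# event_queue_depth controls how many events the agent can buffer
# before it starts logging "Queue is full". The value below is an
# example, not a recommendation.
event_queue_depth: 6000
```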
It’s also important to remember that the usage shown on the Processes tab is on a per-core basis: it’s a percentage of a single core, not of all the CPUs on the host. For example, 80% there means 80% of one core, which is only 20% of a 4-core host.
Please let me know if you have any questions.
Hi Paul, and thanks for your reply. Note that I’m already at 5000 on queue depth. Is it a good idea to keep increasing it?
Perhaps I could have been clearer: when I say 20%, I’m talking about total system load. The agent seems to sit between 40-80% of a single core consistently, which again seems like far too much overhead for monitoring.
I will continue to monitor and check back in.
There isn’t a max value for queue depth, so no problem on increasing it. I would increase it until the log messages stop.
We have another config option that may also help: max_procs. This allows the agent to use more than one thread and distribute the load across more cores.
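As a sketch (the value -1 is the one the user later reports trying; it lets the agent use all available cores):

```yaml
# newrelic-infra.yml
# max_procs sets the number of OS threads the agent may use.
# -1 removes the single-core limit so work can spread across cores.
max_procs: -1
```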
Hope this helps!
I’ve changed several values today trying to get this working better and eyeball-monitored the results. Some observations:
- event_queue_depth doesn’t help the full-queue issue. The log accumulates at the same rate with a value of 1000, 10000, or 50000. It does increase the agent’s start-up time and memory usage.
- max_procs: -1 seems to result in less frequent but larger spikes in CPU usage. Average CPU usage by newrelic-infra seems more or less the same.
For now I have redirected the newrelic-infra log to /dev/null to avoid filling my disk (seriously, ~200 GB per day; this is not how to log). I would love to see a solution for this, and I would definitely love to see lower CPU usage from the agent. As I’ve said many times already, this seems like an awful lot of overhead to pay for monitoring, and I will be experimenting with other monitoring solutions.
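One way to do this without shell-level redirection, assuming the log_file option shown in the config earlier in the thread, is to point the agent’s log directly at /dev/null (a workaround that discards all agent logging, not a fix):

```yaml
# newrelic-infra.yml
# Discard agent log output entirely to stop the disk filling up.
log_file: /dev/null
log_to_stdout: false
```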
Hey @trey I’m sorry to hear that you’ve had such poor performance relating to our Infrastructure agent. We certainly don’t expect that kind of overhead. It sounds like there may be some missing pieces to this puzzle and I’d love to open a support ticket for you on this topic so we can speak more explicitly about your environment.
If we find anything of interest to the community, we’ll post it back here.
To close the loop here, I wanted to follow up publicly and let any followers know that this has resulted in a feature request to allow blocking of device mounts with a regex instead of literal string values.
The issue here was that added resource cost of the agent was exacerbated by having many chrooted domains mounted to the same core mount location. Because the agent saw these individual mounts as unique devices, it resulted in the agent hitting the same base mount, over and over again, collecting the same data from the same source, quite heavily.
The solution is to use the … configuration option to ignore the problematic device. However, that option only accepts a literal string path, and using it results in a lack of visibility into the true mount device.
To that end, I filed a request with our dev team to modify the config option to accept regex values, which would fully solve the problem, and also allow visibility into any source mount location.
While this increased performance impact was specific to this user’s setup, systems running shared hosting platforms like WordPress may want to investigate this path, though it is an atypical experience. Our moderators will add a poll so that others can weigh in on this.