
NewRelic.ServerMonitor.exe CPU usage

windowsserver

#1

One of our servers shows significant CPU usage from the NewRelic.ServerMonitor.exe process:

NewRelic.ServerMonitor (SYSTEM): 11.7% CPU usage, 68.1 MB memory usage

Other servers show CPU usage of <2%

Is this normal behaviour?

Thanks!


#2

Hi @arnold,

Are you still experiencing high CPU usage?
It could be a “PID leak” causing very high PID values, which results in more time spent reading process information and, as a result, higher CPU usage.

Let’s investigate this:

  1. Launch the Windows Event Viewer. Under Custom Views, right-click “New Relic Windows Server Monitor Events View” and look for Process sampling is taking over 15 seconds.
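
If it is easier to check from a console, a rough PowerShell sketch along these lines (assuming the warnings are written to the Application event log; adjust if yours land elsewhere) will list recent occurrences of the warning:

# Sketch: list recent "Process sampling is taking over 15 seconds" warnings
# from the Application event log. Increase -MaxEvents if your log is busy.
Get-WinEvent -LogName Application -MaxEvents 5000 |
    Where-Object { $_.Message -like '*Process sampling is taking over 15 seconds*' } |
    Select-Object TimeCreated, ProviderName, Id |
    Format-Table -AutoSize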

Let us know your results.


#3

Thanks for your feedback.

I see many of those warnings in the Event Viewer. I restarted the monitoring service and the warning has not occurred since. However, I now see another warning and an error:


#4

Hi @arnold,

Thank you for verifying that you see those entries in your Event Viewer. Specifically, the Process sampling is taking over 15 seconds... warnings are indicative of a larger issue present somewhere on your server/instance/machine.

It is likely there is a problem application, or applications, not releasing Process IDs once a process is shut down. This causes the number of “in use” PIDs to appear much larger than normal and to increase over time. While not always the case, we have often seen that the culprit is a broken application that creates multiple PIDs as part of its normal operation, which is what results in rapidly increasing PID counts.

Normally, a process will properly release the PID it was running under, allowing Windows to reuse that PID for another process after a short time. This keeps PIDs in the sub-10,000 range even on heavily utilized servers.

When experiencing this issue, many PIDs cannot be reused, so the count continues to rise. This can be seen by checking the PID column in Task Manager or under Software Environment > Running Tasks in an MSINFO32 report. The PID values of many or most of the processes will be extremely high; we have seen PIDs in the 200K to 450K range on customer systems.
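
For a quick check of the same thing from PowerShell (illustrative only), the following lists the ten largest PID values currently visible along with the total process count; on a healthy server the PIDs stay in the sub-10,000 range mentioned above:

# Sketch: show the largest PID values currently in use and the process count.
Get-Process | Sort-Object Id -Descending | Select-Object -First 10 Id, ProcessName
"Running process count: $((Get-Process).Count)"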

It’s important to note that the number of running processes may not be very high. The unreleased PIDs are not easily discoverable in the Windows UI, if at all.

When WSM gathers data it checks each process individually for PID, owner, memory, and CPU usage. When the system is operating normally, the query for each process in the array returned by System.Diagnostics.Process.GetProcesses() takes milliseconds. When the PID count is enormous, the time per process can become a dozen seconds or more, so a sampling pass that should take a few seconds instead takes dozens. Based on testing (though unconfirmed), the polling cycles then begin to overlap, directing ever-increasing resources toward WSM. The same problem exists for per-process queries via WMI, because the underlying issue lies in the Windows infrastructure itself: these zombie PIDs slow down the act of querying for process stats.
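
As a rough illustration of where that time goes (this is not the agent’s actual code, just a sketch of the same kind of per-process work), you can time a full walk of the process list yourself and compare it against the 15-second threshold:

# Sketch: walk every process and read a few per-process stats, roughly the kind
# of work a sampling cycle performs. Healthy systems finish this quickly; a box
# suffering from zombie PIDs can take far longer.
$elapsed = Measure-Command {
    foreach ($p in [System.Diagnostics.Process]::GetProcesses()) {
        try {
            $null = $p.Id
            $null = $p.WorkingSet64
            $null = $p.TotalProcessorTime   # access denied on some system processes
        } catch {
            # skip processes we are not allowed to query
        }
    }
}
"Full process walk took $($elapsed.TotalSeconds) seconds"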

Version 3.3.2.0 and later of WSM will create a Warning in the Application Event Log stating Process sampling is taking over 15 seconds when the issue is occurring, which is what you’ve verified above.

To reiterate, this suggests that there is a serious issue on your server. If need be, we can open a support ticket so we can review your MSINFO32 output or look at the system by hand.

Ultimately the solution is to fix the leaking PID issue. If the source is not easily identifiable, we do have a workaround, but it does not resolve the issue; it only changes the agent’s behavior so that it polls the PID list at a slower interval and stops generating those warning messages, which also changes the data sent to your New Relic UI.

If we need to provide that last-resort workaround, we can, but we would prefer to help guide you toward resolving the issue on the system before suggesting it.


Regarding your latest warning message, it should not be any cause for alarm. The documentation page provided in the message is accurate, and you’re welcome to set the value to False with the options outlined (manually or via PowerShell).


#5

All,

I just started using New Relic Servers. I am also seeing high CPU usage on some of my servers. I have 2 servers hosting remote desktops (~75 sessions total), and I am seeing “Process sampling is taking over 15 seconds”. I think there are too many active processes on this server.

Is there some way I can limit process sampling? Right now I am not able to use it; NewRelic.ServerMonitor.exe is eating too much CPU.


#6

Hi @gerard,

This is a symptom of an issue with your system and it should be addressed rather than ignored.

Check your process list: are the PID values very high? If you see the 15-second warning, it means something is likely preventing PID values from being recycled. I would consult with your hosting provider to find and address the source of the issue.


#7

I have the same issue myself. I’m using version 3.3.3.0 and still see this issue. Rebooting the server fixes the problem temporarily, until the server has been in active use for a while (I have 25 RemoteApp users on average). Once I see a high spike in CPU, the server never corrects itself, and the CPU/memory spike for the New Relic process persists. Under normal conditions, I don’t see spikes of more than 5-6% CPU.
I’m using Windows 2012 R2 with all the latest updates.
It’s hard to imagine that those of us reporting this issue all have misconfigured boxes. I have New Relic server monitoring running on all my servers, and this is the only server where I get this sampling error. 25 remote users is not high by any means, and I do not see problems on any box other than this one.


#8

To clarify, high CPU usage by the WSM agent and long polling times are potentially related, but they are not correlated 1:1 in every case.

If you specifically see log messages that read Process sampling is taking over 15 seconds, this is indicative of a PID leak and points to an issue with an app on the system or with the OS itself. The number of active users should not be related to these log entries and, hopefully, not related to any performance impact our WSM agent may be having on your system.

If those log entries are not present and you’re experiencing high CPU usage that can be attributed to our server agent, that is worth investigating and is not necessarily related to a PID leak or a poorly configured server.

If you do see these log entries, my message previously posted should guide you to a resolution. If this doesn’t seem to be the case, please let us know and we can either work from here, or we can get a support ticket going for deeper investigation.


#9

Hi @netheroth / New Relic Support

I’m currently running into this issue.

I get the Process sampling is taking over 15 seconds warning in event viewer.
The largest process Id on the server is 465348

So far I’ve been unable to figure out the root cause for the max PID being that high. I don’t see many new processes being created at the moment. However, the server has been up for a while (about 9 days).

However, I’m still surprised that sampling takes over 15 seconds. As a quick test I’ve run System.Diagnostics.Process.GetProcesses() manually a number of times, and 71 ms is the longest it has taken.

PS H:\> Measure-Command { [System.Diagnostics.Process]::GetProcesses() }
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 71
Ticks             : 713623
TotalDays         : 8.25952546296296E-07
TotalHours        : 1.98228611111111E-05
TotalMinutes      : 0.00118937166666667
TotalSeconds      : 0.0713623
TotalMilliseconds : 71.3623

I’m not sure if you want to handle this in a new issue but I could use some assistance.


#10

Hello @daniel.little. I’m going to start by encouraging you to reread Neth Roth’s post. It is fairly obvious you are suffering from a PID leak (what we lovingly refer to as “zombie PIDs”). With a maximum PID of 465,348, you have roughly 455,000 more PID values in circulation than you should. With that many PIDs we would expect the CPU usage to peg. I suspect your system has a large number of cores, which is the only explanation I can think of that would prevent the CPU usage from jumping to 100%.

Let me address your manual test first. Yes, we use System.Diagnostics.Process.GetProcesses() to obtain the list of PIDs in the first place. However, the Windows Server Monitor (WSM) then checks the owner, memory, and CPU usage of every single one of those PIDs. It’s not just pulling the data; it’s parsing the results. That is the time your manual test does not account for, and it is why polling cycle A cannot finish before polling cycle B is scheduled to start.
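
To make the comparison with your Measure-Command test a little fairer (this is still only a sketch, not the agent’s implementation), you can also pay the per-process cost, for example by fetching the owner of every process through WMI and timing the whole pass:

# Sketch: enumerate all processes via WMI and look up the owner of each one.
# The per-process lookups are where the time goes once zombie PIDs slow the
# underlying queries down.
$elapsed = Measure-Command {
    Get-CimInstance Win32_Process | ForEach-Object {
        $null = Invoke-CimMethod -InputObject $_ -MethodName GetOwner -ErrorAction SilentlyContinue
    }
}
"Enumeration plus per-process owner lookups took $($elapsed.TotalSeconds) seconds"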

We know what the overarching issue is. The details are not so easy. Here is an extremely key sentence in the explanation that Neth provided:

The unreleased PIDs are not easily discoverable in the Windows UI, if at all.

I work very hard at not being deprecatory about Windows; for all its faults, there are a lot of positive aspects to the operating system. Unfortunately, this is an area that presents significant challenges in determining what is going on under the hood. It’s not a lack of willingness to help you; we really do want to do our utmost to help you understand the underlying issue. The problem is that there is not, as far as we have discovered, a simple way to figure out which application(s) is generating these zombie PIDs. One of our senior developers, one of the most wickedly smart .NET developers I’ve ever met, has looked into this issue, and even he has not been able to come up with an easy way of tracking down the offending application.

All that said, I’m going to provide a [possible] way forward to (hopefully) discover the root cause of the zombie PID issue. I don’t think you’re going to like it (I sure don’t), as it is extremely cumbersome. Still, it is about the only way to even start attempting to discover the offending application(s) and correct the issue with them.

Let me start with a few basic premises:

  1. It is highly unlikely the offending process is a native Windows process or service. While (as previously stated) the Windows operating system does have its faults, we have yet to discover a native process that will create zombie PIDs. That is not to say it couldn’t happen, but rather, native Windows processes would be the least likely suspect.

  2. Commercial applications are second on the list of least likely offenders. That most definitely does not rule them out, but despite the many bugs we often see in commercial applications, most software vendors are extremely fast in addressing issues that could affect CPU or memory on a system (New Relic certainly addresses them as a top priority).

  3. That leaves in-house applications or services as the primary suspects. When investigating an issue like this, we want to start with these types of applications at the top of the list. In order of priority, the investigation would focus on 1) in-house custom applications, 2) third-party custom applications, 3) commercially available applications, and 4) Windows native applications.

Keep in mind that I’m using the term “application”, but this applies to standard applications and Windows services interchangeably. It might be more accurate to refer to them as processes, but a process is still an application, so I use “application” as the primary term.
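
As a rough starting point for that prioritization (a sketch only; the Company field is not populated for every process and is not authoritative), you can group the running processes by the vendor recorded in their executables. Blank or unfamiliar company names are often the in-house or third-party candidates worth examining first:

# Sketch: group running processes by the Company field of their main module.
Get-Process |
    Group-Object Company |
    Sort-Object Count -Descending |
    Select-Object Count, Name |
    Format-Table -AutoSize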

From there, one by one, we need to take a look at each process running on the system, based on the priority I’ve outlined. We want to start with full memory dumps. Details on how to do this can be found at the following link:

https://support.microsoft.com/en-us/kb/969028

This describes the process for Windows versions up to 2008, but I believe it is the same for 2012. It is also a fairly complicated and convoluted process to get “full” dumps. But remember, I qualified at the start that the entire procedure is cumbersome. I do wish, rather vehemently in fact, that there were an easier way.

Once we have the memory dump, the next step is to analyze it. There are two tools I will offer for this (there may be others). The first one is Debug Diag:

https://www.microsoft.com/en-us/download/details.aspx?id=49924

This is the simpler of the two tools for analyzing a memory dump file. Usage of this tool is well outside the scope of this post, so I strongly suggest using Google (or whatever your favorite search engine is) to find a tutorial on using it. The same goes for the second tool I will recommend, WinDbg (“Windbag”):

http://www.windbg.org/

This one is more complex to use and set up, but it does provide a deeper analysis and can be an excellent tool when DebugDiag is not enough.

Going back to the “cumbersome” portion of this, a memory dump and diagnosis will be necessary for [literally] every process on the system. You will want to start with the most likely suspects first, going back to point 3 above. Should the first priority not produce any smoking guns, you’ll need to work back through the priorities for every process running on the system.
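
If the procedure in the KB article above proves too cumbersome, a possible alternative, assuming you can install the Sysinternals ProcDump tool on the box, is to script full per-process dumps of the suspects. A minimal sketch (the process names below are placeholders, not real suspects):

# Sketch: write a full memory dump (-ma) of each suspect process with ProcDump.
# Substitute your own prioritized list for the placeholder names in $suspects.
New-Item -ItemType Directory -Path 'C:\dumps' -Force | Out-Null
$suspects = 'MyInHouseService', 'SomeThirdPartyApp'
foreach ($name in $suspects) {
    Get-Process -Name $name -ErrorAction SilentlyContinue | ForEach-Object {
        & procdump.exe -accepteula -ma $_.Id ("C:\dumps\{0}_{1}.dmp" -f $name, $_.Id)
    }
}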

Yes, this is a significant amount of work that will take a lot of time. In the end, though, there are two very positive outcomes. One, you will have identified a rogue application on the system that is causing a root level issue. Two, the WSM will stop consuming so much CPU in the effort to provide diagnostics for system performance.


#11

Experiencing this issue also, again on a Remote Desktop server (w2k12). It’s obvious that there is a PID issue here; however, I can’t see anyone specifically addressing the Remote Desktop component. It would be very helpful if someone from New Relic could confirm whether the agent is categorically supposed to use <1% of CPU on Remote Desktop servers, even when there are 100 users on the machine.

So far none of the answers address RDP Servers.


#12

@bojan

I don’t believe the RDP server part of the discussion is relevant; as far as the server monitor goes, a server is a server. WSM is expected to use a small amount of CPU under normal circumstances, even with hundreds of processes running on the system (even so, I wouldn’t necessarily guarantee <1%). When the zombie PID issue arises, retrieving process info slows down, causing longer and overlapping sampling periods, which results in increased CPU utilization; this appears to be a defect in the underlying WMI queries used to retrieve process info. In any event, the underlying problem is not with the agent.

On a related note please see this post.