Hi @Trevor_Dearham - The server monitor (I’m using the generic term, as this applies to Linux as well) collects metrics for every single process on the system, active or not (a running process that is idle is still recorded). This is one of the reasons it can run into trouble with zombie PIDs. While we see this almost exclusively on Windows, it has popped up on Linux on occasion. It is extremely rare, but not impossible. Regardless, every process running on the system, background or foreground, is queried, metrics are created, and the data is reported to New Relic.
The Processes report, on the other hand, only displays the top 20. For a somewhat extreme example, it is possible (though incredibly unlikely) that none of the top 20 CPU consumers are also among the top 20 memory consumers. In such a case, we would need metrics for at least 40 processes. And that set can change, literally, minute to minute, so which ones do we include or exclude? How do we know it won’t change all over again in the third minute? To ensure aggregation over any given time window, every process must be queried and accounted for.
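To illustrate the point with a toy sketch (hypothetical process data, not anything pulled from the monitor): rank the same sample by CPU and by memory, and the two top-20 sets generally differ, so the set of processes you would have to keep can approach 40.

```python
# Toy illustration (hypothetical data): the top-20 CPU consumers and the
# top-20 memory consumers can be largely different sets, so both views
# must be answerable from metrics collected for every process.
from random import Random

rng = Random(42)  # fixed seed so the illustration is repeatable
processes = [
    {"name": f"proc{i}", "cpu": rng.random() * 100, "mem": rng.random() * 8192}
    for i in range(60)
]

top_cpu = {p["name"] for p in sorted(processes, key=lambda p: p["cpu"], reverse=True)[:20]}
top_mem = {p["name"] for p in sorted(processes, key=lambda p: p["mem"], reverse=True)[:20]}

# The union is what the monitor would have to have collected -- and its
# membership changes sample to sample, so everything gets measured up front.
needed = top_cpu | top_mem
print(len(top_cpu), len(top_mem), len(needed))
```

The union here is just one random draw; the real point is that membership shifts every sample cycle, which is why pre-filtering is a non-starter.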
Exposure to that data is a different story. You get to see the top 20 memory and/or CPU consumers in the processes list. That’s it. No more. There is no configuration option to see more, nor a report that will contain more. That does not mean the metrics are not there or accessible. In fact, you can get at them one of two ways. The first is the Data Explorer in Insights. There is a “Metrics” option; from there, select the entity type (Servers), then the server name, and click on ProcessSamples under the suggested searches. That may not expose every process running on the system, but I have yet to hit a maximum value for that list. My busiest test server has 48 running processes, and all of them show up. There may be a limit, so you might need to type in a process name for a more specific search, or increase the time window to see previously running processes (there’s a drop-down in the upper right of the page; the default is 30 minutes).
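If your account also exposes this data to NRQL, a query along these lines can pull the same per-process metrics directly. Treat every name here as an assumption: the event type (`ProcessSample`), attribute names (`cpuPercent`, `memoryResidentSizeBytes`, `processDisplayName`, `entityName`), and the server name are illustrative and may differ from what your account actually reports.

```sql
-- Hypothetical NRQL sketch; event and attribute names are assumptions
-- and may not match what your account exposes.
SELECT average(cpuPercent), average(memoryResidentSizeBytes)
FROM ProcessSample
WHERE entityName = 'my-server'
FACET processDisplayName
SINCE 30 minutes ago
LIMIT 40
```

The `SINCE` clause plays the same role as the time-window drop-down in the Data Explorer.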
It is true that the server monitor does not collect data on the “state” of a service. Keep in mind, however, that the monitor reports on running processes. The absence of a process is something it is simply not designed to look for, and the mere idea of doing so opens a huge can of worms. For Windows, we could technically scan the registry for all installed services. But if we’re going to do that, wouldn’t we also need to include all installed applications? A good, and very important, example of why that would be necessary is the w3wp.exe process. That is the process that serves up web pages, but it is not initially spawned as a service. In fact, it is a child process of svchost.exe, which is started by the World Wide Web Publishing Service (the service name is W3SVC, and even that doesn’t match the process name). We would definitely want to track usage for this process.
Services would, in the end, be the easy part. Obtaining the state of a particular service in Windows is actually pretty simple. Processes that do not run as a service are a different story. For those, it’s not so much determining the state (not running would be considered “stopped” and running would be considered “started”), but rather how the lookup itself would impact system resources. My simple test system has several thousand CLSIDs in the registry. The code couldn’t pick and choose; any one of them could be important. Iterating through all of them during every single sample cycle (which occurs every 20 seconds) has the potential to significantly impact server performance. Worse, much like the zombie PID issue, there would be more occurrences of the monitor failing to finish one sample before the next cycle began, which would in turn cause the monitor to consume either CPU or memory at a grossly accelerated rate.
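The overrun failure mode is easy to model. A toy sketch (the 30-second work cost is an invented number, not a measurement of the monitor) shows that whenever per-sample work exceeds the 20-second cycle, unfinished work accumulates instead of the monitor staying caught up:

```python
# Toy model (hypothetical numbers) of the sample-cycle overrun described
# above: if a sample's work takes longer than the 20 s cycle interval,
# unfinished work piles up rather than draining.
CYCLE_SECONDS = 20            # the monitor's sample interval
WORK_SECONDS_PER_SAMPLE = 30  # assumed cost of scanning every CLSID

def backlog_after(cycles: int) -> float:
    """Seconds of unfinished work still queued after `cycles` cycles."""
    backlog = 0.0
    for _ in range(cycles):
        backlog += WORK_SECONDS_PER_SAMPLE           # new work scheduled each cycle
        backlog = max(0.0, backlog - CYCLE_SECONDS)  # time available to burn it down
    return backlog

# Each cycle falls another 10 s behind; the resources held by in-flight
# samples therefore grow without bound.
print(backlog_after(1), backlog_after(10))
```

With these numbers the backlog grows by 10 seconds per cycle, which is exactly the runaway CPU/memory pattern described above.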
That doesn’t mean there aren’t possible solutions. The one that comes immediately to mind is a configuration option that lets you specify processes of particular concern, which in turn could “force” the monitor to report on their availability or lack thereof. This would cover “state” for particular services as well as alerting when a non-service process has not run during a specified time span. This would, of course, require a feature request, as the appropriate code would need to be written to make it happen.
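To make the idea concrete, here is a minimal sketch of what such an option could look like. Nothing like this exists in the monitor today; the watch-list names, the snapshot, and the “started”/“stopped” states are all invented for illustration:

```python
# Sketch of the hypothetical "processes of concern" option suggested
# above. The config and states are invented; no such feature exists yet.
WATCH_LIST = {"w3wp.exe", "httpd", "mysqld"}  # user-configured process names

def report_states(running: set[str]) -> dict[str, str]:
    """Map each watched process to 'started' or 'stopped' given a snapshot
    of running process names. Only the watch list is checked, so the cost
    per sample cycle is bounded regardless of system size."""
    return {name: ("started" if name in running else "stopped")
            for name in WATCH_LIST}

snapshot = {"httpd", "sshd", "cron"}  # assumed running processes
print(report_states(snapshot))
```

The key design point is the bounded lookup: the monitor only ever checks the handful of names you configured, which is what avoids the registry-scan and cycle-overrun problems described earlier.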
Doing it this way would also address the concerns with Linux. One of the beauties of Linux (and, consequently, one of its potential issues) is that there is no concept of anything like the Windows registry. The only way to determine which applications are installed is through a package manager, and only if it was used for every installed application. Even then, the number of package managers in the wild would make coding the server monitor to determine all installed packages something of a nightmare. With an option to specify the process or daemon whose state you want to check, that concern would be alleviated. Still, I suspect there would be an upper limit to the number of processes, services, or daemons you could specify, to ensure the monitor doesn’t end up causing resource issues itself.
Right now, though, we can alert on metric data based on the metric name in New Alerts. As long as the monitor/agent (this applies to APM as well) is reporting, zero or null is a valid value. Setting your alert criteria to < 1 means a result of zero will trigger an alert, as long as the metric exists and the monitor itself is still reporting. (I would add a second condition, or possibly make it the only condition, equivalent to = 0, since a fractional value would also satisfy the < 1 criterion.) I suggest using throughput as the measure because the minimum threshold window is five minutes, which means the process would have to have no activity at all (assuming throughput = 0) in that time period before it triggered an alert. (Keep in mind there are multiple “alert trigger” time spans: 10m, 15m, 30m, 1h, 2h.)
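A toy evaluator (this is not New Relic’s actual alert engine, just the comparison logic spelled out) shows why the two conditions differ: < 1 also fires on a trickle of fractional throughput, while = 0 fires only when there was no activity at all in the window.

```python
# Toy evaluator for the condition logic described above -- not New
# Relic's alert engine, just the two comparisons spelled out.
def fires_lt_1(throughput: float) -> bool:
    """The '< 1' condition: fires on zero AND on fractional values."""
    return throughput < 1

def fires_eq_0(throughput: float) -> bool:
    """The '= 0' condition: fires only on true inactivity."""
    return throughput == 0

# A trickle of traffic (0.5 req/min) trips '< 1' but not '= 0', which is
# why '= 0' is the stricter test for "the process did nothing".
for t in (0.0, 0.5, 12.0):
    print(t, fires_lt_1(t), fires_eq_0(t))
```

This is the gap the parenthetical above is guarding against: if a near-idle process can still emit fractional throughput, < 1 alone will produce false alarms.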
With respect to looking at specific Java processes, I can offer a workaround for this as well, though it does take some effort. Both Windows and Linux report on individual processes based on their owner (in APM, the owner is in parentheses to the right of the process name), and the metric name includes the process owner as well. If you want Java processes to report as separate processes, you can change the user account that runs each individual process. Create owner names that work for your organizational purposes and you get not only CPU/memory usage broken out by individual process, but alerting as well.
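On the Linux side, one way to pin a Java service to its own owner is a systemd unit with a dedicated `User=`. This fragment is only a sketch of that idea; the unit name, user account, and jar path are all invented for illustration:

```ini
# Hypothetical unit, e.g. /etc/systemd/system/billing-app.service
# (names and paths invented). Running the service as its own user makes
# the monitor report it as a distinct process owner.
[Unit]
Description=Billing service, run as its own owner for monitoring

[Service]
# Dedicated owner name chosen purely for reporting/alerting purposes
User=java-billing
ExecStart=/usr/bin/java -jar /opt/billing/app.jar
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Repeat the pattern with a different `User=` per Java service and each one shows up, and can be alerted on, separately.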
Finally, alas, we come to one of the cruxes of the issue. To qualify for New Alerts, you must have a paid account. New Alerts is not available for free accounts. That is something I’m afraid there is no workaround for.
Hopefully this helps you understand the possible options.