Relic Solution: Troubleshooting Java Agent Memory Leaks and OutOfMemoryError Issues with Eclipse Memory Analyzer Tool

The New Relic Support team often receives requests from users reporting that the Java agent might be responsible for unacceptable memory usage. If you suspect that the Java agent is causing your application to run out of memory, or you have received an OutOfMemoryError, reviewing a heap dump from the JVM with a memory analysis tool can help determine a path forward. Given the size of some heap dumps, as well as security requirements in your organization, providing a heap dump to our team may not be an option. In this document, we will discuss using the Eclipse Memory Analyzer Tool (MAT) to identify problem areas in memory management and to detect leaks as they relate to the Java agent.

Memory issue types

In general, there are really only three classes of memory issues involving the Java agent.

1) Heap size too small

This is the simplest memory issue, and it occurs when the -Xmx value is set low enough that adding the agent simply pushes the application over that limit. For example, if an application runs with -Xmx set to 256MB and has been tuned to sit right at 250MB of heap usage, then adding the agent will almost certainly push it over the edge and into OutOfMemory territory.

This occurs because the agent requires a fixed amount of heap memory in order to function, and it does not use off-heap memory, so its overhead counts against the -Xmx limit. An undersized heap is not an especially common issue for our Support team, but it is still worth noting.

JVM options for allocating memory (a sample launch command follows the list):

  • -Xms - initial Java heap size
  • -Xmx - maximum Java heap size
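As a rough illustration, raising the heap size when attaching the agent might look like the following. The paths, jar name, and heap values here are placeholders rather than recommendations; size the heap based on your application's measured usage plus some headroom for the agent.

```
# Hypothetical launch command: the application was previously tuned for -Xmx256m,
# so the maximum heap is raised to leave room for the agent's own heap usage.
# Paths, sizes, and the application jar are placeholders for your environment.
java -Xms512m -Xmx512m \
     -javaagent:/path/to/newrelic/newrelic.jar \
     -jar my-application.jar
```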

2) Real memory leak (caused by application code)

This issue is relatively self-explanatory: the application itself has a memory leak, and it happened to surface while the agent was attached. From what we can tell this is actually pretty rare, but it can be confirmed or ruled out based on the heap dump contents (more on that below).

3) Real memory leak (caused by the New Relic Java agent)

When the agent is responsible for a real memory leak, the causes include (but are not limited to):

  • ClassLoader leaks (preventing app servers like Tomcat from freeing up memory on redeployment)
  • Instrumentation that stores strong references to objects that are never freed
  • Internal Java Agent services that store strong references to objects that are never freed
  • Instrumentation that prevents objects from being collected (rare)

It should be noted that when one of these memory leaks exists, a large number of users typically run into the same issue, so it is very likely already on our Support team’s radar.

Tools/Setup Required

  • For analyzing the heap dump, we will use the Eclipse Memory Analyzer Tool (MAT): Eclipse Memory Analyzer Open Source Project

  • You may need to bump up the heap available to MAT to allow large heap dumps to be loaded

    • Edit mat.app/Contents/Eclipse/MemoryAnalyzer.ini (this is the macOS path; on Linux and Windows, MemoryAnalyzer.ini sits next to the MemoryAnalyzer executable)
    • Change “-Xmx1024m” to “-Xmx4096m” or similar (the higher the better, depending on your machine’s available resources)
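For reference, the JVM arguments appear at the end of that file, after the -vmargs marker. After the change, the last lines typically look something like this (the surrounding launcher options vary by MAT version and platform and are omitted here):

```
-vmargs
-Xmx4096m
```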

Troubleshooting steps

  1. In order to troubleshoot any OutOfMemory issue, we must get a heap dump. A heap dump contains a snapshot of everything on the heap at the moment the dump was triggered, and there is no other simple way to verify the cause of a memory issue without one.

There are two ways to capture a heap dump:

  • By setting the following JVM option, which lets the JVM capture a dump automatically when an OutOfMemoryError is triggered: -XX:+HeapDumpOnOutOfMemoryError

  • By manually triggering a heap dump with the following command: jmap -dump:file=/tmp/app_heapdump.hprof <pid>

NOTE: If triggering the dump manually with jmap (the second option), it’s important to run the command only when heap utilization is at or near the maximum heap size; otherwise the heap dump might not contain enough information to troubleshoot further. Example commands for both options are shown below.
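As a sketch (the pid, file paths, and application jar below are placeholders; jps and jmap ship with the JDK):

```
# Option 1: let the JVM write a dump automatically on OutOfMemoryError.
# -XX:HeapDumpPath is optional but makes the output location explicit.
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/app_heapdump.hprof \
     -javaagent:/path/to/newrelic/newrelic.jar \
     -jar my-application.jar

# Option 2: trigger a dump manually while heap usage is at or near its maximum.
jps -l                                          # list running JVMs to find the target <pid>
jmap -dump:file=/tmp/app_heapdump.hprof <pid>
```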

  2. Now that we have a heap dump to work from, we need to download it locally and load it into the Eclipse Memory Analyzer. Heap dumps will generally end in .hprof or .bin, and both can be loaded by MAT.


  3. MAT will take a while to parse the file, depending on how large it is. Once complete, you’ll be shown the “Getting Started Wizard”. Select Leak Suspects Report and click Finish. This step is not required, but the Leak Suspects Report can save you some time when trying to find the cause of a memory leak.
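As an aside, if a dump is too large to parse comfortably in the UI, MAT also ships command-line scripts that can parse a dump and generate the Leak Suspects report headlessly. The script name and report ID below come from the MAT documentation and may vary by MAT version, so treat this as a sketch rather than a guaranteed invocation:

```
# Run from the MAT installation directory (use ParseHeapDump.bat on Windows).
# Parses the dump and writes a Leak Suspects report zip alongside the .hprof file.
./ParseHeapDump.sh /tmp/app_heapdump.hprof org.eclipse.mat.api:suspects
```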

  4. As soon as the dump is loaded, you’ll be shown the Leak Suspects page. The important things to take note of are:
  • The total size of the heap
    If the size of the heap is far smaller than the -Xmx value, then it’s unlikely that this dump will be helpful. If we have a memory leak and the dump was captured at the right time, the total size should end up very close to the maximum size of the heap.

  • The size of each Problem Suspect
    If we have a problem suspect that is taking up 50% or more of the heap it’s very likely that we’ve found the cause of our memory leak.

  • The location (package) of each Problem Suspect
    The package of a problem suspect can tell us quite a few things. If the problem suspect is in “com.newrelic.*”, then it’s very likely that we have a real memory leak within the agent, most commonly related to transactions. If the problem suspect is in one of your application’s packages, then more investigation will be required to figure out the cause of the leak. It is always possible that a memory leak is caused by an instrumentation bug that gets woven into the application’s classes.

As an example:

  • The total size of the heap = 221.7MB. This is pretty small, so unless -Xmx is set to a similarly small value (such as 256MB), this would not necessarily indicate a real memory leak.

  • The problem suspects are only about 25% of the entire heap combined, so it’s not likely that these are significant in this case.

  • The fact that one of the problem suspects is a ClassLoader is interesting and could indicate a problem with instrumentation or the agent, since we do make fairly extensive use of ClassLoaders. But notice that both of these suspects are outside of the com.newrelic.* prefix, so this is not conclusive at this point.

Example Troubleshooting Steps #1:

When reviewing a heap dump, start by looking at the Leak Suspects report from the Overview page.

This shows which items are holding the most memory. In this example, there is really only one item: the New Relic Java agent’s Transaction Service (com.newrelic.agent.TransactionService).

If you click on the Details button at the bottom, you’ll see a page that breaks down where the memory held by that object is coming from. In this example, we’re seeing many ConcurrentHashMap nodes held by the updateQueue.

To drill down into the updateQueue, click on the link to the TransactionService object to the right of updateQueue and select List objects > with outgoing references.

You can expand the tree down to the updateQueue, and then even further to the transaction objects stored in it. To find the names of those transactions, locate and select the priorityTransactionName; its fields will show up in the left panel. We’re looking specifically for the partialName of the transaction.

We can now see that in this example /SpringController/coupon/discountList/{placeId} (GET) transactions that aren’t finishing quickly (or at all) seem to be the source of the problem.

Now that we know which transaction(s) are being held in memory by the TransactionService, we can begin to investigate.

  • What do we know about that endpoint?
  • Is there any custom instrumentation in use? (One quick way to check is sketched after this list.)
  • How long should the transaction take to complete (long-running transaction)?
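If custom instrumentation is a possibility, one quick way to check is to look at what is deployed alongside the agent. The paths below are placeholders and these checks are only a sketch; custom instrumentation can also be added via the agent API directly in application code:

```
# Hypothetical checks, assuming the agent is installed under /path/to/newrelic:
# custom XML/jar instrumentation is typically deployed in the agent's extensions directory.
ls -l /path/to/newrelic/extensions/

# The class_transformer section of newrelic.yml shows which instrumentation
# modules have been enabled or disabled in the agent configuration.
grep -n -A 5 "class_transformer" /path/to/newrelic/newrelic.yml
```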

Example Troubleshooting Steps #2:

As before, you’ll want to look at the Leak Suspects report from the Overview page.

You can see there is one item, the New Relic Java agent’s Transaction Service (com.newrelic.agent.TransactionService). If you click on the Details button at the bottom, you’ll see a page that breaks down where the memory held by that object is coming from. In this example, we’re again seeing many ConcurrentHashMap nodes held by the updateQueue.

To drill down into the updateQueue, click on the link to the TransactionService object to the right of updateQueue and select List objects > with outgoing references.

Expand the tree down to the updateQueue, and then to the transaction objects stored in it. Find the names of those transactions by selecting the priorityTransactionName and then the partialName of the transaction.

We can now see that in this example, /Custom/API Gateway Dispatch transactions seem to be the source of the problem, and we can begin to investigate.

Important note:
Seeing com.newrelic.agent.instrumentation.context.InstrumentationContextManager near the top of the dominator tree does not indicate a memory leak with our agent. It is not uncommon to see this class taking up 10-25MB of heap space, as we use it for caching important weave-related information.
