Client using large amount of memory

I have a customer app that is using New Relic. When I remove the NR gem, memory use decreases significantly.

This is the memory diff:

Ruby 2.0.0: increases by about 5 MB, not bad
Ruby 2.1.3: increases by about 50 MB 0_o
Ruby 2.2.0: increases, but not as much

Here’s the script I used to benchmark: https://gist.github.com/schneems/78054da3d48db3e43c07

$ rake -f perf.rake ram_over_time RAILS_ENV=production USE_SERVER=1 TEST_CNT=100_000

Ran with and without the gem in the Gemfile.
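For anyone who can’t open the gist, the general approach is roughly this (my own summary of the idea, not the actual contents of the gist): boot the app, issue a large number of requests against it, and sample the process RSS as you go.

    # Rough sketch of a ram_over_time style measurement (not the actual gist).
    # Run from the root of a Rails app; assumes the standard config/environment.rb.
    require_relative "config/environment"
    require "rack/mock"   # Rack::MockRequest ships with the Rack used by Rails 4

    app     = Rails.application
    request = Rack::MockRequest.new(app)

    File.open("ram_over_time.txt", "w") do |log|
      100_000.times do |i|
        request.get("/")
        if i % 100 == 0
          rss_kb = `ps -o rss= -p #{Process.pid}`.to_i   # resident set size in KB
          log.puts "#{i},#{rss_kb}"
        end
      end
    end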

Any ideas on how to drill down and figure out why there’s such a massive difference? Can you bench some apps and share results?

These different versions of Ruby have very different GC models. The move to generational GC doesn’t play quite as well with the Ruby agent. However, this extra memory does eventually get freed, just maybe not as quickly as with earlier Rubies. (On the other hand, the VM is doing a lot less work to free that memory, so it’s the classic trade-off between time and space optimizations.)
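If you want to check that for yourself, here’s a rough way to do it (my own sketch, not an official diagnostic): snapshot GC.stat, force a major GC, and diff the counters. If the extra memory is merely uncollected rather than leaked, the old-generation numbers should drop after the full mark.

    # Sketch: snapshot GC.stat, force a major GC, and print the counters that changed.
    # (GC.stat key names differ slightly between Ruby 2.1 and 2.2, so we diff whatever is there.)
    before = GC.stat.dup
    GC.start(full_mark: true, immediate_sweep: true)
    after = GC.stat

    after.each do |key, value|
      delta = value - before.fetch(key, 0)
      puts "#{key}: #{before[key]} -> #{value}" unless delta == 0
    end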

The axis on your graph just says “Time”. How much time does that represent, out of curiosity?

Chris Pine
Senior Software Engineer, Ruby Agent
New Relic

I put a more usable form of the script here: https://github.com/schneems/derailed_benchmarks

The “time” represents the time to complete 100,000 requests, which varies per run. It took well over 30 minutes; I’m not sure of the exact time, but it was likely around 50 minutes.

I’m familiar with the changes to GC. Koichi Sasada, the author of the new GC, is one of my co-workers at Heroku, and we’ve been digging into strange memory usage issues recently. The memory never gets freed as you suggest; however, it does level out over time. While RGenGC can account for some increased RAM, in most programs I’ve seen that increase to be low, around ~5%, not the 500% increase seen in this case.

As another data point, here is the RAM that the client takes right away at require time (measured with rake -f perf.rake perf:require_bench; a rough sketch of this kind of measurement follows the table):

  newrelic_rpm: 6.6211 mb
    new_relic/control: 6.6055 mb
      new_relic/agent: 4.2188 mb
        new_relic/agent/agent: 1.4922 mb
          new_relic/agent/configuration/manager: 0.4688 mb
        new_relic/agent/transaction_sampler: 0.4805 mb
      conditional_vendored_metric_parser: 1.5586 mb
        metric_parser: 1.5469 mb
          new_relic/metric_parser: 1.5234 mb
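
For context, a per-require measurement like the one above can be approximated by diffing process RSS around a require. This is only a sketch of the idea; it is not how derailed_benchmarks actually implements perf:require_bench (which, as I understand it, hooks require to build the nested tree shown above).

    # Sketch: approximate the RSS cost of requiring the gem by diffing process
    # memory before and after. Assumes the newrelic_rpm gem is installed.
    def rss_mb
      `ps -o rss= -p #{Process.pid}`.to_i / 1024.0
    end

    before = rss_mb
    require "newrelic_rpm"
    after = rss_mb

    puts format("newrelic_rpm: %.4f mb", after - before)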

Regardless of the reason, we need to figure out why the client is taking up more RAM with newer Ruby versions. If there’s a bug in the interpreter, then you can kick it back to me and I can work with Koichi to get it fixed. If there’s something we can do in the client to avoid this growth, then your customers will be happy to have that 50 MB of RAM back.

I’ve put just about everything I know about profiling memory of running applications into those rake tasks, and it should be easy-ish to apply them to just about any Rails 4 app. I still can’t figure out where the usage is coming from in the New Relic client, and I need help. One potential source of information I’ve not applied yet is https://github.com/ko1/allocation_tracer. I’m happy to work with you/NR on this, but I need some help. We can take this conversation offline if you prefer: richard@heroku.com.
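In case it helps anyone digging in, here is roughly how allocation_tracer can be pointed at the agent, based on my reading of its README (the choice of requiring newrelic_rpm inside the trace block is just an example target):

    # Sketch: count allocations by file/line/type while loading the agent.
    # The result is a hash keyed by [path, line, type]; the first element of
    # each value is the allocation count.
    require "allocation_tracer"

    ObjectSpace::AllocationTracer.setup(%i[path line type])
    result = ObjectSpace::AllocationTracer.trace do
      require "newrelic_rpm"
    end

    result.sort_by { |_, counts| -counts.first }.first(20).each do |(path, line, type), counts|
      puts "#{counts.first}\t#{type}\t#{path}:#{line}"
    end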

We’ve created a ticket so we can work toward resolving this issue. If necessary, we’ll update this forum post with our findings. Thanks!

@richard_schneeman / @hakimu Any updates on this issue? I’ve experienced similar memory issues running NewRelic on Heroku.

Happy to provide additional information if helpful.

Hi @stevenr!

We’ve been spending a lot of time on this recently, and it looks like there are actually two separate issues here:

  1. In some applications on Ruby 2.1, the Ruby agent triggers higher but stable memory usage while the application is under load, as @richard_schneeman found.
  2. Applications that are idle and using Ruby agent versions 3.9.4 - 3.9.6 show steady growth in memory usage over time.

We believe we’ve isolated the cause for the second issue, but we’re still working on fully understanding the first.

The second issue is related to how the Ruby agent communicates with our servers. By default, we initiate a new SSL connection to our servers every minute in order to submit data. In the process of establishing a new SSL connection, the openssl library makes a number of native allocations that do not go through Ruby’s malloc wrapper, and thus don’t count toward triggering a garbage collection run. The bulk of these allocations are associated with parsing the .pem file that we ship with the agent, which contains the set of root CA certificates we use to validate the certificate presented by our servers when establishing an SSL connection. Changes we made to this .pem file in version 3.9.4 of the agent exacerbated this problem.

These allocations don’t seem to be a true leak: they are released when the associated Ruby objects that reference them are garbage collected. The problem is that in an idle application that isn’t servicing any requests, Ruby doesn’t see the allocations for these objects, and thus doesn’t realize that it should trigger GC.

Starting in version 3.9.7 of the Ruby agent (due to ship soon), we will instead maintain a persistent SSL connection to our servers for as long as possible in order to work around this issue. (We’re also planning to cache the results of parsing the .pem file in case the connection does get severed, but that likely won’t make it into 3.9.7.)
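To illustrate the .pem caching idea (this is a hypothetical sketch of the concept, not the agent’s actual implementation; the path and hostname are placeholders), the parsed certificate store can be built once and reused across connections:

    # Hypothetical sketch: parse the CA bundle once, then reuse the resulting
    # OpenSSL::X509::Store for every new connection instead of re-parsing the .pem.
    require "openssl"
    require "net/http"

    module CertStoreCache
      CA_BUNDLE_PATH = "/path/to/cert.pem"  # placeholder path to the bundled CA file

      def self.store
        @store ||= OpenSSL::X509::Store.new.tap { |s| s.add_file(CA_BUNDLE_PATH) }
      end
    end

    http = Net::HTTP.new("collector.example.com", 443)  # placeholder hostname
    http.use_ssl    = true
    http.cert_store = CertStoreCache.store  # cached; no .pem parsing on reconnect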

If you want to try the keepalive behavior today, you can do so by adding aggressive_keepalive: true to your newrelic.yml file, or by running:

heroku config:set NEW_RELIC_AGGRESSIVE_KEEPALIVE=1

As to the first issue of stable-but-high memory usage for some apps using the Ruby agent on Ruby 2.1, we’re still trying to figure out how best to address this. You may have some success in forcing major GC runs to be triggered more frequently by lowering the value of the RUBY_GC_OLDMALLOC_LIMIT_MAX environment variable, at the expense of the increased overhead of those additional GC runs.
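For example, on Heroku that tuning could be applied with something like the following (the specific value here is only an illustration, not a recommendation):

    heroku config:set RUBY_GC_OLDMALLOC_LIMIT_MAX=33554432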

Part of the difficulty in understanding these issues is that every application exercises the agent’s code paths with different frequencies. In our testing, we’ve not yet been able to reproduce deltas as extreme as the ones @richard_schneeman saw, but I suspect this is just a matter of us not triggering the right code paths in the agent.

That said, the investigation that we’ve done so far suggests that the most expensive feature of the agent from a memory perspective is our transaction trace feature. If you want to see whether that’s the case for your application, you can try setting the NEW_RELIC_TRANSACTION_TRACER_ENABLED environment variable to false. If you try this, we’d be very interested to know the results for your application.
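On Heroku, that could look like:

    heroku config:set NEW_RELIC_TRANSACTION_TRACER_ENABLED=false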

Quick update @benweint

After applying the NEW_RELIC_AGGRESSIVE_KEEPALIVE env variable, I’m seeing far more stable memory usage after ~48 hours of testing.

Look forward to getting the new release. Thanks!

Hey @stevenr - thanks for reporting back!

Unfortunately, we had to pull this change out of the 3.9.7 release at the last minute due to an issue we discovered with how this flag deals with the server closing the connection between harvest cycles.

We’re hoping to turn around another release soon to re-enable aggressive_keepalive, though, so I’ll update this thread when that’s available.

Ben

FYI, we’ve just released version 3.9.8 of the newrelic_rpm gem, which enables the aggressive_keepalive setting by default, and also caches the SSL CA certificate bundle that we use to validate certs presented by our servers.

Hey @benweint. Can you confirm that newrelic_rpm < 3.9.8 had memory leaks? If not, I’ll have to make changes after https://github.com/ASoftCo/leaky-gems/pull/5/files.

Hi @sergey-alekseev!

It’s true that versions of newrelic_rpm between 3.9.4 and 3.9.7 (inclusive) will exhibit high memory usage for low-throughput applications under Ruby 2.1+ that aren’t having major GC runs triggered by the Ruby VM on a regular basis.

Whether this constitutes a ‘leak’ is a question of semantics, but I’d argue that it’s a misapplication of the term: the memory is not ‘leaked’, in the sense that it will be recovered on a major GC run; the issue is just that the native allocations made by Ruby’s native OpenSSL extension don’t contribute toward triggering major GC runs.

I am seeing a large amount of memory usage from New Relic’s client as well. It’s on a Rails 5.0 app with NR 3.16, hosted on Heroku.

Hey @harrychen, thanks for posting your findings to the forum. I’d like to open a Support ticket so we can gather some additional information on the memory usage you saw, since you may not feel comfortable posting some of those details in a public forum.

You should get an email from us soon with a link to the ticket. If you do not receive it, please check your spam folder first and then contact us if necessary.