Load testing lab performance drops over 50% when using the New Relic .NET Agent

We have created a performance lab that uses JMeter to generate load against an ASP.NET site on Windows Server 2012 / IIS 8.0. In summary, we are seeing significant performance degradation when using the New Relic .NET Agent to monitor it.

The way we do load testing is to incrementally add virtual users until the average response time exceeds 30 seconds or the error rate becomes significant. In our lab, we can reach about 1332 virtual users (VUs) before the response time exceeds 30 seconds. But when we enable the New Relic .NET Agent and the New Relic Server Monitor, a few things happen at 612 VUs:

  1. The response time climbs dramatically, eventually exceeding 30 seconds.
  2. The error rate jumps dramatically, reaching over 50%. Almost all of these errors are “System.Data.SqlClient.SqlConnection.Open(): System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool.”
  3. The CPU usage of the process increases dramatically, up to 100%.

So our failure point is over 50% worse. As an experiment, we turned off the New Relic Server Monitor, since it isn’t as important to us, and we saw similar behavior, but at 720 VUs. So it seems to be mostly related to the New Relic .NET Agent.

  1. How much overhead should the NR .NET Agent put on the web server? According to what I’ve read, it should be minimal.
  2. Are there any known issues with the New Relic agent when CPU usage is high? We only saw these problems when the system was stuck at 100% CPU.
  3. Is it fairly common to use New Relic with load testing? I’ve seen a number of posts about things like BlazeMeter integration (NR and BlazeMeter) and people using JMeter with Ruby and Java (Page Loading Time with Apache JMeter), so it doesn’t seem like we’re using the tool in a particularly unusual way.
  4. Is there anything we can do to tune the agent? Maybe reduce what it records by changing something in newrelic.config?
  5. Why would New Relic be causing SqlConnection errors? I would not have expected New Relic to influence that. My only theory is that the system is so overloaded that it can’t process its own connections.
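
For reference on question 5: that error is the standard ADO.NET pool-exhaustion exception rather than something New Relic raises itself. Each distinct connection string gets its own pool (Max Pool Size defaults to 100), and when every pooled connection is checked out and none is released within the connect timeout (default 15 seconds), SqlConnection.Open() throws exactly the "Timeout expired … obtaining a connection from the pool" message we see. Any added per-request latency keeps connections checked out longer, so the same load drains a fixed-size pool sooner. The sketch below is purely illustrative (server, database, and pool values are placeholders, not our lab settings) and shows where the exception originates and how the pool limits can be raised via the connection string:

```csharp
// Illustrative sketch only: placeholder server/database names and pool sizes.
// Raising Max Pool Size (default 100) and/or Connect Timeout (default 15 s)
// gives requests more headroom when each one holds its connection a bit longer.
using System;
using System.Data.SqlClient;

class PoolTuningSketch
{
    const string ConnectionString =
        "Server=DBSERVER;Database=ShopDb;Integrated Security=true;" +
        "Max Pool Size=200;Connect Timeout=30";

    static void Main()
    {
        try
        {
            using (var conn = new SqlConnection(ConnectionString))
            using (var cmd = new SqlCommand("SELECT 1", conn))
            {
                conn.Open();                      // blocks until a pooled connection is free
                Console.WriteLine(cmd.ExecuteScalar());
            }                                     // Dispose returns the connection to the pool
        }
        catch (InvalidOperationException ex)
        {
            // This is the exception we see at 612 VUs: every pooled connection was
            // checked out and none was released within the connect timeout.
            Console.WriteLine("Pool exhausted: " + ex.Message);
        }
    }
}
```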

@rpuust -

  1. How quickly do you ramp up your VUs, and how long do your tests last? The presence of our profiler causes the .NET Framework to be JIT-compiled, which adds significant response-time and CPU overhead for tens of seconds.

  2. Related to 1: is IIS set to recycle app pools on excessive CPU or other resource use? Maybe there is a bad feedback loop going on (the profiler causing JIT compilation, causing excessive CPU, causing app pool recycles, causing the profiler to restart, and so on).

  3. Related to 1 and 2: can you let your application warm up before conducting the tests, i.e., not start the test until a minute or so after the initial request? (A minimal warm-up sketch follows this list.)

  4. Do you have any .NET agent custom instrumentation in place? I doubt it but I thought I would ask.

  5. Can you provide a permalink to your New Relic dashboard during one of these tests? Only New Relic will be able to view the permalinks.
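
For item 3, the warm-up can be as simple as requesting the main pages a few times before the JMeter ramp begins, so the instrumented code paths have already been JIT-compiled. Below is a minimal, illustrative sketch; the base URL and paths are placeholders for your site:

```csharp
// Illustrative warm-up sketch: hit the main pages a few times before the load
// ramp starts so the instrumented code paths are already JIT-compiled.
// The base URL and paths are placeholders.
using System;
using System.Net;

class WarmUp
{
    static void Main()
    {
        const string baseUrl = "http://perf-site.local";
        string[] paths = { "/", "/products", "/cart", "/checkout" };

        using (var client = new WebClient())
        {
            for (int pass = 0; pass < 3; pass++)          // a few passes is usually enough
            {
                foreach (string path in paths)
                {
                    try
                    {
                        client.DownloadString(baseUrl + path);
                        Console.WriteLine("warmed " + path);
                    }
                    catch (WebException ex)
                    {
                        Console.WriteLine(path + " failed: " + ex.Status);
                    }
                }
            }
        }
    }
}
```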

Hello Tjd,

Let me answer your questions:

  1. The test starts with 360 VUs and adds another 36 VUs every 5 minutes. The total test time is 3 hours, so at the end of the test there are 1,620 VUs (360 + 35 × 36).
  2. Recycle options on IIS are disabled.
  3. There is a warm-up script that does the same thing as the load script, but only for 2-3 minutes and with just 20 VUs. This script is executed twice because in the first run some pages take a long time to load, while in the second warm-up run those spikes no longer appear (this is normal IIS behavior, so nothing to worry about).
  4. No custom instrumentation.
  5. https://rpm.newrelic.com/accounts/314814/applications/6148796/traced_errors?tw[end]=1422951191&tw[start]=1422346391

Just a note: what we have observed is that most of the issues are DB “Timeout expired” errors. That’s why we will rerun NR on the same environment with an increased number of SQL connections. Could you give us an idea of how many SQL connections NR uses in order to monitor the database?
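
For what it’s worth, the per-component connection counts can also be read directly from SQL Server while a test is running, which may answer this empirically. The sketch below is only illustrative: the connection string is a placeholder, and VIEW SERVER STATE permission is needed to see sessions belonging to other logins. It groups sys.dm_exec_sessions by program and host to show how many sessions each component currently holds:

```csharp
// Illustrative sketch: count SQL Server sessions per client program/host.
// Placeholder connection string; requires VIEW SERVER STATE to see other logins' sessions.
using System;
using System.Data.SqlClient;

class ConnectionCensus
{
    const string ConnectionString =
        "Server=DBSERVER;Database=master;Integrated Security=true";

    static void Main()
    {
        const string query = @"
            SELECT program_name, host_name, COUNT(*) AS sessions
            FROM sys.dm_exec_sessions
            WHERE is_user_process = 1
            GROUP BY program_name, host_name
            ORDER BY sessions DESC;";

        using (var conn = new SqlConnection(ConnectionString))
        using (var cmd = new SqlCommand(query, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0} @ {1}: {2} sessions",
                        reader["program_name"], reader["host_name"], reader["sessions"]);
            }
        }
    }
}
```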

Hello Tjd,

The tests have completed and I have more information to share:

  1. It seems that increasing the SQL connection pool size helped a lot; we no longer see the DB timeout exceptions.
  2. You can see the summarized results of both executions below:

Some information about the image:

  • 30th Jan - NR off / off: the same test with the NR agent off and the NR Server Monitor off
  • 3rd Feb - NR on / on: the test with the NR agent on and the NR Server Monitor on
  • PERF01 / PERF02: the VMs that run the JMeter tests
  • Average: the average response time across PERF01 and PERF02
  • Diff: the difference between the Average columns of the two tests. As you can see, the average response time in the test with NR on is much higher than in the test without NR.

The permalink of the execution can be found here - https://rpm.newrelic.com/accounts/314814/applications/6148796?tw[end]=1422965880&tw[start]=1422954371

If you have any questions related to the information above, do not hesitate to ask.

Hello Tjd,

I have executed all benchmark scenarios with the NR agent and the NR Server Monitor on, and the same set of scenarios with both off, in order to see what the impact of NR is on response time.

As you can see from the images above, the degradation varies a lot between the tests. Some information about the scenarios:

  • P1 - simulates a registered user who checks out with a single product. The degradation is ~12%.
  • P2 - simulates a registered user who checks out with multiple products (5). The degradation is ~13.2%.
  • P3 - simulates a guest user who checks out with a single product. The degradation is ~20%.
  • P4 - simulates a guest user who checks out with multiple products (5). The degradation is ~19.6%.

During these tests none of the machines was highly utilized. Any idea whether such degradation is expected?

The permalink for the test with NR on can be found here - https://rpm.newrelic.com/accounts/314814/applications/6148796?tw[end]=1423042574&tw[start]=1423033238
From the results you can see six different periods, one for each execution:

  • 1 - warm-up script execution
  • 2 - P1 execution
  • 3 - P2 execution
  • 4 - P3 execution
  • 5 - P4 execution - here one of the transactions took 300 seconds, which is why the test was rerun
  • 6 - P4 execution (rerun)

I would appreciate any feedback about what might cause such degradation in the results.

Thank you in advance!

@nikolay_stanoev - Thanks for all the great data. At this point I’m going to need some additional information, specifically your logs in debug mode and probably memory dumps so we can see what the agent may be doing. For your privacy I will be opening a ticket for you in our ticketing system so that you can upload these. You should be receiving an email message from us shortly, with a link to this ticket. If you do not receive it, please check your spam folder first and contact us if necessary.

@nikolay_stanoev,

First, I want to thank you for your patience and help as we worked through this issue. The data you gathered from the tests you performed helped us to better understand this particular performance issue and made us aware of future agent improvements we will want to investigate.

I believe the following summarizes the issue:

The agent introduces a small fixed amount of overhead to each traced method. This overhead is almost always insignificant because the methods we trace usually take an order of magnitude longer to execute than our instrumentation.

However, if we’re tracing many methods per transaction (such as a lot of database calls) and the methods are really fast, then the fixed overhead can result in a larger percentage overhead for the transaction as a whole. We take this overhead very seriously. We are always seeking to reduce the overall fixed cost and to make sure we’re careful about which methods we instrument.
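
As a purely illustrative example (the numbers are hypothetical, not measurements from your environment): if a checkout transaction issues 200 database calls of roughly 1 ms each, the database portion is about 200 ms; a fixed cost of even 0.1 ms per traced call adds about 20 ms, i.e. roughly 10% overhead on that portion, whereas the same 0.1 ms on a single 200 ms query would be only 0.05%. Transactions dominated by many very short calls therefore show a noticeably larger relative impact.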

Please feel free to add any additional information you think may be helpful.