A critical function of Application Performance Monitoring (APM) tools is to alert on performance degradation based on end-user experience and application-component data. The challenge is that alerting must be accurate: it should fire on true performance degradation from the end-user perspective, with tangible user impact (the number of users affected) and business impact (the cost of users being unable to execute a transaction in a performant way).
If the alert threshold is too low, it produces too many false positives; this causes alert fatigue, and operations staff lose confidence in the alerting system. If the threshold is too high, operations teams can miss real performance degradation, resulting in poor user experiences and, ultimately, business impact.
New Relic has adopted Apdex, an industry standard for measuring and reporting application performance in the context of actual end-user experience. This is a foundational concept to solving the complex alerting problem and has been refined over the last several years by servicing millions of applications.
Before we do a deeper dive on New Relic’s specific methodology and rationale on why Apdex was chosen, let us explore representative web performance data and show why other approaches are suboptimal.
A deeper understanding of web performance
Based on the application performance data collected by New Relic, we see a consistent performance profile of web traffic response times as shown in the figure below:
At New Relic, we call this an Erlang distribution, which can be approximated by a log-normal distribution. This pattern is remarkably consistent across varying intervals of time, from 2-minute windows to hourly, daily, weekly, and annual periods. The profile is also consistent for specific user transactions, such as ‘Search for a product’ or ‘Add to Shopping Cart’, within an application. The reason for this distribution is that these systems nearly always fit the model of a queueing system, much like picking the checkout line with the shortest wait.
Below is the same histogram but with markers to indicate the mean response time (green line), 95th percentile response time (dashed line) and the median or 50th percentile (red line). Also indicated are the middle two quartiles, the red region, representing the response times that fall between the 25th and 75th percentiles.
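To make these markers concrete, here is a minimal simulation of this kind of response-time profile. The log-normal parameters and sample count are illustrative assumptions, not New Relic data; the point is that the long tail of slow requests pulls the mean above the median:

```python
import random
import statistics

random.seed(42)

# Simulate 10,000 response times (ms) from a log-normal distribution,
# a common approximation of real web response-time profiles.
# mu and sigma are illustrative values, not New Relic parameters.
samples = [random.lognormvariate(5.5, 0.6) for _ in range(10_000)]
samples.sort()

mean = statistics.mean(samples)
median = samples[len(samples) // 2]   # 50th percentile
p95 = samples[int(0.95 * len(samples))]  # 95th percentile

print(f"mean:   {mean:.0f} ms")
print(f"median: {median:.0f} ms")
print(f"p95:    {p95:.0f} ms")
# The skewed tail pushes the mean above the median, which is why
# mean-based reasoning overstates "typical" response time.
```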
Given this understanding let us consider the various approaches that APM vendors have taken in the past.
Static alert threshold based on mean
The most simplistic approach is to set the threshold to a static value and alert on anything that breaches it, or on a specific number of breaches in a given interval of time. Setting the threshold at the mean causes a flood of false alerts: as the graphic above shows, the mean sits close to the 75th percentile, implying that a large percentage of transactions will breach a mean-based threshold. Hence most users of traditional APM solutions either do not use alerts actively in their daily jobs or set the thresholds far too high to suppress false alerts.
Simple rolling averages over a period of time
This is an improvement on the previous approach, as it gives more weight to recent performance when computing the average response time. But the fundamental nature of web traffic does not vary between these intervals: the performance profile within each time frame remains consistent and follows the same pattern, the Erlang distribution. Therefore, a substantial fraction of response times, often 25% or more, will exceed the mean for any period over which the mean is calculated. If the mean is used as a threshold, alert noise and eventual fatigue result, so most customers either set the thresholds far too high or ignore the alerts. The other downside of this approach is that it fails to alert on transactions that gradually get slower and slower, because the thresholds rise along with them and no longer reflect users' expectations of performance.
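A rough simulation illustrates the problem with using each window's own mean as a threshold. The distribution parameters and window size here are illustrative assumptions; the consistent result is that a sizeable fraction of every window breaches its own mean:

```python
import random
import statistics

random.seed(7)

# Simulate a stream of response times (ms) and compute, per window,
# the fraction of requests slower than that window's own mean.
# WINDOW and the log-normal parameters are illustrative assumptions.
WINDOW = 500
stream = [random.lognormvariate(5.5, 0.6) for _ in range(WINDOW * 20)]

fractions = []
for i in range(0, len(stream), WINDOW):
    window = stream[i:i + WINDOW]
    mean = statistics.mean(window)
    slower = sum(1 for r in window if r > mean)
    fractions.append(slower / len(window))

# Every window has a large share of requests above its own mean,
# so a rolling-mean threshold would flag normal traffic constantly.
print(min(fractions), max(fractions))
```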
Dynamic baselines of response time based on averages
Some APM vendors compute a dynamic threshold based on the hour of day and/or day of week or month. In this approach, the APM tool computes the average response time for that specific hour and sets that average as the threshold for the representative period. Because this approach also uses an average to compute the threshold, it gives extra weight to outliers and sets the threshold higher than the performance a typical user would expect. Again, this goes back to the characteristic of the performance profile being an Erlang distribution.
While this is an improvement in reducing false alerts, it misses the fundamental point of APM: these dynamic baselines mask predictable application performance inconsistencies as normal (ie: maintenance windows). Is a consistent 2-second response time on a Monday morning equivalent to a 10-second response time on a Friday evening? If this trend is consistent across multiple weeks for a specific application, the dynamic baseline marks the behavior as "normal" and will not alert you.
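A minimal sketch of such a per-slot baseline (the traffic records below are hypothetical) shows how a consistently slow time slot becomes its own "normal" and therefore never triggers an alert:

```python
from collections import defaultdict
import statistics

# Sketch of a per-hour-of-week dynamic baseline: the average response
# time observed for each (day_of_week, hour) slot becomes that slot's
# alert threshold. The history records are hypothetical examples.
history = [
    (0, 9, 1800), (0, 9, 2200), (0, 9, 2000),       # Monday 9am
    (4, 18, 9000), (4, 18, 11000), (4, 18, 10000),  # Friday 6pm
]

buckets = defaultdict(list)
for day, hour, response_ms in history:
    buckets[(day, hour)].append(response_ms)

baseline = {slot: statistics.mean(times) for slot, times in buckets.items()}

# A ~10-second Friday-evening slowdown is baked in as "normal",
# so the baseline will never flag it, even though users suffer.
print(baseline[(0, 9)], baseline[(4, 18)])
```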
Dynamic baselines can be helpful in automatically identifying deviations, but alerting on them does take time and effort to tune properly.
New Relic’s Sophisticated Alerting driven by Apdex
New Relic has used Apdex since its early days to alert accurately on poor performance without missing degradations or generating false alerts. We have refined this approach over the last several years by working closely with SMB customers, who by definition have fewer resources and cannot waste time and critical staff on false alerts. By analyzing big data streams across our APM product portfolio, we have developed unique, highly differentiated best practices for setting the right Apdex configuration values. Before we dive into the specifics of our approach, let us explain the concept of Apdex.
What is Apdex?
The Apdex method converts many measurements into one number on a uniform scale of 0-to-1 (0 = no users satisfied, 1 = all users satisfied). The resulting Apdex score is a numerical measure of user satisfaction with the performance of enterprise applications. This metric can be used to report on any source of end-user performance measurements for which a performance objective has been defined.
The Apdex formula is the number of satisfied samples plus half of the tolerating samples plus none of the frustrated samples, divided by all the samples:
Apdex_T = (Satisfied Count + Tolerating Count / 2) / Total Samples

where the subscript T is the target response time, and the tolerable time is assumed to be 4 times the target time. It is easy to see how this ratio relates directly to users' perception of satisfactory application responsiveness.
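The formula can be sketched in a few lines of code. The function name and the example response times are illustrative:

```python
def apdex(response_times_ms, t_ms):
    """Compute an Apdex score for a list of response times (ms).

    Satisfied:  response <= T
    Tolerating: T < response <= 4T
    Frustrated: response > 4T (counts as zero)
    """
    if not response_times_ms:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for r in response_times_ms if r <= t_ms)
    tolerating = sum(1 for r in response_times_ms if t_ms < r <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# Example with T = 500 ms: two satisfied samples, one tolerating
# (1200 ms <= 4T = 2000 ms), one frustrated (2500 ms > 2000 ms).
score = apdex([300, 450, 1200, 2500], t_ms=500)
print(score)  # (2 + 1/2) / 4 = 0.625
```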
Apdex is like a histogram with just three buckets representing user experience: Satisfied, Tolerating and Frustrated. The buckets represent requests that satisfy the user, requests that users merely tolerate, and requests that fail to meet user expectations entirely. The bucket boundaries are 0, T, and 4T, where T is a parameter you choose in advance. Here's what that looks like if you shade the Apdex buckets in our histogram:
Key Benefits of Apdex
Here are some of the key benefits of Apdex –
Apdex provides a way to compare response time patterns across time and deviation between these patterns. This helps overcome the fundamental flaw of using an average to set thresholds for performance profile that follow an Erlang distribution.
Provides one uniform measure across any number of transactions or applications. Software teams and business stakeholders do not have to work out whether 500 ms is a good response time for a checkout page, or whether 3 seconds is a good response time for a complex search query. Apdex simply represents the user experience.
Enterprises can adopt this method to measure across multiple applications in a portfolio. This allows organizations with a number of applications to normalize the way they measure performance across those applications.
Simplicity of the measure leads to institutionalizing the idea of performance measurement. This in turn leads to monitoring and proactive action and work streams to improve performance of specific transactions and applications.
As Apdex compares shifts in performance patterns every minute, it eliminates alerting noise and saves IT operations teams considerable time otherwise lost to wild chases after random outliers and false positives.
The ability to set different Apdex t values for different key transactions provides flexibility in measuring and quantifying different applications and transactions if needed.
New Relic methodology to implement sophisticated alerting
New Relic has developed best practices by monitoring thousands of applications; here are the best practices for setting and using Apdex codified into simple actionable steps:
Install New Relic agents and begin collecting data for your application
Identify your key user/business transactions: a meaningful decomposition of the critical use cases your application supports.
Identify the Key Transactions for each one of the critical use cases and set the threshold T to quantify and measure a good user experience
After reviewing trended performance data for some time, look at the New Relic histogram view for each of the Key Transactions and set the value of T to the 85th percentile.
The Apdex measure is calculated for each of the Key Transactions every 2 minutes for the previous 5-minute interval.
Within 15-20 minutes of configuring T for your application, you are set to benefit from real-time scoring of every user transaction against the true expectations of your users, not some vendor's algorithm that computes a dynamic expectation of your user community.
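The 85th-percentile guideline from the steps above can be sketched as follows. The helper function and the sample response times are hypothetical illustrations:

```python
def suggest_t(response_times_ms, percentile=0.85):
    """Pick an Apdex T value at a given percentile of observed
    response times, per the 85th-percentile guideline above."""
    ordered = sorted(response_times_ms)
    index = min(int(percentile * len(ordered)), len(ordered) - 1)
    return ordered[index]

# Hypothetical trended response times (ms) for one key transaction.
samples = [120, 150, 180, 200, 240, 260, 300, 340, 900, 4000]

# Roughly 85% of observed requests fall at or below the suggested T,
# so scores below 0.85 signal a genuine shift in user experience.
print(suggest_t(samples))
```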
Debunking myths about Apdex
Apdex is very complex and time-consuming to setup
As described previously, the alerting methodology takes 15-20 minutes to set up and is based on your view of your users' expectations. Developers and infrastructure teams spend inordinate amounts of time building capability that meets users' expectations of performance.
Apdex can detect business impact only if more than 5% of transactions are slow
This is absolutely false and reflects a limited understanding of New Relic's implementation of Apdex. New Relic computes Apdex every 2 minutes irrespective of the number of users in the system. Every 2 minutes, New Relic compares the Apdex score and alerts based on the deviation in that score. New Relic also offers the flexibility to set different T values for different transactions, configuring the Apdex score for each specific transaction. For instance, the T value for a key transaction such as "browse a product category" or "search for a specific product" might be set to 1 second, while a report that pulls all users who bought a certain product today might be a less frequent, data-intensive transaction for which the appropriate T value is 7 seconds.
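Per-transaction T values can be sketched as a simple lookup. The transaction names, thresholds, and observed timings below are illustrative, not New Relic defaults:

```python
# Per-transaction Apdex targets (ms); values mirror the example above
# but are illustrative assumptions, not New Relic defaults.
T_MS = {
    "browse_product_category": 1000,
    "search_product": 1000,
    "daily_purchase_report": 7000,
}

def apdex(response_times_ms, t_ms):
    """Standard Apdex: satisfied <= T, tolerating <= 4T, else frustrated."""
    satisfied = sum(1 for r in response_times_ms if r <= t_ms)
    tolerating = sum(1 for r in response_times_ms if t_ms < r <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(response_times_ms)

# Hypothetical 2-minute window of observed response times (ms).
observed = {
    "browse_product_category": [400, 800, 1500, 900],
    "daily_purchase_report": [5000, 6500, 9000],
}

# Each transaction is scored against its own target, so a slow report
# does not drag down the score of a fast browsing transaction.
for name, times in observed.items():
    print(name, apdex(times, T_MS[name]))
```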
Apdex based alerting causes false alarms due to static thresholds
User expectations for a specific kind of transaction are very consistent (static), irrespective of the time of day or the overall load on the application (ie: spikes, holiday season). Given that user expectations are consistently high and not dynamic based on extraneous factors, how do we prevent false alerts?
Random low static thresholds lead to a large number of alerts being generated. New Relic solves this by setting different thresholds for different transactions within the same application as described in the simple methodology in an earlier section.
To prevent sudden spikes in alerts during consistent performance degradation or during an outage, New Relic aggregates alerts into incidents and does not inundate the user’s inbox with more alerts on the same issue.