If you use averages, you ARE missing the most critical events

I recently had an eye-opening experience with regard to measuring latency.
Now that I have a better understanding of what averages actually show, I see so many NRQL examples that use `average()`, and I'm just trying to pay this forward.
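To make the change concrete, here is a hedged NRQL sketch of what I mean. The `Transaction` event type and `duration` attribute are just typical APM names used for illustration — swap in whatever your account actually records:

```sql
// Hides the tail: one terrible second averaged across thousands of fast requests
SELECT average(duration) FROM Transaction SINCE 1 hour ago TIMESERIES

// Shows the tail: what the slowest 5% and 1% of requests actually experience
SELECT percentile(duration, 95, 99) FROM Transaction SINCE 1 hour ago TIMESERIES
```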

The author's name is Gil Tene, and I promise you his material is worth reading if you create any dashboards. He also has several YouTube talks called "How NOT to Measure Latency."

Here’s a visual example of what the video and blog taught me:

This is real data from a site where everything looked good on the dashboards, but we were getting a lot of complaints.

These charts were NOT tampered with.
Average and Median Duration Response Time

After Adding 95th and 99th Percentile Duration Response Time
Even with this chart, we are still ignoring 1% of our real results.

Let's add the 99.9th Percentile Duration Response Time.

The 99.9th percentile is VERY close to the max, right? How bad could it be?
Max Duration Response Time
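The whole progression above can go on a single chart, so the average, the tail percentiles, and the max are always visible side by side. A sketch, again assuming a hypothetical `Transaction` event with a `duration` attribute (the 50th percentile is the median):

```sql
SELECT average(duration), percentile(duration, 50, 95, 99, 99.9), max(duration)
FROM Transaction SINCE 1 hour ago TIMESERIES
```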

OUCH. These are not outliers; they are consistent transactions affecting real customers... but look at what we are ignoring, or trying to hide from, by only using average or median numbers. The perfect bell curve only exists in labs and math class. My favorite thing Gil Tene said was, "Right at the center peak of the bell curve is where the tooth fairy's house is."


This is a great write up - Thanks so much for sharing @reopelle.scott :smiley:


Is there a way to plot a bell curve in NR? That is, to plot the distribution of the values of a given attribute of all events during a given period of time?

Hi, @jwbrown: Yes, you may use the histogram() function in NRQL, or switch charts on certain dashboards to histogram view.
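To make that concrete, here is a minimal sketch of the `histogram()` form. NRQL's `histogram(attribute, ceiling, buckets)` splits values from 0 up to the ceiling into equal-width buckets; the `Transaction`/`duration` names and the 10-second ceiling are just example values:

```sql
// 20 buckets of 0.5 s each, covering durations from 0 to 10 seconds
SELECT histogram(duration, 10, 20) FROM Transaction SINCE 1 day ago
```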
