I recently had an eye-opening experience with regard to measuring latency.
Now that I have a better understanding of what averages actually show, I see so many NRQL examples that use “Average()”, and I’m just trying to pay this forward.
The man’s name is Gil Tene, and I promise you his blog is worth reading if you create any dashboards. He also has several YouTube videos called “How NOT to Measure Latency”.
Here’s a visual example of what the video and blog taught me:
This is real data from a site that said everything looked good, but we were getting a lot of complaints.
These charts were NOT tampered with.
Average and Median Duration Response Time
After Adding 95th and 99th Percentile Duration Response Time
At worst, with this chart we are ignoring 1% of our real results.
Let’s add the 99.9th Percentile Duration Response Time
The 99.9th is VERY close to the max, right? How bad could it be?
Max Duration Response Time
OUCH! These are not outliers; they are consistent transactions that are affecting real customers. Look at what we are ignoring, or trying to hide from, by only using average or median numbers. The perfect bell curve only exists in labs and math class. My favorite thing Gil Tene said was “Right at the center peak of the bell curve, is where the tooth fairy’s house is.”
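If you want to see the same effect without a dashboard, here is a minimal Python sketch. The numbers are made up for illustration (they are NOT the data behind the charts above), and the `percentile` helper is my own simple nearest-rank version, not what New Relic uses internally. It just summarizes one skewed set of response times the same way the charts do: average, median, 95th, 99th, 99.9th, and max.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile: the smallest sample that has at least
    pct percent of all samples at or below it. Expects a sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Hypothetical response times in seconds for 10,000 requests.
# Most are fast, and a small but steady tail is very slow.
durations = sorted(
    [0.1] * 9300    # 93% of requests
    + [0.5] * 500   # 5%
    + [2.0] * 150   # 1.5%
    + [8.0] * 45    # 0.45%
    + [30.0] * 5    # 0.05%
)

summary = {
    "average": statistics.mean(durations),
    "median": statistics.median(durations),
    "p95": percentile(durations, 95),
    "p99": percentile(durations, 99),
    "p99.9": percentile(durations, 99.9),
    "max": durations[-1],
}

for name, seconds in summary.items():
    print(f"{name:>8}: {seconds:.3f} s")
```

With this made-up data the average comes out around 0.2 seconds and the median at 0.1 seconds, so a chart of only those two looks healthy, while the 99th percentile is 2 seconds, the 99.9th is 8 seconds, and the max is 30 seconds. In NRQL, I believe the equivalent is to put `percentile(duration, 95, 99, 99.9)` and `max(duration)` next to `average(duration)` in the same query, but check the NRQL docs for the exact syntax on your event type.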