That’s 100% correct @philweber. I had to do a lot of reading in the attribute dictionary to understand that the data is not always what it seems to be.
cpuPercent is an aggregate metric, defined as:
Total CPU utilization as a percentage. This is not an actual recorded value; it is an alias that combines percentage data from cpuSystemPercent, cpuUserPercent, cpuIoWaitPercent and cpuStealPercent.
cpuPercent can be helpful for trends, but if you use it for alerting you will chase your tail and generate a lot of pager fatigue!
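To make the alias concrete, here is a rough Python sketch of how the four components roll up into one number. It assumes "combines" means a simple sum, and the sample values are made up for illustration; only the four attribute names come from the definition above.

```python
# Component attributes named in the cpuPercent definition.
COMPONENTS = (
    "cpuSystemPercent",
    "cpuUserPercent",
    "cpuIoWaitPercent",
    "cpuStealPercent",
)


def cpu_percent(sample):
    """Add up the component percentages (assumption: the alias is a plain sum)."""
    return sum(sample.get(name, 0.0) for name in COMPONENTS)


# Hypothetical sample: modest user/system time, heavy I/O wait.
sample = {
    "cpuSystemPercent": 5.0,
    "cpuUserPercent": 20.0,
    "cpuIoWaitPercent": 40.0,
    "cpuStealPercent": 0.5,
}

total = cpu_percent(sample)
print(f"cpuPercent (combined): {total}")
```

In this made-up sample most of the total comes from I/O wait rather than user or system time, which is the kind of detail the single combined number hides.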
To measure CPU strain I have had success using loadAverageFiveMinute. A measurement approaching 40 in my environment tells me the CPU is not able to keep up, and login times, jobs, and system tasks (patch scans, A/V scans, etc.) begin to suffer or don’t run at all.
This metric is defined as:
Over the last 5 minutes, the average number of system processes, threads, or tasks that are waiting and ready for CPU time.
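If you want to play with the idea locally, here is a minimal Python sketch of that kind of check. The threshold of 40 is only the number that mattered in my environment, so treat it as a placeholder to tune; `load_is_strained` is a hypothetical helper name, and `os.getloadavg()` is the standard-library call that returns the 1, 5 and 15 minute load averages on Unix-like systems.

```python
import os

# Placeholder threshold: 40 is specific to one environment, tune it for yours.
LOAD_THRESHOLD = 40.0


def load_is_strained(five_minute_load, threshold=LOAD_THRESHOLD):
    """Return True when the five-minute load average reaches the threshold."""
    return five_minute_load >= threshold


# os.getloadavg() is not available on every platform (e.g. Windows),
# and can raise OSError if the load average cannot be obtained.
if hasattr(os, "getloadavg"):
    try:
        _one, five, _fifteen = os.getloadavg()
        print(f"5-minute load: {five:.2f}, strained: {load_is_strained(five)}")
    except OSError:
        print("load average unavailable on this host")
```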
Again though, you have to understand that this can be impacted by thread counts and other application configurations. Apps that generate tons of threads can quickly exhaust what a CPU can handle at any one time. I had to work with some teams to identify their issues and tweak how many threads were running at any one time, even though the threads were tiny and the CPU appeared relatively available. The CPU had to send “Waits” because it couldn’t keep up.
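As an illustration of what "tweaking how many threads were running at any one time" can look like, here is a small Python sketch that caps concurrency with a fixed-size pool. It is not what those teams actually ran; `MAX_WORKERS`, `tiny_task` and `run_capped` are hypothetical names, and the cap of 4 is arbitrary.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # hypothetical cap on threads running at once

_lock = threading.Lock()
_active = 0
peak_active = 0  # highest number of tasks observed running concurrently


def tiny_task(n):
    """A small unit of work that also records how many tasks overlap."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # stand-in for a short piece of real work
    with _lock:
        _active -= 1
    return n * n


def run_capped(items, max_workers=MAX_WORKERS):
    """Run tiny_task over items, never using more than max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tiny_task, items))


results = run_capped(range(20))
print(f"finished {len(results)} tasks, peak concurrency {peak_active}")
```

The 20 tasks queue up behind the pool instead of each getting its own thread, so the number of runnable threads stays bounded no matter how much work is submitted.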