# %CPU Utilization Is A Lie (Brendan Long)

## Overview

Brendan Long explores why the CPU utilization percentages reported by common monitoring tools such as top do not reliably reflect how much work a CPU is actually doing or how much capacity remains. People often use these percentages to judge how close a server is to its limit, but the metric can be misleading.

## Key Insights

- CPU utilization is widely used to estimate how much of a machine's capacity is in use.
- However, utilization numbers do not scale linearly with actual work done.
- At a reported 50% CPU utilization, a machine is often doing 60-100% of its maximum possible work, depending on the workload.

## Experiment Setup

- Brendan wrote a script around stress-ng to run a variety of CPU stress tests on a Ryzen 9 5900X (12 cores, 24 hardware threads) desktop running Ubuntu.
- Tests ran in two modes:
  - One worker per core, at target utilization levels from 1% to 100%.
  - A varying number of workers (1 to N), each at 100% utilization.
- "Bogo ops" (a rough operation count) were recorded to measure actual computational throughput; a sketch of this kind of sweep appears at the end of this summary.
- Precision Boost Overdrive (AMD's turbo feature) was enabled.

## Results by Workload

### General CPU Stress Test

At a 50% CPU utilization reading, actual work done is about 60-65% of maximum capacity.

### 64-bit Integer Math

Even less linear: at a 50% utilization reading, actual work is 65-85% of maximum.

### Matrix Math (SIMD-heavy)

The most pronounced mismatch: at a "50%" utilization reading, actual work ranges from 80% to 100% of maximum. The post's opening screenshot, showing 50% overall utilization with half the cores at 100% and half near 0%, came from a matrix math test.

## Why Does This Happen?

### Hyperthreading

- The operating system reports 24 logical CPUs, but these are 12 physical cores, each exposing two hardware threads (SMT, or hyperthreading) that share execution resources.
- With 12 or fewer workers (one per physical core), the scheduler can give each worker a full core.
- Beyond 12 workers, threads share core resources, so utilization keeps climbing toward 100% while throughput does not double.
- The result is nonlinear scaling of utilization versus actual throughput.

### Turbo Frequency Scaling

- The CPU clock falls from roughly 4.9 GHz at low utilization to roughly 4.3 GHz at full load.
- The utilization metric is essentially busy time divided by total time, but the work done per unit of busy time shrinks as the clock drops: a core that is 50% busy near 4.9 GHz accomplishes more than half the work of a core that is 100% busy at 4.3 GHz (about 57% with these figures).
- The clock-speed drop (about 0.6 GHz, roughly 12%) means reported utilization rises faster than actual work done.

## Practical Implications

- CPU utilization percentages alone are not a reliable metric for capacity planning or performance monitoring.
- At moderate and higher readings, they underestimate how hard the CPU is already working, and so overstate the remaining headroom.
- Differences between processors (AMD vs. Intel), architectures, and turbo behavior further complicate interpretation.
- Benchmark the actual work done and compare it against a baseline maximum throughput rather than relying on utilization percentages.

## Recommended Approach

1. Benchmark how much work a server can do at maximum load without errors or unacceptable latency.
2. Regularly measure how much work the server is currently performing.
3. Compare those real work figures instead of depending solely on CPU utilization metrics; a sketch of this approach follows at the end of this summary.

## Additional Notes

- The term "bogo ops" references BogoMIPS, a rough Linux CPU benchmark.
- Thermal and power constraints affect boost frequencies and therefore utilization readings.
- Interpreting monitoring data correctly requires understanding the characteristics of the underlying CPU.
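## Appendix: Example Measurement Sketches

To make the experiment setup concrete, below is a minimal sketch of the kind of sweep described above. It is not Brendan's actual script (which this summary does not include); it assumes stress-ng is installed and that its `--cpu`, `--cpu-load`, `--timeout`, and `--metrics-brief` options are available, and the output parsing is a deliberately loose guess at the metrics-line format, which varies between stress-ng versions.

```python
#!/usr/bin/env python3
"""Sketch: run stress-ng CPU workers at several requested load levels and
record the "bogo ops" they report, so reported %CPU can be compared against
actual throughput. Assumptions: stress-ng is on PATH; the metrics line layout
may differ between versions, so parsing is intentionally loose."""
import os
import re
import subprocess

DURATION = "30s"                 # length of each run
WORKERS = os.cpu_count() or 1    # one worker per logical CPU, as in the post's first mode


def run_stress(load_pct: int) -> int:
    """Run stress-ng at the requested per-worker load and return total bogo ops."""
    result = subprocess.run(
        ["stress-ng", "--cpu", str(WORKERS), "--cpu-load", str(load_pct),
         "--timeout", DURATION, "--metrics-brief"],
        capture_output=True, text=True, check=True,
    )
    # stress-ng prints its metrics to stderr; look for the "cpu" stressor line
    # and take the first number after it (the bogo-ops total).
    for line in result.stderr.splitlines():
        match = re.search(r"\bcpu\s+(\d+)", line)
        if match:
            return int(match.group(1))
    raise RuntimeError("could not find a bogo-ops figure in stress-ng output")


if __name__ == "__main__":
    baseline = run_stress(100)  # throughput at full load = "maximum capacity"
    for load in (10, 25, 50, 75):
        ops = run_stress(load)
        print(f"requested load {load:3d}% -> {100 * ops / baseline:5.1f}% of max bogo ops")
```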
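And a toy illustration of the recommended approach: express load as observed work relative to a benchmarked maximum rather than as %CPU. "Requests per second" and the figures below are hypothetical stand-ins for whatever your service's real unit of work is; the post does not prescribe specific code.

```python
"""Toy sketch: track capacity as observed work relative to a benchmarked
maximum instead of %CPU. All values here are hypothetical placeholders."""

# Measured once under controlled load: the highest sustained rate this server
# handled without errors or unacceptable latency (hypothetical value).
MAX_REQUESTS_PER_SEC = 12_000.0


def capacity_used(current_requests_per_sec: float) -> float:
    """Fraction of benchmarked capacity currently in use."""
    return current_requests_per_sec / MAX_REQUESTS_PER_SEC


if __name__ == "__main__":
    # In practice these observations would come from your metrics pipeline,
    # not from CPU counters.
    for observed in (3_000.0, 6_000.0, 11_500.0):
        print(f"{observed:8.0f} req/s -> {capacity_used(observed):.0%} of benchmarked capacity")
```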