%CPU Utilization Is A Lie

Brendan Long

I deal with a lot of servers at work, and one thing everyone wants to know about their servers is how close they are to being at max utilization. It should be easy, right? Just pull up top or another system monitor tool, look at network, memory and CPU utilization, and whichever one is the highest tells you how close you are to the limits.

[Screenshot of a system monitor app showing 24 cores, half of which are at 100% utilization and half of which are close to 0%.] For example, this machine is at 50% CPU utilization, so it can probably do twice as much of whatever it's doing.

And yet, whenever people actually try to project these numbers, they find that CPU utilization doesn't quite increase linearly. But how bad could it possibly be?

To answer this question, I ran a bunch of stress tests and monitored both how much work they did and what the system-reported CPU utilization was, then graphed the results.

Setup

For my test machine, I used a desktop computer running Ubuntu with a Ryzen 9 5900X (12 core / 24 thread) processor. I also enabled Precision Boost Overdrive (i.e. Turbo).

I vibe-coded a script that runs stress-ng in a loop, first using 24 workers and attempting to run them each at different utilizations from 1% to 100%, then using 1 to 24 workers all at 100% utilization. It used several different stress-testing methods and measured the number of operations that could be completed ("Bogo ops^[1]").

The reason I did two different methods was that operating systems are smart about how they schedule work, and scheduling a small number of workers at 100% utilization can be done optimally (spoilers) but with 24 workers all at 50% utilization it's hard for the OS to do anything other than spreading the work evenly.
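As a rough illustration of the two sweeps, here is a minimal Python sketch that only builds the stress-ng command lines. It is not the script used for these results: the function names, the 10-second timeout, and the default `--cpu-method all` are assumptions made for the example. The flags themselves (`--cpu`, `--cpu-load`, `--cpu-method`, `--timeout`, `--metrics-brief`) are standard stress-ng options.

```python
import shlex

TOTAL_THREADS = 24  # hardware threads on the test machine (12 cores x 2)


def load_sweep_cmd(load_pct, method="all", seconds=10):
    """24 workers, each throttled to roughly `load_pct` percent busy."""
    return [
        "stress-ng", "--cpu", str(TOTAL_THREADS),
        "--cpu-load", str(load_pct),
        "--cpu-method", method,
        "--timeout", f"{seconds}s",
        "--metrics-brief",  # prints bogo ops per stressor at the end
    ]


def worker_sweep_cmd(workers, method="all", seconds=10):
    """`workers` workers, each running flat out (no --cpu-load throttle)."""
    return [
        "stress-ng", "--cpu", str(workers),
        "--cpu-method", method,
        "--timeout", f"{seconds}s",
        "--metrics-brief",
    ]


def all_commands(method="all", seconds=10):
    """First sweep: 1%..100% load on 24 workers. Second: 1..24 workers at 100%."""
    cmds = [load_sweep_cmd(p, method, seconds) for p in range(1, 101)]
    cmds += [worker_sweep_cmd(n, method, seconds)
             for n in range(1, TOTAL_THREADS + 1)]
    return cmds


if __name__ == "__main__":
    # Print a few commands instead of running them.
    for cmd in all_commands()[:3]:
        print(shlex.join(cmd))
```

A real harness would run each command (for example with `subprocess.run`), parse the bogo ops from the metrics output, and record the system-reported CPU utilization alongside it.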

Results

You can see the raw CSV results here.

General CPU

The most basic test just runs all of stress-ng's CPU stress tests in a loop.

You can see that when the system is reporting 50% CPU utilization, it's actually doing 60-65% of the maximum work it can do.

64-bit Integer Math

But maybe that one was just a fluke. What if we just run some random math on 64-bit integers?

This one is even worse! At "50% utilization", we're actually doing 65-85% of the max work we can get done. It can't possibly get worse than that though, right?

Matrix Math

Something is definitely off. Doing matrix math, "50% utilization" is actually 80% to 100% of the max work that can be done.

In case you were wondering about the system monitor screenshot from the start of the article, that was a matrix math test running with 12 workers, and you can see that it really did report 50% utilization even though additional workers do absolutely nothing (except make the utilization number go up).

Bonus: Nginx

Someone on Hacker News suggested running a real benchmark, so I ran the Phoronix Test Suite Nginx benchmark pinned to 1-24 cores (unfortunately I can't control the CPU utilization more finely, so you only get one graph).

Here, we get reported utilization starting as an underestimate and then getting worse. At 50% reported utilization, we're actually at 80% of max requests per second, and at 80% reported utilization we're actually at 100% of our request capacity.

What's Going On?

SMT / Hyperthreading

You might notice that the graphs keep changing slope at 50%, and I've helpfully added piecewise linear regressions showing the fit.

The main reason this is happening is Simultaneous Multithreading (SMT) / hyperthreading: Half of the "cores" on this machine (and most machines) are sharing resources with other cores. If I run 12 workers on this machine, they each get scheduled on their own physical core with no shared resources, but once I go over that, each additional worker is sharing resources with another. In some cases (general CPU benchmarks), this makes things slightly worse, and in some cases (SIMD-heavy matrix math), there are no useful resources left to share.
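A toy model makes the kink at 50% easier to see. The sketch below assumes the first 12 workers each get a full physical core and every worker after that adds only a fraction `smt_gain` of a core's throughput. The `smt_gain` values are hypothetical numbers picked for illustration, not measurements from these tests.

```python
PHYSICAL_CORES = 12
THREADS = 24  # what the OS counts as "CPUs"


def reported_utilization(workers):
    """What a system monitor shows: busy hardware threads / all threads."""
    return workers / THREADS


def fraction_of_max_work(workers, smt_gain):
    """Share of peak throughput reached with `workers` busy threads.

    smt_gain is the extra throughput a core's second thread adds,
    as a fraction of one full core (0.0 = nothing, 1.0 = a whole core).
    """
    def throughput(n):
        first_threads = min(n, PHYSICAL_CORES)
        second_threads = max(0, n - PHYSICAL_CORES)
        return first_threads + smt_gain * second_threads

    return throughput(workers) / throughput(THREADS)


if __name__ == "__main__":
    for gain in (0.0, 0.25, 0.6):  # hypothetical SMT gains
        print(f"smt_gain={gain}: reported {reported_utilization(12):.0%}, "
              f"actual {fraction_of_max_work(12, gain):.1%}")
```

With 12 workers the monitor always says 50%, but the model gives 100% of max work for a gain of 0, 80% for 0.25, and 62.5% for 0.6, which is the same shape as the matrix math and general CPU graphs.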

Turbo

It's harder to see, but Turbo is also having an effect. This particular processor runs at 4.9 GHz at low utilization, but slowly drops to 4.3 GHz as more cores become active^[2].

Note the zoomed-in y-axis. The clock speed "only" drops by 15% on this processor.

Since CPU utilization is calculated as busy cycles / total cycles, this means the denominator is getting smaller as the numerator gets larger, so we get yet another reason why actual CPU utilization increases faster than linearly.
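To get a feel for the size of the Turbo effect on its own, here is a small sketch that ignores SMT and assumes the clock falls linearly from 4.9 GHz with one busy core to 4.3 GHz with all 12 busy. The linear shape is an assumption for illustration; real boost curves are not that tidy.

```python
CORES = 12
F_ONE_CORE_GHZ = 4.9   # clock with a single busy core (from the text)
F_ALL_CORE_GHZ = 4.3   # clock with every core busy (from the text)


def clock_ghz(active):
    """Assumed linear drop in clock speed as more cores become active."""
    if active <= 1:
        return F_ONE_CORE_GHZ
    step = (F_ONE_CORE_GHZ - F_ALL_CORE_GHZ) / (CORES - 1)
    return F_ONE_CORE_GHZ - step * (active - 1)


def fraction_of_max_cycles(active):
    """Cycles delivered with `active` busy cores, relative to all cores busy."""
    return active * clock_ghz(active) / (CORES * clock_ghz(CORES))


if __name__ == "__main__":
    for n in (1, 3, 6, 9, 12):
        print(f"{n:2d} busy cores: reported {n / CORES:.0%}, "
              f"share of max cycles {fraction_of_max_cycles(n):.1%}")
```

In this model, every point below full load delivers a slightly larger share of the machine's peak cycles than the reported percentage suggests, because the cores that are busy run faster than they will at full load.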

Does This Matter?

If you look at CPU utilization and assume it will increase linearly, you're going to have a rough time. If you're using the CPU efficiently (running above "50%" utilization), the reported utilization is an underestimate, sometimes significantly so.

And keep in mind that I've only shown results for one processor, but SMT performance and Turbo behavior can vary wildly between different processors, especially from different companies (AMD vs Intel).

The best way I know to work around this is to run benchmarks and monitor actual work done:

1. Benchmark how much work your server can do before having errors or unacceptable latency.
2. Report how much work your server is currently doing.
3. Compare those two metrics instead of CPU utilization.
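The comparison itself is just a ratio. Here is a minimal sketch, where the function names and the requests-per-second figures are made up for the example:

```python
def capacity_used(current_rate, benchmarked_max_rate):
    """Fraction of benchmarked capacity in use (same units for both rates)."""
    if benchmarked_max_rate <= 0:
        raise ValueError("benchmarked_max_rate must be positive")
    return current_rate / benchmarked_max_rate


def headroom_factor(current_rate, benchmarked_max_rate):
    """How many times the current load the server handled in the benchmark."""
    if current_rate <= 0:
        raise ValueError("current_rate must be positive")
    return benchmarked_max_rate / current_rate


if __name__ == "__main__":
    # Hypothetical numbers: the benchmark topped out at 10,000 requests/s
    # with acceptable latency, and production currently serves 8,000.
    print(f"capacity used: {capacity_used(8_000, 10_000):.0%}")
    print(f"headroom: {headroom_factor(8_000, 10_000):.2f}x")
```

The useful property is that both numbers are in units of actual work, so the ratio means the same thing regardless of how SMT or Turbo behave on a given machine.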

^[1] Bogo ops is presumably a reference to BogoMIPS, a "bogus" benchmark that Linux does at startup to very roughly understand CPU performance.
^[2] One of the main constraints processors operate under is needing to dissipate heat fast enough. When only one core is running, the processor can give that core some of the heat headroom that other cores aren't using and run it faster, but it can't do that when all of the cores are running.
Power usage works similarly and can be a constraint in some environments (usually not in a desktop computer, but frequently in servers).

This makes me think about the battery indicator on my electric bike.

The battery indicator underestimates how much capacity is remaining. It sounds like with CPU utilization, the issue is that it overestimates how much capacity is remaining.
How much capacity remains depends on how it is being used. Slope and pedal assist level are probably the two big things for e-bike batteries.

There seems to be a more general problem at play here, where some metric M is reported, what people actually care about is N, and M appears to be a better proxy for N than it actually is. I feel like in these situations what'd make sense would be to replace M with something less suggestive.

I also kinda feel like the concept of disguised queries applies here. With bleggs, it's helpful to realize that you're not really asking whether it's a blegg, you're asking whether it has vanadium. With batteries, you're not really asking how much battery is remaining, you're asking how many miles you could ride. And with CPU utilization, you're not really asking what the raw number is, you're asking how much more work you could do before you reach the limit. I might be misunderstanding or misapplying the concept of disguised queries though.

Yeah, I think the issue with batteries is actually even more similar to CPUs than you'd expect. Even on a "how much power is left" level, I think the software is trying to guess based on the voltage and some data about similar batteries. The actual amount of usable power is how much you can pull before the voltage drops to an unusable level, and that depends on things like manufacturing, age, temperature, the whims of chemical reactions, etc.

In the case of batteries, the software can make some reasonable assumptions and give you a helpfully-pessimistic estimate, but it's hard to do that with CPUs since the possible range is so workload-dependent. In my matrix math / SIMD example, I'm doing 100% of the possible matrix math while using half the cores, but I could do some amount of other work using hyperthreads if it used the right resources. So the optimistic metric can overestimate by up to 100%, but that also means a pessimistic estimate would underestimate by up to 100%, and neither of those is particularly useful.

Even then, sometimes the assumption isn't pessimistic enough and users will complain about e.g. their cell phone's battery dropping immediately from 10% to 0% when they launch an app that causes the CPU to demand more current than the almost-dead battery can provide.

It’s even worse with real workloads, where I/O, shared data contention, cache variance, and simple differences in unit of work size all make it impossible to predict the optimal parallel vs queue vs reject decision.

There is another factor that is often relevant here, which is queuing theory. There's a famous chart (scroll down here), plotting "Percent of capacity used" on the x-axis, and "Expected wait time" on the y-axis. Wait time is approximately flat until the system reaches 70-80% load, at which point the wait time soars. This is used to decide things like, "How many ticket windows do we need to open at the amusement park?" But it also matters a lot for web servers and for many batch-processing systems.

A system approaching 70% capacity may see dramatic performance degradation as a result of a small load increase. Not all systems behave like this, but many systems do meet the necessary criteria to degrade in this way.
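The simplest textbook model with this shape is a single-server M/M/1 queue, where the mean time a request spends in the system is 1 / (1 - utilization) times the mean service time. The sketch below uses that formula as an assumption. The chart mentioned above may be based on a different model, and real systems with several servers or less random arrivals hit the knee later.

```python
def mm1_time_in_system(utilization):
    """Mean time in system (queueing + service) for an M/M/1 queue,
    expressed in multiples of the mean service time."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)


if __name__ == "__main__":
    for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
        print(f"{rho:.0%} utilized: "
              f"{mm1_time_in_system(rho):.1f}x the service time")
```

In this model, going from 50% to 80% utilization takes the time in system from 2x to 5x the service time, and 90% takes it to 10x, which is why a small load increase near the top can feel like falling off a cliff.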

When designing monitoring systems, I often switch the load graph's color to yellow around 60%, and to red by 80%. If your database's CPU or disks are hitting 70% utilization, your system may be about to fall off a cliff.

Running systems near 100% utilization with acceptable wait times often requires prioritization, fairness constraints, reserving the last bit of capacity for latency-sensitive requests, etc. This can involve years of engineering effort to tune in the hardest cases. This is one reason the cloud is popular: When you hit 70% utilization, or when your wait times start to climb, you can just spin up another server instead of doing heavy engineering to squeeze out the last 15-20% of capacity without degrading latency.

Thanks, I noticed something like this for matrix multiplication (while using the builtin system monitor on my laptop to keep track of CPU usage) but assumed the fact that it couldn't do twice as much as it was doing at ~40% CPU meant I must've been doing something wrong.

Yeah, it's really confusing in some cases.

The appropriate term is Simultaneous multithreading which comes from the ISCA 1995 paper (Simultaneous multithreading: Maximizing on-chip parallelism: D.M. Tullsen; S.J. Eggers; H.M. Levy)

HyperThreading is what Intel calls it.

Thanks, I updated this to mention both SMT and hyperthreading.