Linux Load Avgs

Modified: August 9, 2021
Published: August 9, 2021

Tags:

Ref: https://web.archive.org/web/20210805194024/https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

Excerpt from blog post

Linux load averages are “system load averages” that show the running thread (task) demand on the system as an average number of running plus waiting threads. This measures demand, which can be greater than what the system is currently processing. Most tools show three averages, for 1, 5, and 15 minutes:

Some interpretations:

If the averages are 0.0, then your system is idle.
If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
If they are higher than your CPU count, then you might have a performance problem (it depends).

Making sense of Linux load averages

I grew up with OSes where load averages meant CPU load averages, so the Linux version has always bothered me. Perhaps the real problem all along is that the words “load averages” are about as ambiguous as “I/O”. Which type of I/O? Disk I/O? File system I/O? Network I/O? … Likewise, which load averages? CPU load averages? System load averages? Clarifying it this way lets me make sense of it like this:

On Linux, load averages are (or try to be) "system load averages", for the system as a whole, measuring the number of threads that are working and waiting to work (CPU, disk, uninterruptible locks). Put differently, it measures the number of threads that aren't completely idle. Advantage: includes demand for different resources.
On other OSes, load averages are "CPU load averages", measuring the number of CPU running + CPU runnable threads. Advantage: can be easier to understand and reason about (for CPUs only).

Note that there’s another possible type: “physical resource load averages”, which would include load for physical resources only (CPU + disk).

Perhaps one day we’ll add additional load averages to Linux, and let the user choose what they want to use: a separate “CPU load averages”, “disk load averages”, “network load averages”, etc. Or just use different metrics altogether.

Better Metrics

When Linux load averages increase, you know you have higher demand for resources (CPUs, disks, and some locks), but you aren’t sure which. You can use other metrics for clarification. For example, for CPUs:

per-CPU utilization: eg, using mpstat -P ALL 1
per-process CPU utilization: eg, top, pidstat 1, etc.
per-thread run queue (scheduler) latency: eg, in /proc/PID/schedstats, delaystats, perf sched
CPU run queue latency: eg, in /proc/schedstat, perf sched, my runqlat bcc tool.
CPU run queue length: eg, using vmstat 1 and the 'r' column, or my runqlen bcc tool.

Let me know if you have any questions or comments.
It will help me to improve/learn.