CPU Scheduling Algorithms
Why CPU Scheduling is Necessary
In a multiprogramming system, multiple processes compete for limited CPU resources. The CPU Scheduler is responsible for determining which process to execute next and how long it should run.
A robust scheduling algorithm must balance several competing objectives:
| Metric | Definition | Priority For |
|---|---|---|
| CPU Utilization | The percentage of time the CPU is actively processing. | System Administrators |
| Throughput | The number of processes completed per unit of time. | System Administrators |
| Turnaround Time | Total time from submission to completion. | Batch Processing Users |
| Waiting Time | Total time a process spends waiting in the ready queue. | All Users |
| Response Time | Time from submission until the first output/response. | Interactive Users |
Non-preemptive vs. Preemptive Scheduling
| Type | Description | Pros & Cons |
|---|---|---|
| Non-preemptive | Once a process acquires the CPU, it retains it until termination or a voluntary yield. | Simple to implement with minimal switching overhead, but a long job can monopolize the CPU and delay everything behind it. |
| Preemptive | The scheduler can interrupt the running process at any time to switch to another. | Excellent responsiveness, but context switches cost CPU time and shared data must be protected against races. |
Modern operating systems almost exclusively rely on preemptive scheduling architectures.
Classic Scheduling Algorithms
1. First-Come, First-Served (FCFS)
The most rudimentary algorithm: the process that arrives in the ready queue first is executed first.
Arrival Order: P1(24ms) → P2(3ms) → P3(3ms)
Execution Gantt Chart:
|------------ P1 ------------|-- P2 --|-- P3 --|
0                            24       27       30
Waiting Time: P1=0, P2=24, P3=27
Average Wait: (0+24+27)/3 = 17ms
Architectural Flaw: The Convoy Effect. A long, CPU-bound process at the front of the queue forces every subsequent short process to wait out its entire burst. Observe the difference if the order changes:
Execution Order: P2(3ms) → P3(3ms) → P1(24ms)
Waiting Time: P2=0, P3=3, P1=6
Average Wait: (0+3+6)/3 = 3ms ← Massive reduction!
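The arithmetic above is easy to reproduce. A minimal Python sketch, assuming (as in the example) that all processes arrive at t=0:

```python
def fcfs_wait_times(bursts):
    """Waiting time per process under FCFS; all processes arrive at t=0."""
    waits, elapsed = [], 0
    for burst in bursts:
        waits.append(elapsed)  # each process waits for everything queued ahead of it
        elapsed += burst
    return waits

print(fcfs_wait_times([24, 3, 3]))  # [0, 24, 27] -> average 17ms
print(fcfs_wait_times([3, 3, 24]))  # [0, 3, 6]   -> average 3ms
```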
2. Shortest Job First (SJF)
Prioritizes the process with the shortest estimated CPU burst. When all jobs are available simultaneously and burst lengths are known, SJF provably minimizes the average waiting time.
Processes: P1(6ms) P2(8ms) P3(7ms) P4(3ms)
SJF Order: P4(3) → P1(6) → P3(7) → P2(8)
Waiting: P4=0, P1=3, P3=9, P2=16
Average: (0+3+9+16)/4 = 7ms
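The ordering above can be computed the same way (`sjf_wait_times` is an illustrative helper, not a standard API):

```python
def sjf_wait_times(bursts):
    """Non-preemptive SJF for jobs all available at t=0: run shortest burst first."""
    order = sorted(range(len(bursts)), key=lambda i: bursts[i])
    waits, elapsed = [0] * len(bursts), 0
    for i in order:
        waits[i] = elapsed
        elapsed += bursts[i]
    return waits

# P1=6, P2=8, P3=7, P4=3 -> run order P4, P1, P3, P2
waits = sjf_wait_times([6, 8, 7, 3])
print(waits, sum(waits) / len(waits))  # [3, 16, 9, 0] 7.0
```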
Architectural Flaw:
- Impossible to perfectly predict future CPU burst times.
- Starvation: Long-running processes may never execute if short processes continuously arrive.
3. Round Robin (RR)
Each process is allocated a fixed Time Quantum (Time Slice). If the process exceeds this quantum, it is preempted and relegated to the back of the queue.
Time Quantum = 4ms
Processes: P1(24ms) P2(3ms) P3(3ms)
| P1 | P2 | P3 | P1 | P1 | P1 | P1 | P1 |
0    4    7    10   14   18   22   26   30
P2 and P3 complete within 10ms (no need to wait for P1 to finish its 24ms burst).
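The preemption loop can be modeled with a simple queue; this is a sketch that ignores context-switch cost:

```python
from collections import deque

def round_robin(bursts, quantum):
    """Simulate RR for processes arriving at t=0; returns completion times."""
    remaining = list(bursts)
    finish = [0] * len(bursts)
    queue = deque(range(len(bursts)))
    clock = 0
    while queue:
        i = queue.popleft()
        run = min(quantum, remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] > 0:
            queue.append(i)   # quantum exhausted: preempt, back of the queue
        else:
            finish[i] = clock
    return finish

print(round_robin([24, 3, 3], quantum=4))  # [30, 7, 10]: P2 done at 7, P3 at 10
```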
Tuning the Time Quantum:
- Too Large → Degrades into FCFS; abysmal response times.
- Too Small → Context switching overhead dominates, plummeting CPU throughput.
- Modern OS defaults typically range between 10ms and 100ms.
4. Priority Scheduling
Assigns a strict priority value to each process. The CPU is always allocated to the highest-priority runnable process.
Solving Starvation: Aging. The system dynamically elevates the priority of a process the longer it waits in the queue, ensuring it eventually executes.
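A toy sketch of aging (the selection rule and `aging_step` here are illustrative, not taken from any real kernel):

```python
def pick_next(processes, aging_step=1):
    """Pick the highest-priority process; every process left waiting ages upward."""
    chosen = max(processes, key=lambda p: p["priority"])
    for p in processes:
        if p is not chosen:
            p["priority"] += aging_step  # waiting raises priority over time
    return chosen["name"]

procs = [{"name": "batch", "priority": 1}, {"name": "urgent", "priority": 10}]
history = [pick_next(procs) for _ in range(12)]
print(history)  # "urgent" dominates at first, but "batch" is eventually scheduled
```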
5. Multi-Level Feedback Queue (MLFQ)
Fuses Round Robin with Priority Scheduling. This is the foundational model behind the schedulers of many real operating systems:
High Priority  ┌────────────────────┐  Quantum = 8ms
   Queue 0     │ P_new ──▶ Execute  │  New processes start at the top
               └─────────┬──────────┘
        Exhausts Quantum │ Demoted
               ┌─────────▼──────────┐  Quantum = 16ms
   Queue 1     │      Execute       │
               └─────────┬──────────┘
        Exhausts Quantum │ Demoted
               ┌─────────▼──────────┐  Quantum = 32ms
   Queue 2     │   Execute (FCFS)   │  Lowest tier, batch execution
Low Priority   └────────────────────┘
Core Rules:
- New processes enter the highest priority queue.
- If a process consumes its entire quantum (CPU-bound), it is demoted to the queue below.
- If a process yields the CPU before its quantum expires (I/O-bound), it remains in its current tier.
- Periodically, the OS boosts all processes back to the top tier to prevent starvation and adapt to changing behavior.
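Under simplifying assumptions (three queues with quanta 8/16/32ms, purely CPU-bound jobs that never yield for I/O, no periodic boost), the demotion rules can be sketched as:

```python
from collections import deque

QUANTA = [8, 16, 32]  # per-tier time slices, matching the diagram above

def mlfq(bursts):
    """Finish times for CPU-bound jobs under a simplified 3-level MLFQ."""
    remaining = list(bursts)
    finish = [0] * len(bursts)
    queues = [deque(range(len(bursts))), deque(), deque()]
    clock = 0
    while any(queues):
        level = next(l for l, q in enumerate(queues) if q)  # highest non-empty tier
        i = queues[level].popleft()
        run = min(QUANTA[level], remaining[i])
        clock += run
        remaining[i] -= run
        if remaining[i] == 0:
            finish[i] = clock
        else:  # quantum fully consumed (no I/O in this model): demote one tier
            queues[min(level + 1, 2)].append(i)
    return finish

print(mlfq([50, 6]))  # [56, 14]: the short job finishes long before the CPU hog
```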
6. Completely Fair Scheduler (CFS)
CFS is the default process scheduler in the Linux Kernel (since v2.6.23).
Core Philosophy: Discard fixed time slices. Instead, dynamically ensure every process receives a fair proportion of CPU time based on its weight.
            Red-Black Tree
         (ordered by vruntime)

                ┌───┐
                │ 30│
                └───┘
               ╱     ╲
           ┌───┐     ┌───┐
           │ 15│     │ 45│
           └───┘     └───┘
           ╱         ╱
       ┌───┐     ┌───┐
       │ 10│     │ 35│
       └───┘     └───┘
         ↑
   Smallest vruntime
   = the NEXT process to be scheduled
vruntime (Virtual Runtime): Calculated as Actual Run Time × (Default Weight / Process Weight). The process with the smallest vruntime is always picked next because it has been "unfairly" starved of CPU time relative to others.
| Process | Nice Value | Weight | vruntime after 10ms actual execution |
|---|---|---|---|
| P1 | 0 (Default) | 1024 | 10ms |
| P2 | -5 (High Priority) | 3121 | 10 × (1024/3121) ≈ 3.3ms |
| P3 | +5 (Low Priority) | 335 | 10 × (1024/335) ≈ 30.6ms |
Because high-priority processes accrue vruntime much slower, they continuously gravitate toward the leftmost node of the Red-Black Tree, guaranteeing they are scheduled far more frequently.
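The table's numbers fall out of the formula directly (1024 is the kernel's weight for nice 0; 3121 and 335 are its weights for nice -5 and +5):

```python
NICE_0_WEIGHT = 1024  # kernel weight corresponding to nice 0

def vruntime_delta(actual_ms, weight):
    """Virtual runtime accrued for a given slice of actual CPU time."""
    return actual_ms * NICE_0_WEIGHT / weight

for name, weight in [("P1 nice 0", 1024), ("P2 nice -5", 3121), ("P3 nice +5", 335)]:
    print(f"{name}: +{vruntime_delta(10, weight):.1f}ms vruntime per 10ms of CPU")
```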
System Design Audit & Observability
Understanding scheduler internals is crucial for optimizing latency-sensitive distributed systems.
1. The Container CPU Throttling Trap
In Kubernetes or Docker environments, processes are often throttled even when overall CPU utilization looks low. This occurs because the CFS quota mechanism (`cpu.cfs_quota_us`) strictly enforces the limit within each accounting period, regardless of average utilization.
- Audit Command: Check `cat /sys/fs/cgroup/cpu/cpu.stat`. If `nr_throttled` is rapidly increasing, your application is being hard-paused by the kernel scheduler. Fix this by removing CPU limits (relying only on requests) or increasing the quota.
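A small parsing helper for that audit (the path assumes cgroup v1 as in the command above; under cgroup v2 the file is typically `/sys/fs/cgroup/cpu.stat`):

```python
def read_throttle_stats(path="/sys/fs/cgroup/cpu/cpu.stat"):
    """Parse the key/value pairs in a cgroup cpu.stat file into a dict."""
    stats = {}
    with open(path) as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

# Sample twice a few seconds apart; a growing nr_throttled means the kernel
# is hard-pausing your container:
# before = read_throttle_stats(); time.sleep(10); after = read_throttle_stats()
# print(after["nr_throttled"] - before["nr_throttled"])
```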
2. Priority Inversion Mitigation
When a low-priority task holds a lock required by a high-priority task, the high-priority task blocks. If a medium-priority task then preempts the low-priority holder, the high-priority task is effectively starved despite its elevated priority.
- Audit Protocol: Ensure mutexes support Priority Inheritance (the low-priority task temporarily inherits the priority of the waiting high-priority task until it releases the lock). POSIX threads support this via the `PTHREAD_PRIO_INHERIT` mutex protocol.
3. Tuning the nice Value for Background Workers
Not all processes deserve equal CFS weighting. Heavy data-crunching background workers should not compete with user-facing API threads.
- Audit Command: Use `renice +10 -p <PID>` to voluntarily lower the priority of batch processing scripts. This artificially inflates their `vruntime` growth rate, causing CFS to defer them when API threads need immediate CPU access.
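From inside a batch worker, the same demotion can be done programmatically with `os.nice` (a sketch; note that unprivileged processes may only raise their nice value, never lower it back):

```python
import os

def demote_self(increment=10):
    """Raise this process's nice value, shrinking its CFS weight; returns the new value."""
    return os.nice(increment)

if __name__ == "__main__":
    print("nice is now", demote_self())
    # ... heavy batch workload runs here; CFS now favors interactive threads ...
```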