User Mode and Kernel Mode
Privilege Levels and CPU Protection Rings
Modern CPUs (taking x86 as an example) provide 4 privilege levels, known as Ring 0 through Ring 3. The lower the number, the higher the privilege:
┌───────────────────────┐
│ Ring 0 │ ← Kernel Mode (Highest Privilege)
│ OS Kernel │
├───────────────────────┤
│ Ring 1 / Ring 2 │ ← Rarely Used
│ (Drivers/Services) │
├───────────────────────┤
│ Ring 3 │ ← User Mode (Lowest Privilege)
│ Applications │
└───────────────────────┘
Most operating systems (including Linux) utilize only two levels:
- Ring 0: Kernel Mode, executing OS kernel code.
- Ring 3: User Mode, executing application code.
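Why skip Ring 1 and Ring 2? The x86 ring and segment protection mechanisms are complex, and x86 paging distinguishes only between user and supervisor pages, so most OS architects favor a simplified two-level model that is also easier to port to other architectures. ARMv8-A, by comparison, defines four exception levels (EL0 to EL3): applications run at EL0 and the OS kernel at EL1, while EL2 and EL3 are reserved for a hypervisor and secure monitor firmware.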
The Essential Difference Between User Mode and Kernel Mode
The difference lies not in the code itself, but in the CPU's current privilege level. The exact same machine code running in Ring 0 executes in Kernel Mode, whereas in Ring 3, it executes in User Mode.
| Dimension | User Mode (Ring 3) | Kernel Mode (Ring 0) |
|---|---|---|
| Executable Instructions | Safe instruction subset | All instructions (including privileged) |
| Memory Access | Restricted to own virtual address space | Can access all physical memory and all process spaces |
| Hardware Access | Cannot directly operate I/O ports | Direct control over all hardware |
| Failure Consequence | Process Crash (Segfault) | System Crash (Kernel Panic) |
| Mode Switch Overhead | N/A (Normal execution) | Roughly 500~1500 CPU cycles per switch |
Examples of Privileged Instructions
The following operations can only be executed in Kernel Mode:
- I/O Operations: `in`/`out` instructions, directly reading/writing hardware ports.
- Modifying Page Tables: Manipulating the CR3 register to alter virtual memory mappings.
- Disabling Interrupts: `cli` instruction, preventing the CPU from responding to external interrupts.
- Adjusting Privilege Levels: Modifying the CPL (Current Privilege Level) within the CS register.
If a user-mode program attempts to execute a privileged instruction, the CPU immediately triggers a General Protection Fault (#GP), and the operating system typically responds by terminating the offending process.
The Switching Mechanism: User Mode → Kernel Mode
The essence of a mode switch is elevating the CPU privilege level from Ring 3 to Ring 0. This is triggered in three ways:
1. System Calls (Active Trigger)
When an application requires a kernel service, it actively initiates a system call. This is the most common switching mechanism.
Application (Ring 3) Kernel (Ring 0)
│ │
│ 1. Put syscall number in rax/eax │
│ 2. Put args in rdi, rsi, rdx, etc. │
│ (ebx, ecx, edx for int 0x80) │
│ 3. Execute syscall / int 0x80 │
│ ─────────────────────────────────▶ │
│ │ 4. CPU and kernel entry code:
│ │ - Save user rip, rsp, rflags
│ │ - Switch to kernel stack
│ │ - Jump to syscall entry
│ │ 5. Executes via dispatch table
│ │ 6. Execution completes
│ ◀───────────────────────────────── │
│ 7. sysret / iret returns │
│ Restores user context │
Two System Call Mechanisms on x86-64:
| Mechanism | Instruction | Performance | Description |
|---|---|---|---|
| Legacy | int 0x80 | Slow | Triggered via software interrupt; requires the full interrupt handling pipeline. |
| Fast Path | syscall / sysenter | Fast | Dedicated instructions; bypasses the Interrupt Descriptor Table (IDT) for direct kernel entry. |
Modern Linux utilizes the syscall instruction (64-bit) and sysenter (32-bit) to optimize system call throughput.
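As a rough illustration of the register-based convention described above, the sketch below invokes `getpid` through libc's generic `syscall()` wrapper instead of a named library function. It assumes x86-64 Linux, where `getpid` is system call number 39; the constant and the helper name `raw_getpid` are illustrative, and the function returns `None` on any other platform rather than guessing a number.

```python
import ctypes
import os
import platform
import sys

# Assumption: 39 is the getpid system call number on x86-64 Linux only.
SYS_GETPID_X86_64 = 39


def raw_getpid():
    """Call getpid via libc's syscall() wrapper; None if the platform differs."""
    is_x86_64_linux = (
        sys.platform == "linux"
        and platform.machine() == "x86_64"
        and ctypes.sizeof(ctypes.c_void_p) == 8  # 64-bit process, not 32-bit
    )
    if not is_x86_64_linux:
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # libc loads the number into rax and executes the syscall instruction.
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))


if __name__ == "__main__":
    print("raw syscall:", raw_getpid(), "os.getpid():", os.getpid())
```

On a matching platform both values should be the same process ID, since `os.getpid()` ends up in the same kernel handler.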
2. Exceptions (Passive Trigger — Synchronous)
When an error or special condition occurs during program execution, the CPU automatically triggers an exception:
| Exception Type | Trigger Cause | Kernel Handling Strategy |
|---|---|---|
| Page Fault | Accessing an unmapped virtual page | Allocates a physical page and establishes the mapping. |
| Divide-by-Zero | Executing a division by zero | Sends a SIGFPE signal to the process. |
| Segmentation Fault | Accessing an illegal memory address (a page fault or #GP that the kernel cannot resolve) | Sends a SIGSEGV signal to the process. |
| Breakpoint | Executing a breakpoint instruction (`int 3`) | Notifies the attached debugger. |
The Page Fault is arguably the most "useful" exception: the virtual memory subsystem leverages it to implement Demand Paging. Programs do not load all pages into memory upon startup; each page is brought in by a Page Fault the first time it is accessed.
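Demand paging can be observed from user space. The following is a minimal sketch, assuming a Unix-like system where Python's `resource` module is available: it creates an anonymous mapping, touches one byte per page, and reports how many minor page faults the process accumulated in the meantime. The helper name `faults_while_touching` is illustrative.

```python
import mmap
import resource


def faults_while_touching(num_pages=1024):
    """Return the number of minor page faults seen while first touching pages."""
    page = mmap.PAGESIZE
    # Anonymous mapping: address space is reserved, pages are not yet backed.
    region = mmap.mmap(-1, num_pages * page)
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    for i in range(num_pages):
        region[i * page] = 1  # first write to a page typically faults it in
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region.close()
    return after - before


if __name__ == "__main__":
    print("minor faults while touching 1024 pages:", faults_while_touching())
```

The exact count depends on the kernel and its configuration (huge pages, for example, can satisfy many touches with one fault), so treat the number as an indication rather than a precise measurement.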
3. Hardware Interrupts (Passive Trigger — Asynchronous)
When peripheral devices complete an operation or require CPU attention, they send electrical signals to the CPU via the interrupt controller:
Keyboard Key Pressed ──▶ Interrupt Controller ──▶ CPU Interrupt Pin
│
▼
CPU Suspends Current Task
Saves Context
Switches to Kernel Mode
Executes Interrupt Handler
Restores Context
Resumes User Task
Common hardware interrupts include:
- Clock Interrupts: A timer fires at fixed intervals (e.g., 1ms / 4ms), driving process scheduling.
- I/O Interrupts: Disk read completion, network card packet arrival.
- Keyboard Interrupts: User keystrokes.
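On Linux, per-source interrupt counts are exposed in `/proc/interrupts`. The sketch below sums the per-CPU counters on each row; it assumes the usual layout of that file (a header row naming the CPUs, then one `label: count count ... description` row per source). The helper names are illustrative, and the reader returns `None` when the file does not exist.

```python
from pathlib import Path


def interrupt_totals(text):
    """Sum the per-CPU counters on each row of /proc/interrupts-style text."""
    lines = text.splitlines()
    if not lines:
        return {}
    num_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "label: ..." row
        total = 0
        for field in rest.split()[:num_cpus]:
            if not field.isdigit():
                break  # reached the textual description
            total += int(field)
        totals[label.strip()] = total
    return totals


def read_interrupt_totals(path="/proc/interrupts"):
    """Return totals for the given file, or None if it does not exist."""
    p = Path(path)
    return interrupt_totals(p.read_text()) if p.exists() else None


if __name__ == "__main__":
    print(read_interrupt_totals())
```

Calling `read_interrupt_totals()` twice a second apart and subtracting the results gives an approximate interrupt rate per source.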
Performance Overhead of Mode Switching
Every switch between User Mode ↔ Kernel Mode incurs roughly the following costs (illustrative figures; actual numbers vary by CPU and kernel configuration):
┌──────────────────────────────────────┐
│ 1. Save user registers to k-stack │ ~100 cycles
│ 2. Switch to kernel stack (mod RSP) │ ~10 cycles
│ 3. Execute security checks │ ~50 cycles
│ 4. Execute kernel code │ Varies by operation
│ 5. Restore user registers │ ~100 cycles
│ 6. Switch back to user stack │ ~10 cycles
│ 7. Pipeline / TLB side effects │ ~200 cycles
├──────────────────────────────────────┤
│ Total overhead: ~500 - 1500 cycles │
│ ~0.2 - 0.5 microseconds on a 3GHz CPU│
└──────────────────────────────────────┘
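The conversion from cycles to time, and from per-switch cost to total load, is simple arithmetic. The sketch below works it through using the figures from the box (500 to 1500 cycles, 3 GHz clock); the function names and the example switch rates are illustrative assumptions.

```python
def switch_cost_seconds(cycles, clock_hz=3e9):
    """Time taken by one mode switch of the given cycle count."""
    return cycles / clock_hz


def core_fraction(cycles_per_switch, switches_per_second, clock_hz=3e9):
    """Fraction of one core's cycles consumed by mode switching alone."""
    return cycles_per_switch * switches_per_second / clock_hz


if __name__ == "__main__":
    for cycles in (500, 1500):
        micros = switch_cost_seconds(cycles) * 1e6
        print(f"{cycles} cycles = {micros:.2f} microseconds at 3 GHz")
    for rate in (10_000, 1_000_000):
        share = core_fraction(1500, rate) * 100
        print(f"{rate} switches/s at 1500 cycles = {share:.1f}% of one core")
```

At 1500 cycles per switch, ten thousand switches per second cost about 0.5% of a core, while one million per second cost about half of it, which is why the switch rate matters far more than the cost of any single switch.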
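While a single switch is tiny, the cost adds up under high-frequency workloads (e.g., hundreds of thousands of network I/O operations per second or more), where mode-switch overhead can become a significant bottleneck. This is one reason Linux introduced mechanisms like epoll and io_uring: epoll reports many ready file descriptors in a single call, and io_uring lets applications submit and reap batches of I/O requests through rings shared with the kernel. Both reduce the number of system calls needed per unit of work.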
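The underlying idea, doing more work per kernel entry, can be shown without epoll or io_uring. The sketch below is not either of those APIs; it only compares writing many small chunks with one `write` call each against joining them and writing once, and counts the system calls issued. The helper names are illustrative.

```python
import os
import tempfile


def write_chunks(fd, chunks, batch):
    """Write chunks to fd; return the number of write system calls issued."""
    payloads = [b"".join(chunks)] if batch else chunks
    calls = 0
    for data in payloads:
        view = memoryview(data)
        while view:  # loop in case the kernel accepts only part of the data
            written = os.write(fd, view)
            calls += 1
            view = view[written:]
    return calls


def compare(num_chunks=1000):
    """Return {False: unbatched call count, True: batched call count}."""
    chunks = [b"x" * 16] * num_chunks
    results = {}
    for batch in (False, True):
        with tempfile.TemporaryFile() as f:
            results[batch] = write_chunks(f.fileno(), chunks, batch)
    return results


if __name__ == "__main__":
    counts = compare()
    print("unbatched write calls:", counts[False])
    print("batched write calls:  ", counts[True])
```

The same bytes reach the file in both cases; only the number of user/kernel transitions differs.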
vDSO: "System Calls" Without the Kernel Switch
Linux provides the vDSO (virtual Dynamic Shared Object) mechanism, which maps a small kernel-provided shared library, together with read-only kernel data, into every process's address space. This allows certain "system calls" to be resolved entirely in user mode, avoiding the mode switch:
| Syscall | Traditional Execution | vDSO Execution |
|---|---|---|
| gettimeofday() | Enters kernel to read clock | Reads user-mapped page directly |
| clock_gettime() | Enters kernel to read clock | Reads user-mapped page directly |
| getcpu() | Enters kernel to query CPU ID | Obtains the CPU ID in user space without entering the kernel |
The defining characteristic of these calls is that they only read data and never mutate system state, making them safe to execute in user mode.
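The vDSO shows up as an ordinary mapping in a process's address space. As a small sketch, assuming Linux's `/proc/<pid>/maps` format in which the mapping is labelled `[vdso]`, the following lists the address ranges of that mapping for the current process. The helper names are illustrative, and the reader returns `None` when `/proc/self/maps` is absent.

```python
from pathlib import Path


def vdso_ranges(maps_text):
    """Return the address ranges of rows labelled [vdso] in maps-style text."""
    ranges = []
    for line in maps_text.splitlines():
        fields = line.split()
        if fields and fields[-1] == "[vdso]":
            ranges.append(fields[0])  # first column is start-end
    return ranges


def own_vdso_ranges():
    """Return this process's vDSO ranges, or None if /proc is unavailable."""
    maps = Path("/proc/self/maps")
    if not maps.exists():
        return None
    return vdso_ranges(maps.read_text())


if __name__ == "__main__":
    print("vDSO mapping:", own_vdso_ranges())
```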
System Design Audit & Observability
When engineering high-performance systems, understanding the user/kernel boundary is non-negotiable.
1. The Necessity of Isolation
Why enforce this boundary? Blast radius containment. If all code executed in Ring 0, a single rogue pointer in a logging library could overwrite disk sectors or induce a kernel panic. Privilege separation ensures that a userspace failure (Segfault) remains localized, allowing the OS to cleanly terminate the faulty process without compromising system integrity.
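This containment is easy to demonstrate. In the sketch below, a child Python process deliberately reads from address 0 through `ctypes`; the kernel terminates the child, and the parent simply observes the exit status and carries on. The helper name `run_crashing_child` is illustrative. On POSIX systems `subprocess` reports death by signal as a negative return code, so a segmentation fault typically appears as `-signal.SIGSEGV`.

```python
import signal
import subprocess
import sys

# The child dereferences a NULL pointer on purpose.
CRASHER = "import ctypes; ctypes.string_at(0)"


def run_crashing_child():
    """Run a child that performs an invalid memory access; return its status."""
    result = subprocess.run(
        [sys.executable, "-c", CRASHER],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    status = run_crashing_child()
    if status < 0:
        print("child killed by", signal.Signals(-status).name)
    else:
        print("child exited with status", status)
    print("parent is still running")
```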
2. Tracing System Call Overhead
To observe the impact of mode switching in production:
- Use `strace -c -p <pid>` to profile the distribution and frequency of system calls emitted by a process.
- Use `perf stat` to monitor CPU cycles spent in `sys` vs `user` space. A disproportionately high `sys` time often indicates inefficient I/O batching or excessive context switching.
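A process can also read its own user/system CPU split without external tools. The sketch below is a rough in-process complement to the tools above, assuming a Unix-like system with Python's `resource` module: it runs one workload dominated by system calls and one that is pure computation, then reports the user and system CPU seconds each consumed. The workload functions and their iteration counts are illustrative.

```python
import os
import resource


def cpu_split(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    work()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def syscall_heavy(n=200_000):
    """Issue n one-byte writes to the null device: one kernel entry each."""
    fd = os.open(os.devnull, os.O_WRONLY)
    try:
        for _ in range(n):
            os.write(fd, b"x")
    finally:
        os.close(fd)


def compute_heavy(n=2_000_000):
    """Pure user-mode arithmetic with no system calls in the loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    for name, work in (("syscall-heavy", syscall_heavy),
                       ("compute-heavy", compute_heavy)):
        user, system = cpu_split(work)
        print(f"{name}: user={user:.3f}s sys={system:.3f}s")
```

The syscall-heavy loop should show a noticeably larger system share than the compute-heavy one, although interpreter overhead in Python inflates the user time of both.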
3. The Lifecycle of a syscall
The execution pipeline is rigid and deterministic:
- The application loads the syscall ID and arguments into designated registers.
- Execution of `syscall` (x86-64) triggers a hardware privilege escalation.
- The CPU saves the user return address and flags (in `rcx` and `r11`); the kernel entry code then pivots to the kernel stack and saves the remaining user context.
- The kernel vectors through the syscall table to the designated handler.
- Upon completion, the result is written to `rax`.
- The kernel restores the user context, and the `sysret` instruction demotes privileges and returns control to user space.