I/O Models
The Two Phases of an I/O Operation
A standard network I/O operation (e.g., calling read()) inherently involves two distinct phases:
Application Space Kernel Space
│ │
│ read() │
│ ──────────────────────────▶│
│ │ Phase 1: Waiting for Data
│ │ (Waiting for the NIC to receive
│ │ packets and DMA them to kernel buffer)
│ │
│ │ Phase 2: Copying Data
│ │ (Copying data from the kernel buffer
│ Return Data │ into the application's user-space buffer)
│ ◀──────────────────────────│
The fundamental difference between various I/O models is defined by how they handle blocking during these two phases.
The Five I/O Models
1. Blocking I/O (BIO)
User Process Kernel
│ │
│ read() │ ┐
│ ────────────────────────▶│ │
│ Blocked... │ Wait Data │ Both phases block
│ Blocked... │ │ the calling thread.
│ Blocked... │ Copy Data │
│ ◀────────────────────────│ ┘
│ Process Data │
The thread invoking read() is completely suspended by the OS until the data is fully copied into user space.
Architectural Implication: Easiest to program, but catastrophically inefficient at scale. A single thread can only handle one connection at a time. To handle 10,000 connections, you need 10,000 threads, whose context switches and per-thread stacks overwhelm the CPU and memory.
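A minimal sketch of the pattern in C (conn_fd is assumed to be an already-accepted socket; error handling is elided):

#include <unistd.h>

ssize_t handle_connection(int conn_fd) {
    char buf[4096];
    /* The thread sleeps here through BOTH phases: waiting for data
     * to arrive (Phase 1) AND the kernel-to-user copy (Phase 2). */
    ssize_t n = read(conn_fd, buf, sizeof(buf));
    if (n > 0) {
        /* process buf[0..n) */
    }
    return n;
}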
2. Non-blocking I/O (NIO)
User Process Kernel
│ read() │
│ ────────────────────────▶│ Data not ready
│ ◀──── EAGAIN ────────────│ Return immediately
│ read() │
│ ────────────────────────▶│ Data not ready
│ ◀──── EAGAIN ────────────│ Return immediately
│ ... Polling Loop ... │
│ read() │
│ ────────────────────────▶│ Data is ready!
│ Blocked... │ Copy Data (Phase 2 STILL blocks)
│ ◀────────────────────────│
│ Process Data │
The read() system call returns immediately with -1 and errno set to EAGAIN (or EWOULDBLOCK) if data is not ready. The application must continuously poll.
Architectural Implication: Prevents thread suspension during Phase 1, but burns CPU cycles spinning in a polling loop that mostly accomplishes nothing.
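A sketch of that polling pattern (conn_fd is again a hypothetical connected socket; error handling is abbreviated):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

void poll_read(int conn_fd) {
    char buf[4096];

    /* Switch the fd into non-blocking mode. */
    int flags = fcntl(conn_fd, F_GETFL, 0);
    fcntl(conn_fd, F_SETFL, flags | O_NONBLOCK);

    for (;;) {
        ssize_t n = read(conn_fd, buf, sizeof(buf));
        if (n >= 0)
            break;                  /* data copied (or peer closed): done */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            break;                  /* real error */
        /* EAGAIN: Phase 1 not finished; spin and retry,
         * burning CPU the whole time. */
    }
}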
3. I/O Multiplexing
The core technology behind modern high-concurrency servers. A single thread monitors multiple file descriptors (fds) simultaneously. It blocks only until at least one fd becomes readable or writable.
User Process Kernel
│ │
│ select / poll / │
│ epoll_wait │
│ ────────────────────────▶│ Simultaneously monitor fd1, fd2, fd3...
│ Blocked waiting... │
│ │ fd2 is ready!
│ ◀────────────────────────│ Return list of ready fds
│ │
│ read(fd2) │
│ ────────────────────────▶│ Copy Data
│ ◀────────────────────────│
│ Process Data │
4. Signal-Driven I/O
The process registers a signal handler and tells the kernel, "Send me a SIGIO signal when this fd is ready." It doesn't block during Phase 1. Rarely used in modern high-performance backends due to the complexity of signal handling under extreme load.
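For reference, a rough sketch of the setup: F_SETOWN plus O_ASYNC is the standard fcntl incantation for SIGIO delivery (sock_fd is a hypothetical socket; the handler body is omitted):

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

void sigio_handler(int sig) { /* fd is ready; schedule a read() */ }

void enable_sigio(int sock_fd) {
    signal(SIGIO, sigio_handler);             /* register the handler */
    fcntl(sock_fd, F_SETOWN, getpid());       /* route SIGIO to this process */
    int flags = fcntl(sock_fd, F_GETFL, 0);
    fcntl(sock_fd, F_SETFL, flags | O_ASYNC); /* enable signal-driven I/O */
}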
5. Asynchronous I/O (AIO)
User Process Kernel
│ aio_read() │
│ ────────────────────────▶│ Return immediately (Non-blocking)
│ Do other work... │
│ │ Wait Data
│ │ Copy Data to User Space
│ ◀── Signal/Callback ─────│ Notify ONLY when fully complete
│ Process Data │
True Asynchrony: Neither Phase 1 nor Phase 2 blocks the calling thread. The OS copies the data into the user buffer in the background and notifies the application when the data is ready to be processed. Linux's io_uring (introduced in kernel 5.1) is the modern standard for ultra-high-performance AIO.
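As a taste of the API, a minimal single-read sketch using the liburing helper library (assumes kernel 5.1+, linking with -luring, and an already-open fd):

#include <liburing.h>

int async_read(int fd, char *buf, unsigned len) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);          /* 8-entry submission/completion queues */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);  /* queue the read */
    io_uring_submit(&ring);

    /* Do other work here; the kernel waits for the data AND copies it. */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);            /* completion: both phases done */
    int res = cqe->res;                        /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return res;
}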
The Evolution of I/O Multiplexing: select vs. poll vs. epoll
These are the three I/O multiplexing interfaces provided by Linux. They represent a generational evolution in performance.
1. select
int select(int nfds, fd_set *readfds, fd_set *writefds,
fd_set *exceptfds, struct timeval *timeout);
- The Problem: The user must copy the entire set of monitored fds from user space to kernel space on every single call.
- The Bottleneck: The kernel must perform a linear scan ($O(N)$) across all fds to determine which ones have data.
- The Limit: fd_set is a fixed-size bitmap capped at FD_SETSIZE (1024 on Linux), so select cannot monitor file descriptors numbered 1024 or above.
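A sketch of a typical select loop makes the first two problems visible: the set is rebuilt (and copied into the kernel) on every iteration, and readiness is discovered by scanning every fd (fds is a hypothetical array of monitored sockets):

#include <sys/select.h>

void select_loop(int *fds, int nfds) {
    for (;;) {
        fd_set readfds;
        int maxfd = -1;
        FD_ZERO(&readfds);
        for (int i = 0; i < nfds; i++) {       /* rebuild the whole set every call */
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd) maxfd = fds[i];
        }
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0)
            break;
        for (int i = 0; i < nfds; i++)         /* O(N) scan for readiness */
            if (FD_ISSET(fds[i], &readfds)) {
                /* read(fds[i], ...) */
            }
    }
}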
2. poll
Virtually identical to select, but utilizes an array of pollfd structures instead of fixed-size bitmaps.
- Improvement: Removes the 1024 fd limit.
- The Bottleneck: Still suffers from the catastrophic $O(N)$ linear scan and the constant user-to-kernel memory copying overhead.
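The equivalent poll loop, sketched below, removes the FD_SETSIZE cap but keeps both the per-call copy and the linear scan:

#include <poll.h>

void poll_loop(struct pollfd *pfds, nfds_t nfds) {
    for (nfds_t i = 0; i < nfds; i++)
        pfds[i].events = POLLIN;               /* interest set: readable */
    for (;;) {
        if (poll(pfds, nfds, -1) <= 0)         /* -1: block indefinitely */
            break;
        for (nfds_t i = 0; i < nfds; i++)      /* still an O(N) scan */
            if (pfds[i].revents & POLLIN) {
                /* read(pfds[i].fd, ...) */
            }
    }
}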
3. epoll
The undisputed king of Linux networking.
int epoll_create(int size);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
The Three Architectural Breakthroughs of epoll:
| Aspect | select / poll | epoll |
|---|---|---|
| FD Registration | Entire set copied from user space to kernel space on every call. | epoll_ctl registers an fd once. The kernel maintains registered fds persistently in a Red-Black Tree. |
| Readiness Detection | Linear scan of all fds: $O(N)$. | Event-driven callbacks. When the NIC receives data, the kernel's interrupt path invokes a callback that appends the fd to a ready list: $O(1)$. |
| Result Retrieval | Application must iterate over all fds to find the ready ones. | epoll_wait directly returns an array containing only the ready fds. |
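A skeleton level-triggered epoll loop ties the three breakthroughs together (listen_fd is a hypothetical listening socket; accept and error handling are abbreviated):

#include <sys/epoll.h>
#include <sys/socket.h>

void epoll_loop(int listen_fd) {
    int epfd = epoll_create1(0);               /* modern variant of epoll_create */

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* register ONCE */

    struct epoll_event events[64];
    for (;;) {
        /* Blocks until at least one fd is ready; returns ONLY the ready fds. */
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {             /* new connection */
                int conn = accept(listen_fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                /* read(fd, ...) and process */
            }
        }
    }
}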
The Two Triggers of epoll
| Mode | Behavior | Use Case |
|---|---|---|
| Level Triggered (LT) | Notifies the application continuously as long as there is unread data in the buffer. | The default. Safer and easier to program (similar to select semantics). |
| Edge Triggered (ET) | Notifies the application strictly once, when the fd transitions from unreadable to readable. | Extreme performance. Requires strict Non-Blocking I/O (EAGAIN handling) and a rigorous loop to drain the buffer entirely (see the sketch below this table); otherwise data will be stranded indefinitely. |
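The rigorous loop for ET mode looks like this sketch: read until the kernel reports EAGAIN, assuming the fd was registered with EPOLLIN | EPOLLET and set to O_NONBLOCK:

#include <errno.h>
#include <unistd.h>

void drain(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            /* process buf[0..n) and keep reading */
        } else if (n == 0) {
            break;  /* peer closed the connection */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            break;  /* buffer fully drained: safe to return to epoll_wait */
        } else {
            break;  /* real error */
        }
    }
}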
System Design Audit & Observability
Mastering I/O models is the prerequisite for debugging high-concurrency network services like Nginx, Redis, or Node.js.
1. The "C10K" Problem and Thread Starvation
If a monolithic Tomcat/Spring Boot server (using the default thread-per-connection blocking model) attempts to hold 10,000 concurrent WebSocket connections, the JVM will grind to a halt, or fail to create new native threads, under the combined weight of thread stacks and context switching, even if the connections are mostly idle.
- Audit Protocol: For persistent, highly concurrent connections (WebSockets, SSE, IoT telemetry), you must verify the architecture utilizes an Event Loop built on epoll (e.g., Netty, Node.js, or Go's netpoller). If a thread pool is configured to max-threads=10000, the architecture is fundamentally flawed.
2. The Edge-Triggered Data Loss Trap
If an engineer configures epoll to use Edge Triggered (ET) mode but fails to read the buffer exhaustively until EAGAIN is returned, the remaining data will be silently stranded. The socket will hang indefinitely because the kernel will not generate another event until new data arrives.
- Audit Protocol: Review the network library's read loop. If ET is enabled, the code MUST use O_NONBLOCK file descriptors and MUST loop read() until it hits an EWOULDBLOCK or EAGAIN error. If this loop is missing, the service will intermittently stall connections under heavy load.
3. Monitoring File Descriptor Exhaustion
Since epoll allows a single process to handle millions of connections, the bottleneck shifts from CPU threads to the OS file descriptor limits.
- Audit Command: Run cat /proc/sys/fs/file-nr to see total allocated FDs system-wide, and ulimit -n to check the per-process limit. If a high-performance proxy (like Envoy or Nginx) drops connections with "Too many open files", you must raise the nofile limits in /etc/security/limits.conf and the systemd unit file (LimitNOFILE=1048576).
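If the service owns its own bootstrap, a small startup check (a hedged sketch; getrlimit is POSIX) makes the effective limit visible in logs before exhaustion bites:

#include <stdio.h>
#include <sys/resource.h>

void log_fd_limit(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)    /* per-process nofile limit */
        fprintf(stderr, "nofile: soft=%llu hard=%llu\n",
                (unsigned long long)rl.rlim_cur,
                (unsigned long long)rl.rlim_max);
}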