Building an NDK Video Player from Scratch: Pipelines, Threads, and State Machines First
Do not rush into writing AMediaCodec code. The very first step to building a truly stable video player is not decoding; it is defining the architecture: "Who is responsible for what, when can they do it, and how does the system return to a safe state after an error."
If a video player is a micro-factory, the video file is the raw material, the demuxer is the unpacking worker, the decoder is the processing machine, the Surface is the display window, and the State Machine is the factory manager. Without the manager, workers operate independently. The machines might spin occasionally, but the instant the user pauses, seeks, or backgrounds the app, the entire assembly line collapses.
Establish Intuition: What Exactly Does a Player Do?
A minimal video player fundamentally executes four operations:
Video File
-> Demux: Extract compressed video and audio frames from the container.
-> Decode: Decompress frames into displayable image buffers or playable PCM audio.
-> Sync: Determine the exact microsecond a visual frame should be displayed and audio should be played.
-> Output: Submit video to the Surface, submit audio to AudioTrack/AAudio.
Let us define the core terminology:
Container: The outer shell of the file. MP4, MKV, and MOV are container formats. Think of them as a folder that simultaneously houses a video track, an audio track, a subtitle track, cover art, and metadata.
Track: An independent data stream within the container. An MP4 typically contains one video track and one audio track.
Compressed Frame (Packet): A frame that has not yet been decoded. Codecs like H.264, H.265, and AAC compress raw images or sound to save space; the player must decode them first.
PTS (Presentation Time Stamp): The position on the media timeline at which a frame should be presented, that is, shown on screen or, for audio, played. A player does not "display a frame as soon as it reads it"; it "reads a frame and holds it until its designated presentation time arrives."
Why You Cannot Just Write a while(true) Decoding Loop
When learning NDK player development, it is dangerously easy to write this anti-pattern:
while (true) {
readPacket();
decodePacket();
renderFrame();
}
This code might validate your API calls, but it is not a player architecture. The reason is brutal: Real users do not just press play and watch a video from start to finish.
Users execute these chaotic actions:
Open file
Pause
Resume
Scrub the seek bar
Rotate screen (Portrait/Landscape)
Push app to background
Return to foreground
Close the UI
Rapid-fire repetitive clicks
When these actions occur, the demux thread, the decoder, the Surface, and the audio output could all be trapped in different, incompatible states. Without a robust state machine, you are forced to rely on a brittle web of if statements. This inevitably results in black screens, old frames flashing, thread deadlocks, and fatal native crashes.
The Four Architectural Planes of a Player
1. The Control Plane
The Control Plane strictly handles user commands and lifecycles. It does not read files, and it does not decode data.
open / play / pause / seek / stop / release
The absolute core of the Control Plane is the State Machine. It evaluates whether a current command is structurally legal, and routes legitimate commands to the correct worker threads.
2. The Data Plane
The Data Plane executes the heavy lifting of the media streams.
DemuxEngine -> PacketQueue -> DecodeEngine -> FrameQueue -> RenderEngine
DemuxEngine: Solely responsible for unpacking the container.
DecodeEngine: Solely responsible for feeding the codec and retrieving output buffers.
RenderEngine: Solely responsible for submitting frames to the Surface synchronized against the clock.
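The PacketQueue and FrameQueue between these engines can be sketched as one bounded, blocking queue. This is a minimal illustration under stated assumptions, not the series' actual implementation: the name BoundedQueue and the flush() and abort() operations are hypothetical. A full queue blocks the producer, which gives backpressure; flush() is what a seek would call; abort() is what a release would call.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical bounded queue sitting between two pipeline stages.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full. Returns false once the queue is aborted.
    bool push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [&] { return items_.size() < capacity_ || aborted_; });
        if (aborted_) return false;
        items_.push_back(std::move(item));
        notEmpty_.notify_one();
        return true;
    }

    // Blocks while the queue is empty. Returns false once the queue is aborted.
    bool pop(T* out) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [&] { return !items_.empty() || aborted_; });
        if (aborted_) return false;
        *out = std::move(items_.front());
        items_.pop_front();
        notFull_.notify_one();
        return true;
    }

    // Drops everything that is queued (for example on seek) and wakes producers.
    void flush() {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.clear();
        notFull_.notify_all();
    }

    // Wakes every blocked thread so workers can exit (for example on release).
    void abort() {
        std::lock_guard<std::mutex> lock(mutex_);
        aborted_ = true;
        notFull_.notify_all();
        notEmpty_.notify_all();
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
    std::deque<T> items_;
    bool aborted_ = false;
};
```

The capacity is the tuning knob: a small value keeps memory and seek latency low, while a larger one absorbs bursts from the demuxer. The size() accessor can feed the packet_queue_size and frame_queue_size metrics.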
3. The Clock Plane
The Clock Plane governs "when to play." Video frames and audio frames cannot run free; they must synchronize against a Master Clock.
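For this series, we adopt the most common convention: the Audio Clock operates as the Master Clock, and Video tracks it. Audio is the usual choice of master because listeners notice an audio glitch far more readily than a dropped or delayed video frame.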
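To make "Video tracks the Audio Clock" concrete, here is a small sketch of the decision the render side could make for each frame. The function name decideSync, the sign convention (drift is video PTS minus audio clock), and both thresholds are illustrative assumptions, not values from this series.

```cpp
#include <cstdint>

enum class SyncAction { Wait, Render, Drop };

struct SyncDecision {
    SyncAction action;
    int64_t waitUs;  // only meaningful when action == Wait
};

// Illustrative thresholds (assumptions); tune them against real measurements.
constexpr int64_t kEarlyToleranceUs = 10000;   // render if due within 10 ms
constexpr int64_t kDropThresholdUs = 100000;   // drop if more than 100 ms late

// Compares a video frame's PTS with the audio master clock.
SyncDecision decideSync(int64_t videoPtsUs, int64_t audioClockUs) {
    const int64_t driftUs = videoPtsUs - audioClockUs;  // > 0: frame is early
    if (driftUs > kEarlyToleranceUs) {
        return {SyncAction::Wait, driftUs};  // hold the frame until it is due
    }
    if (driftUs < -kDropThresholdUs) {
        return {SyncAction::Drop, 0};  // too late to be worth showing
    }
    return {SyncAction::Render, 0};
}
```

A frame that is slightly late is still rendered; only frames beyond the drop threshold are skipped, which lets video catch up to audio without visible stutter on every small delay.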
4. The Observability Plane
The Observability Plane manages logging, telemetry, and debugging metrics. Diagnosing player anomalies depends on recorded evidence; a lone "decode failed" log line, with no state or timing context, tells you almost nothing.
You must minimally track these metrics:
state
command_serial
thread_name
packet_queue_size
frame_queue_size
video_pts_us
audio_clock_us
av_drift_us
first_frame_ms
seek_cost_ms
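One way to keep these metrics together is to capture them in a single snapshot and emit them as one key=value log line. The sketch below is an assumption about how that could look: PlayerSnapshot and formatSnapshot are hypothetical names, and av_drift_us is derived here as video PTS minus audio clock.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical snapshot of the metrics listed above.
struct PlayerSnapshot {
    std::string state;
    int64_t commandSerial;
    std::string threadName;
    std::size_t packetQueueSize;
    std::size_t frameQueueSize;
    int64_t videoPtsUs;
    int64_t audioClockUs;
    int64_t firstFrameMs;
    int64_t seekCostMs;
};

// Formats one grep-friendly line; av_drift_us is computed, not stored.
std::string formatSnapshot(const PlayerSnapshot& s) {
    const int64_t avDriftUs = s.videoPtsUs - s.audioClockUs;
    char line[512];
    std::snprintf(line, sizeof(line),
                  "state=%s command_serial=%lld thread_name=%s "
                  "packet_queue_size=%zu frame_queue_size=%zu "
                  "video_pts_us=%lld audio_clock_us=%lld av_drift_us=%lld "
                  "first_frame_ms=%lld seek_cost_ms=%lld",
                  s.state.c_str(), static_cast<long long>(s.commandSerial),
                  s.threadName.c_str(), s.packetQueueSize, s.frameQueueSize,
                  static_cast<long long>(s.videoPtsUs),
                  static_cast<long long>(s.audioClockUs),
                  static_cast<long long>(avDriftUs),
                  static_cast<long long>(s.firstFrameMs),
                  static_cast<long long>(s.seekCostMs));
    return std::string(line);
}
```

On a device, the resulting string could be passed to __android_log_print from <android/log.h>; keeping the formatting separate from the logging call also makes it testable on a desktop.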
The Minimal State Machine
Begin with a highly conservative, deterministic state machine.
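Idle      --Open-->        Preparing
Preparing --Prepared-->    Ready
Ready     --Play-->        Playing
Playing   --Pause-->       Paused
Paused    --Play-->        Playing
Playing   --Seek-->        Seeking
Paused    --Seek-->        Seeking
Seeking   --Prepared-->    Ready
Playing   --EndOfStream--> Ended
Any state --Fail-->        Error
Any state --Release-->     Releasing --> Released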
Translate this directly into C++ enumerations:
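#include <cstdint>

enum class PlayerState {
    Idle,
    Preparing,
    Ready,
    Playing,
    Paused,
    Buffering,  // not part of the minimal diagram above
    Seeking,
    Ended,
    Error,
    Releasing,
    Released,
};

// Events consumed by the control thread: user commands and internal notifications.
enum class PlayerCommandType {
    Open,
    Prepared,     // preparation or a seek has finished
    Play,
    Pause,
    Seek,
    EndOfStream,
    Fail,         // moves the player to Error
    Release,
};

struct PlayerCommand {
    PlayerCommandType type;
    int64_t argumentUs;  // e.g. the seek target in microseconds
    int64_t serial;      // monotonic command sequence ID
};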
The serial field is a monotonically increasing command sequence ID. It acts like a deli ticket: any work still tagged with an older serial can be recognized as stale and discarded, so a delayed, older seek command cannot accidentally override a newer one.
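A minimal sketch of how a serial can be used to reject stale work follows. SerialGate, advance(), and isStale() are hypothetical names; the assumption is that the control thread bumps the serial whenever it accepts a seek, and that workers tag packets and frames with the serial that was current when they were produced.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper shared between the control thread and the workers.
class SerialGate {
public:
    // Called by the control thread when it accepts a new seek.
    // Returns the serial that all subsequent work must carry.
    int64_t advance() { return latest_.fetch_add(1) + 1; }

    // Called by a worker before acting on data tagged with `serial`.
    // Anything older than the latest accepted serial belongs to a superseded seek.
    bool isStale(int64_t serial) const { return serial < latest_.load(); }

    int64_t latest() const { return latest_.load(); }

private:
    std::atomic<int64_t> latest_{0};
};
```

With this in place, a frame decoded for an old seek position can be dropped at render time instead of flashing on screen after the new seek has been accepted.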
The Command Queue: Serializing UI Chaos
Never allow the Kotlin layer to invoke nativePlay() or nativePause() and instantly mutate native state. The architecturally sound approach is to push commands into a native Command Queue, allowing a dedicated Control Thread to consume them sequentially.
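#include <condition_variable>
#include <mutex>
#include <queue>

class CommandQueue {
public:
    // Called from any thread (for example a JNI entry point).
    void push(PlayerCommand command) {
        std::lock_guard<std::mutex> lock(mutex_);
        commands_.push(command);
        cv_.notify_one();
    }

    // Blocks until a command is available or stop() was called.
    // Commands that are still queued are drained before pop() returns false.
    bool pop(PlayerCommand* out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !commands_.empty() || stopped_; });
        if (commands_.empty()) return false;
        *out = commands_.front();
        commands_.pop();
        return true;
    }

    // Wakes the control thread so it can exit its loop.
    void stop() {
        std::lock_guard<std::mutex> lock(mutex_);
        stopped_ = true;
        cv_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<PlayerCommand> commands_;
    bool stopped_ = false;
};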
The brilliance of this code is not the queue itself, but how it transforms "asynchronous, chaotic UI clicks from arbitrary threads" into "ordered, sequential events on the Control Thread." Player stability fundamentally originates here.
The Control Thread: Mutating State, Not Data
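class PlayerController {
public:
    // Runs on the control thread until the queue is stopped or the player is released.
    void loop() {
        PlayerCommand command{};
        while (queue_.pop(&command)) {
            handle(command);
            if (state_ == PlayerState::Released) break;
        }
    }

private:
    // The enterX() helpers are assumed to exist and are not shown; each one
    // updates state_ and signals the workers. Illegal commands are dropped.
    void handle(const PlayerCommand& command) {
        switch (command.type) {
            case PlayerCommandType::Open:
                if (state_ == PlayerState::Idle) enterPreparing(command);
                break;
            case PlayerCommandType::Prepared:
                if (state_ == PlayerState::Preparing || state_ == PlayerState::Seeking) enterReady(command);
                break;
            case PlayerCommandType::Play:
                if (state_ == PlayerState::Ready || state_ == PlayerState::Paused) enterPlaying(command);
                break;
            case PlayerCommandType::Pause:
                if (state_ == PlayerState::Playing) enterPaused(command);
                break;
            case PlayerCommandType::Seek:
                if (state_ == PlayerState::Playing || state_ == PlayerState::Paused) enterSeeking(command);
                break;
            case PlayerCommandType::EndOfStream:
                if (state_ == PlayerState::Playing) enterEnded(command);
                break;
            case PlayerCommandType::Fail:
                if (state_ != PlayerState::Releasing && state_ != PlayerState::Released) enterError(command);
                break;
            case PlayerCommandType::Release:
                enterReleasing(command);
                break;
        }
    }

    PlayerState state_ = PlayerState::Idle;
    CommandQueue queue_;
};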
Notice the explicit absence of AMediaExtractor_readSampleData or AMediaCodec_dequeueInputBuffer. The Control Thread exclusively orchestrates; heavy lifting is delegated to Worker Threads.
Worker Thread Boundaries
A minimum of three dedicated Worker Threads is required:
demuxThread: Ingests the container, yields Compressed Packets.
decodeThread: Consumes Compressed Packets, yields Raw Image Frames.
renderThread: Submits Raw Image Frames to the Surface synchronized to the clock.
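Each of these threads can share the same skeleton: a loop that repeats one unit of work until it is asked to stop, plus a join so that teardown is deterministic. The sketch below is an assumption about one reasonable shape for that skeleton; Worker and its step callback are hypothetical names.

```cpp
#include <atomic>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical base for demuxThread, decodeThread and renderThread.
class Worker {
public:
    // `step` performs one unit of work and returns false when it is finished.
    explicit Worker(std::function<bool()> step)
        : step_(std::move(step)),
          stopRequested_(false),
          thread_([this] { run(); }) {}

    ~Worker() { stop(); }

    Worker(const Worker&) = delete;
    Worker& operator=(const Worker&) = delete;

    // Requests exit and waits for the thread. Intended to be called from the
    // control thread only; calling it twice is harmless.
    void stop() {
        stopRequested_.store(true);
        if (thread_.joinable()) thread_.join();
    }

private:
    void run() {
        while (!stopRequested_.load()) {
            if (!step_()) break;
        }
    }

    std::function<bool()> step_;
    std::atomic<bool> stopRequested_;
    std::thread thread_;  // declared last so it starts after the other members exist
};
```

If a step can block on a queue, that queue must be woken before stop() is called; otherwise the join would wait forever on a thread that never rechecks the flag.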
For novices, audio implementation can initially remain at the Kotlin AudioTrack layer. Once the primary video pipeline is stable, audio can be migrated down to C++ via AAudio or Oboe. This isolates complexity and accelerates debugging.
The Highest-Risk Lifecycle Threat: The Surface
Surface is a display target managed by the Android Graphics System (SurfaceFlinger); it is not a standard C++ object. When an Activity goes to the background, rotates, or dies, the Surface may be destroyed long before your native player is torn down.
Therefore, the Control Layer must treat Surface lifecycle events as standard Commands:
surfaceCreated -> AttachSurface
surfaceDestroyed -> DetachSurface
activityDestroy -> Release
Never tear down the codec (for example by calling AMediaCodec_delete) directly inside the surfaceDestroyed callback, and never allow the Render Thread to secretly hold a pointer to a dead window.
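One way to keep the Render Thread from touching a dead window is to route every access through a small guarded holder. The sketch below is illustrative only: SurfaceSlot and withWindow are hypothetical names, and Window is a stand-in type so the example compiles off-device. In a real player the pointer would be an ANativeWindow*, typically paired with ANativeWindow_acquire and ANativeWindow_release.

```cpp
#include <mutex>

// Stand-in for ANativeWindow so the sketch builds without the NDK headers.
struct Window {
    int id;
};

// Hypothetical holder owned by the control layer and read by the render thread.
class SurfaceSlot {
public:
    // Handles the AttachSurface command.
    void attach(Window* window) {
        std::lock_guard<std::mutex> lock(mutex_);
        window_ = window;
    }

    // Handles the DetachSurface command. Because it takes the same lock as
    // withWindow(), it cannot return while a frame submission is in progress.
    void detach() {
        std::lock_guard<std::mutex> lock(mutex_);
        window_ = nullptr;
    }

    // Runs `submit` with the current window while holding the lock.
    // Returns false, without calling `submit`, when no window is attached.
    template <typename Fn>
    bool withWindow(Fn&& submit) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (window_ == nullptr) return false;
        submit(window_);
        return true;
    }

private:
    std::mutex mutex_;
    Window* window_ = nullptr;
};
```

The render thread never stores the pointer; it only borrows it for the duration of one submission. Once detach() has returned, no later submission can reach the old window.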
The Novice's Minimum Viable Experiment
Step 1: Implement only the state machine. Do not connect any Media APIs yet.
Open -> Prepared -> Play -> Pause -> Play -> Seek -> Prepared -> Play -> Release
Verify that the logs reflect a perfect, deterministic state progression:
Idle -> Preparing
Preparing -> Ready
Ready -> Playing
Playing -> Paused
Paused -> Playing
Playing -> Seeking
Seeking -> Ready
Ready -> Playing
Playing -> Releasing -> Released
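The dry run in Step 1 can be sketched as a pure transition function plus a tiny driver that records the log. This is a minimal illustration under assumptions: nextState, stateName, and runScript are hypothetical names, the enums are repeated so the example stands alone, and teardown is treated as instantaneous because there are no workers to join yet.

```cpp
#include <string>
#include <vector>

enum class PlayerState { Idle, Preparing, Ready, Playing, Paused, Buffering,
                         Seeking, Ended, Error, Releasing, Released };
enum class PlayerCommandType { Open, Prepared, Play, Pause, Seek,
                               EndOfStream, Fail, Release };

const char* stateName(PlayerState s) {
    switch (s) {
        case PlayerState::Idle: return "Idle";
        case PlayerState::Preparing: return "Preparing";
        case PlayerState::Ready: return "Ready";
        case PlayerState::Playing: return "Playing";
        case PlayerState::Paused: return "Paused";
        case PlayerState::Buffering: return "Buffering";
        case PlayerState::Seeking: return "Seeking";
        case PlayerState::Ended: return "Ended";
        case PlayerState::Error: return "Error";
        case PlayerState::Releasing: return "Releasing";
        case PlayerState::Released: return "Released";
    }
    return "?";
}

// Pure function: returns the unchanged state when the command is illegal.
PlayerState nextState(PlayerState s, PlayerCommandType c) {
    using S = PlayerState;
    using C = PlayerCommandType;
    if (s == S::Releasing || s == S::Released) return s;
    if (c == C::Release) return S::Releasing;
    if (c == C::Fail) return S::Error;
    switch (c) {
        case C::Open: return s == S::Idle ? S::Preparing : s;
        case C::Prepared: return (s == S::Preparing || s == S::Seeking) ? S::Ready : s;
        case C::Play: return (s == S::Ready || s == S::Paused) ? S::Playing : s;
        case C::Pause: return s == S::Playing ? S::Paused : s;
        case C::Seek: return (s == S::Playing || s == S::Paused) ? S::Seeking : s;
        case C::EndOfStream: return s == S::Playing ? S::Ended : s;
        default: return s;
    }
}

// Feeds a command script through the machine and returns the transition log.
std::vector<std::string> runScript(const std::vector<PlayerCommandType>& script) {
    std::vector<std::string> log;
    PlayerState state = PlayerState::Idle;
    for (PlayerCommandType c : script) {
        const PlayerState next = nextState(state, c);
        if (next == state) continue;  // illegal command: ignored, nothing logged
        std::string line = std::string(stateName(state)) + " -> " + stateName(next);
        state = next;
        if (state == PlayerState::Releasing) {
            state = PlayerState::Released;  // no workers to join in the dry run
            line += " -> Released";
        }
        log.push_back(line);
    }
    return log;
}
```

Because nextState has no side effects, the same function can be unit-tested exhaustively on a desktop build before any thread or Media API exists.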
Step 2: Inject "dummy frames." Generate a fake frame roughly every 33 ms (33,333 µs per frame at exactly 30 fps) to simulate a 30fps stream.
Step 3: Only after the engine proves stable under dummy loads should you wire in AMediaExtractor and AMediaCodec.
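The dummy frames from Step 2 can come from a source as small as the sketch below. DummyFrame and DummyFrameSource are hypothetical names. The PTS is computed from the frame index rather than by repeatedly adding 33 ms, since thirty 33 ms steps add up to only 990 ms and the error would grow with every second of playback.

```cpp
#include <cstdint>

struct DummyFrame {
    int64_t index;
    int64_t ptsUs;
};

// Hypothetical stand-in for the demux and decode stages during the dry run.
class DummyFrameSource {
public:
    explicit DummyFrameSource(int fps) : fps_(fps) {}

    // Produces the next frame; PTS = index * 1,000,000 / fps (integer division).
    DummyFrame next() {
        DummyFrame frame{index_, index_ * 1000000 / fps_};
        ++index_;
        return frame;
    }

    // Repositions the source so the next frame starts at or before positionUs.
    void seekTo(int64_t positionUs) { index_ = positionUs * fps_ / 1000000; }

private:
    int64_t fps_;
    int64_t index_ = 0;
};
```

Pushing these frames through the same queues and render path that real frames will use lets you exercise pause, seek, and release before any codec is involved.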
Engineering Acceptance Criteria
A player's architectural skeleton is deemed acceptable only if it passes these assertions:
Worker threads gracefully exit after a Release command.
Rapid-fire Play/Pause spamming does not induce illegal states.
Rapid-fire sequential Seeks execute only the final valid Seek command.
The Render Thread immediately halts frame submission after Surface destruction.
Every error telemetry event explicitly includes: State, Serial, Thread Name, and Root Cause.
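The rapid-fire seek criterion can be met by collapsing pending seeks before the control thread acts on them. The following is one possible sketch, not the series' prescribed method: coalesceSeeks is a hypothetical helper that takes the commands currently waiting in the queue and keeps only the seek with the highest serial, leaving all other commands in order.

```cpp
#include <cstdint>
#include <deque>

enum class PlayerCommandType { Open, Prepared, Play, Pause, Seek,
                               EndOfStream, Fail, Release };

struct PlayerCommand {
    PlayerCommandType type;
    int64_t argumentUs;
    int64_t serial;
};

// Returns the pending commands with every superseded Seek removed.
std::deque<PlayerCommand> coalesceSeeks(const std::deque<PlayerCommand>& pending) {
    // First pass: find the newest seek by serial.
    int64_t lastSeekSerial = -1;
    for (const PlayerCommand& c : pending) {
        if (c.type == PlayerCommandType::Seek && c.serial > lastSeekSerial) {
            lastSeekSerial = c.serial;
        }
    }
    // Second pass: keep non-seek commands and only the newest seek.
    std::deque<PlayerCommand> out;
    for (const PlayerCommand& c : pending) {
        if (c.type != PlayerCommandType::Seek || c.serial == lastSeekSerial) {
            out.push_back(c);
        }
    }
    return out;
}
```

A test for this criterion can push a burst of seeks, run the control loop, and assert that seek_cost_ms was recorded exactly once and that the final position matches the last requested target.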
Conclusion
An NDK player is not an exercise in stringing APIs together; it is a complex, highly concurrent system. By rigorously establishing the State Machine, Command Queues, Thread Boundaries, and Lifecycles first, you contain the race conditions and timing anomalies that otherwise surface as soon as demuxing and decoding logic is wired in.