Extracting Video Frames from MP4: Understanding Demuxing with AMediaExtractor
In the previous chapter, we established the core architectural skeleton of the player. Now, we engage the first true data pipeline: extracting compressed frames from media containers (like MP4) so they can be fed into the decoder.
This operation is called "Demuxing" (De-multiplexing). It is crucial to understand that Demuxing is not Decoding. Demuxing is merely "unpacking the box"; Decoding is "transforming the compressed payload inside the box back into raw images or sound."
Establish Intuition: The Media Container
Conceptualize an MP4 file as a standardized shipping crate. Inside this crate, various streams are packed together:
MP4 Container
Video Track: H.264/H.265 compressed frames
Audio Track: AAC/Opus compressed frames
Subtitle Track: Text or bitmap overlays
Metadata: Duration, Dimensions, Rotation Matrix, Bitrate
AMediaExtractor is your NDK crowbar. It opens the crate, parses the inventory of available tracks, allows you to select specific tracks, and then reads the payloads out sequentially, frame by frame.
Core Terminology
Track: An independent, continuous data stream. Video, audio, and subtitles exist on entirely separate tracks.
Sample: A discrete chunk of data extracted from the container. For video, a sample is almost always exactly one compressed frame. (In NDK parlance, "Sample" often equates to "Packet").
PTS (Presentation Time Stamp): The exact microsecond this specific sample is scheduled to be displayed or played. 1,000,000 us equals 1 second.
Key Frame (Sync Frame): Video compression heavily relies on referencing previous frames. Most frames encode only the delta (differences) from previous images. A Key Frame, however, is a complete, self-contained snapshot. When executing a Seek operation, you must almost always land on a Key Frame.
The Architectural Boundaries of AMediaExtractor
AMediaExtractor is highly specialized. It executes three precise tasks:
1. Binds to a data source (File Descriptor or URI).
2. Parses and exposes Track Formats.
3. Advances through and extracts Samples sequentially.
It explicitly does not do the following:
It will NOT decode H.264 into raw pixels.
It will NOT handle Audio/Video synchronization.
It will NOT interact with your Surface.
It will NOT manage your state machine.
Respecting these API boundaries drastically simplifies debugging.
Binding the Data Source
On modern Android, reading local files via absolute native paths is notoriously brittle due to Scoped Storage. The architecturally sound approach is to obtain a File Descriptor (FD) via ContentResolver or AssetFileDescriptor in Kotlin, and pass that FD down to the native layer.
// Kotlin Boundary
val afd = contentResolver.openAssetFileDescriptor(uri, "r") ?: return
val fd = afd.parcelFileDescriptor.detachFd() // Native code now owns fd and must close() it.
nativeOpenFd(fd, afd.startOffset, afd.length)
afd.close() // Safe: the fd was detached above.
The native layer intercepts the FD and invokes AMediaExtractor_setDataSourceFd:
// Native Boundary
class ExtractorOwner {
public:
    ExtractorOwner() : extractor_(AMediaExtractor_new()) {}
    ~ExtractorOwner() {
        if (extractor_ != nullptr) {
            AMediaExtractor_delete(extractor_);
        }
    }
    // Non-copyable: exactly one owner may delete the extractor.
    ExtractorOwner(const ExtractorOwner&) = delete;
    ExtractorOwner& operator=(const ExtractorOwner&) = delete;

    AMediaExtractor* get() const { return extractor_; }

private:
    AMediaExtractor* extractor_ = nullptr;
};
bool openExtractor(AMediaExtractor* extractor, int fd, int64_t offset, int64_t length) {
    media_status_t status = AMediaExtractor_setDataSourceFd(extractor, fd, offset, length);
    return status == AMEDIA_OK;
}
Notice the rigorous application of RAII. The C-API pointer AMediaExtractor* is bound to a C++ lifecycle class, so the extractor is released on every exit path. "Resource Acquisition Is Initialization" prevents leaks when the player is torn down abruptly or an early return skips a manual cleanup call.
Parsing and Enumerating Tracks
To map the container's contents, utilize AMediaExtractor_getTrackCount and AMediaExtractor_getTrackFormat.
struct TrackInfo {
    int index = -1;
    std::string mime;
    int32_t width = 0;
    int32_t height = 0;
};

bool readCString(AMediaFormat* format, const char* key, std::string* out) {
    const char* raw = nullptr;
    if (!AMediaFormat_getString(format, key, &raw) || raw == nullptr) {
        return false;
    }
    *out = raw;
    return true;
}
std::optional<TrackInfo> findVideoTrack(AMediaExtractor* extractor) {
    const size_t count = AMediaExtractor_getTrackCount(extractor);
    for (size_t i = 0; i < count; ++i) {
        AMediaFormat* format = AMediaExtractor_getTrackFormat(extractor, i);
        if (format == nullptr) continue;

        std::string mime;
        bool hasMime = readCString(format, AMEDIAFORMAT_KEY_MIME, &mime);

        TrackInfo info;
        info.index = static_cast<int>(i);
        info.mime = mime;
        AMediaFormat_getInt32(format, AMEDIAFORMAT_KEY_WIDTH, &info.width);
        AMediaFormat_getInt32(format, AMEDIAFORMAT_KEY_HEIGHT, &info.height);

        // CRITICAL: Prevent memory leaks.
        AMediaFormat_delete(format);

        if (hasMime && mime.rfind("video/", 0) == 0) {
            return info;
        }
    }
    return std::nullopt;
}
Mandatory Warning: The AMediaFormat* returned by AMediaExtractor_getTrackFormat must be explicitly destroyed via AMediaFormat_delete. Failure to do so will leak memory on every track iteration.
Selecting the Target Track
Once the video track is identified, you must explicitly select it.
bool selectVideoTrack(AMediaExtractor* extractor, const TrackInfo& track) {
    media_status_t status = AMediaExtractor_selectTrack(extractor, track.index);
    return status == AMEDIA_OK;
}
Post-selection, the extraction APIs (readSampleData, getSampleTrackIndex, getSampleTime) will only yield data from selected tracks. If you do not select the audio track, you will never see an audio sample.
Extracting Samples Sequentially
The execution sequence for extraction is rigid:
1. readSampleData
2. getSampleTime
3. getSampleFlags
4. getSampleTrackIndex
5. advance
readSampleData: Copies the current sample payload into your buffer.
getSampleTime: Retrieves the PTS for the current sample.
advance: Mutates the extractor's internal pointer to the next sequential sample.
struct Packet {
    std::vector<uint8_t> data;
    int64_t ptsUs = 0;
    uint32_t flags = 0;
    int trackIndex = -1;
};

std::optional<Packet> readOnePacket(AMediaExtractor* extractor) {
    // Note: AMediaExtractor_getSampleSize requires API level 28+.
    const ssize_t sampleSize = AMediaExtractor_getSampleSize(extractor);
    if (sampleSize < 0) {
        return std::nullopt; // End of Stream or Error
    }

    Packet packet;
    packet.data.resize(static_cast<size_t>(sampleSize));
    ssize_t readSize = AMediaExtractor_readSampleData(
        extractor,
        packet.data.data(),
        packet.data.size());
    if (readSize < 0) {
        return std::nullopt;
    }
    packet.data.resize(static_cast<size_t>(readSize));

    packet.ptsUs = AMediaExtractor_getSampleTime(extractor);
    packet.flags = AMediaExtractor_getSampleFlags(extractor);
    packet.trackIndex = AMediaExtractor_getSampleTrackIndex(extractor);

    // CRITICAL: Advance must happen AFTER reading metadata.
    AMediaExtractor_advance(extractor);
    return packet;
}
Reversing this sequence is fatal. If you call advance first, the extractor shifts, and you will read the PTS of the next frame instead of the frame you just buffered.
Throttling via Bounded Queues
Never allow the Demux Thread to extract frames infinitely. A 4K video file is massive; if the decoder stalls, an unthrottled demuxer will rapidly devour all available RAM, triggering an Out-Of-Memory (OOM) termination.
class PacketQueue {
public:
    explicit PacketQueue(size_t capacity) : capacity_(capacity) {}

    bool push(Packet packet) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (queue_.size() >= capacity_) {
            return false; // Queue is full, enforce backpressure.
        }
        queue_.push(std::move(packet));
        cv_.notify_one();
        return true;
    }

    bool pop(Packet* out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return !queue_.empty() || stopped_; });
        if (queue_.empty()) return false;
        *out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

    // Unblocks pop() on shutdown; without this, a waiting consumer
    // would sleep forever because nothing else ever sets stopped_.
    void stop() {
        std::lock_guard<std::mutex> lock(mutex_);
        stopped_ = true;
        cv_.notify_all();
    }

    void flush() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::queue<Packet> empty;
        queue_.swap(empty);
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Packet> queue_;
    size_t capacity_ = 0;
    bool stopped_ = false;
};
This bounded queue introduces "backpressure." When the queue hits capacity, the Demux Thread must sleep, preventing memory exhaustion.
The Mechanics of Seeking to Key Frames
If a user scrubs to 01:30, you cannot simply jump to arbitrary frame X. If X is a delta frame lacking complete image data, the decoder will output corrupted, glitching visuals ("macroblock tearing") until the next Key Frame arrives.
Therefore, seeking is explicitly directed to sync points:
AMediaExtractor_seekTo(
extractor,
targetUs,
AMEDIAEXTRACTOR_SEEK_PREVIOUS_SYNC
);
PREVIOUS_SYNC commands the extractor to jump to the closest Key Frame immediately preceding the target timestamp, ensuring the decoder receives a clean baseline image.
Diagnostic Signatures
Black Screen (No Crash): The extractor is yielding packets, but the Decoder configuration is mismatched. Audit the MIME type, dimensions, and csd-0/csd-1 (Codec Specific Data) byte buffers.
Visual Glitch on Seek: The Packet Queue was not flushed during the Seek command. New timeline packets are colliding with stale timeline packets in the decoder.
Monotonic RAM Bloat: The Packet Queue lacks a hard capacity limit, or you are failing to call AMediaFormat_delete.
Laboratory Verification
Before integrating AMediaCodec, execute a pure Demux test. Extract and log the sample metadata directly to the terminal:
track=0 size=4211 ptsUs=0 flags=1 (Key Frame)
track=0 size=932 ptsUs=33333 flags=0 (Delta)
track=0 size=887 ptsUs=66666 flags=0 (Delta)
Verify these invariant rules:
1. ptsUs must increase monotonically. (This holds for streams without B-frames; with B-frames, samples arrive in decode order, and PTS can legitimately step backwards.)
2. sampleSize must not be zero for long runs of consecutive samples.
3. trackIndex strictly matches the selected video track ID.
4. When EOF is reached, `advance` yields false or size goes negative.
Engineering Risks and Telemetry
The Demux pipeline demands specific observability metrics:
packet_queue_size (Detects downstream decoder stalls)
sample_pts_us (Detects timeline corruption)
sample_track_index (Detects demuxing cross-talk)
read_timeout_count (Detects bad I/O)
seek_serial (Detects overlapping UI seek spam)
If the queue spikes, apply backpressure. If PTS flows backwards, the container is corrupted or your seek logic is flawed. If reads fail, do not instantly trigger a fatal state; differentiate between recoverable network timeouts and permanent EOF/permission aborts.
Conclusion
Demuxing lacks the visual glamour of rendering, but it dictates the absolute purity of the downstream pipeline. A robust demuxing layer yields clean payload data, accurate track isolation, precise timestamps, and strict memory bounds.