Audio Output and AVSync: Forcing Audio and Video into Perfect Alignment
Achieving video rendering and audio output independently does not mean you have built a video player. The true engineering challenge lies in forcing two entirely asynchronous data streams onto a unified, mathematically precise timeline.
When an actor's lips move, the corresponding audio must land at the same instant. If audio precedes video, users perceive "sluggish visuals." If video precedes audio, users experience the disjointed "dubbed movie" effect. This strict synchronization is termed AVSync (Audio/Video Synchronization).
Establish Intuition: The Three Timelines
A player must continuously juggle three entirely distinct definitions of "time."
File Time (PTS): The static timestamp embedded in the media container indicating when a specific frame should appear.
System Time: The device's monotonic system clock (e.g., CLOCK_MONOTONIC), which advances steadily and is immune to wall-clock adjustments.
Playback Time (Clock): The dynamic, calculated indicator of where the player currently is in the media.
The most dangerous mistake a junior engineer can make is conflating "the frame we just read" with "the frame we should display."
Reading Frame 100: Indicates I/O and Decoding have reached this point.
Displaying Frame 100: Indicates the Playback Clock has authorized this frame for visibility.
Why Audio Acts as the Master Clock
Audio hardware operates with absolute mathematical rigidity. If audio playback fluctuates by even a fraction of a millisecond, the human ear instantly detects glitching or pitch shifting. The human eye, however, is highly forgiving of a video frame dropped to maintain pace.
Consequently, industry-standard architecture dictates that video is slaved to audio.
Audio Clock: "I am currently outputting the sample for 10.000s."
Video PTS: "My next decoded image is designated for 10.033s."
Verdict: The video frame is 33ms early. Sleep and wait.
If a video PTS is 9.900s, but the Audio Clock has already advanced to 10.000s, the video frame is 100ms late. Displaying it will only drag the visual timeline further into the past. It must be brutally dropped.
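Both verdicts reduce to a single signed subtraction. A minimal sketch, in microseconds:

```cpp
#include <cstdint>

// Positive result: the frame is early (wait). Negative: the frame is late.
int64_t videoDelayUs(int64_t videoPtsUs, int64_t audioClockUs) {
    return videoPtsUs - audioClockUs;
}

// videoDelayUs(10'033'000, 10'000'000) ==  33'000  -> 33 ms early: wait
// videoDelayUs( 9'900'000, 10'000'000) == -100'000 -> 100 ms late: drop
```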
The Minimal Audio Vector
To isolate complexity during initial development, architect the audio output as a hybrid system:
Native Layer: Demuxing, Video Decoding, Video Render Scheduling.
Kotlin Layer: AudioTrack for PCM output, broadcasting Audio Clock position back to Native.
Once the AVSync algorithm is bulletproof, the audio pipeline can be fully migrated to NDK via AAudio or Oboe. Proving the math is more critical than rushing to a pure C++ implementation.
In Kotlin, track the exact number of PCM frames dispatched to AudioTrack.
class AudioClock(private val sampleRate: Int) {

    private var writtenFrames: Long = 0

    fun onWritePcm(bytes: Int, channelCount: Int, bytesPerSample: Int) {
        val frames = bytes / channelCount / bytesPerSample
        writtenFrames += frames
    }

    fun positionUs(): Long {
        return writtenFrames * 1_000_000L / sampleRate
    }
}
Note: This is an educational abstraction. Production implementations must account for internal hardware buffer latencies or query AudioTrack.getTimestamp for true hardware playhead positions.
The Native Clock Object
Never poll system time arbitrarily across worker threads. Centralize it within a thread-safe ClockSync entity.
#include <cstdint>
#include <mutex>

class ClockSync {
public:
    void updateAudioClockUs(int64_t audioUs) {
        std::lock_guard<std::mutex> lock(mutex_);
        audioClockUs_ = audioUs;
    }

    int64_t audioClockUs() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return audioClockUs_;
    }

    // Positive: the video frame is early. Negative: it is late.
    int64_t videoDelayUs(int64_t videoPtsUs) const {
        return videoPtsUs - audioClockUs();
    }

private:
    mutable std::mutex mutex_;
    int64_t audioClockUs_ = 0;
};
If videoDelayUs > 0: The video frame has arrived early.
If videoDelayUs < 0: The video frame is late.
The Video Scheduling Algorithm
A minimal sync engine compares each frame's delay against two rigid thresholds, yielding three outcomes.
Severely Early: Sleep and block the thread.
On Time: Release to Surface immediately.
Severely Late: Drop the frame silently.
#include <cstdint>

enum class FrameDecision {
    Wait,
    Render,
    Drop,
};

FrameDecision decideFrame(int64_t delayUs) {
    // 20ms early allowance
    constexpr int64_t kEarlyThresholdUs = 20 * 1000;
    // 50ms late allowance
    constexpr int64_t kLateThresholdUs = -50 * 1000;
    if (delayUs > kEarlyThresholdUs) {
        return FrameDecision::Wait;
    }
    if (delayUs < kLateThresholdUs) {
        return FrameDecision::Drop;
    }
    return FrameDecision::Render;
}
These thresholds are not static laws. Tuning is required based on target frame rates (30fps vs 60fps), decoder speeds, and rendering paths. However, the architectural constraint stands: The decision logic must remain isolated in a unified function for telemetry tracking.
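One way to honor that constraint while keeping the thresholds tunable is to inject them as a parameter. A sketch, restating the enum so the fragment stands alone; the struct and field names are assumptions, and the default values are illustrative, not prescriptive:

```cpp
#include <cstdint>

enum class FrameDecision { Wait, Render, Drop };

// Tunable thresholds; adjust per frame rate, decoder speed, and render path.
struct SyncTuning {
    int64_t earlyThresholdUs = 20'000;   // wait if more than 20 ms early
    int64_t lateThresholdUs  = -50'000;  // drop if more than 50 ms late
};

// Still one unified decision function, so telemetry has a single choke point.
FrameDecision decideFrame(int64_t delayUs, const SyncTuning& t) {
    if (delayUs > t.earlyThresholdUs) return FrameDecision::Wait;
    if (delayUs < t.lateThresholdUs)  return FrameDecision::Drop;
    return FrameDecision::Render;
}
```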
The Render Loop
#include <chrono>
#include <cstdint>
#include <thread>

#include <media/NdkMediaCodec.h>

void renderFrame(AMediaCodec* codec, size_t outputIndex, int64_t ptsUs, ClockSync* clock) {
    while (true) {
        int64_t delayUs = clock->videoDelayUs(ptsUs);
        FrameDecision decision = decideFrame(delayUs);
        if (decision == FrameDecision::Wait) {
            // Sleep for half the delay to prevent overshooting the target.
            std::this_thread::sleep_for(std::chrono::microseconds(delayUs / 2));
            continue;
        }
        if (decision == FrameDecision::Drop) {
            // Drop: return the buffer to the codec without presenting it.
            AMediaCodec_releaseOutputBuffer(codec, outputIndex, false);
            return;
        }
        // Render: release the buffer to the Surface for display.
        AMediaCodec_releaseOutputBuffer(codec, outputIndex, true);
        return;
    }
}
Sleeping for delayUs / 2 is a standard heuristic to counteract OS thread scheduling inaccuracies. High-performance systems implement more aggressive spin-wait or VSYNC-aligned schedulers.
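A sketch of that spin-wait refinement: sleep coarsely, then burn the final stretch yielding in a loop, so the wake-up error is bounded by the spin granularity rather than the OS scheduler. The 2 ms margin is an assumption, not a measured constant:

```cpp
#include <chrono>
#include <thread>

// Sleep until roughly targetTime, then busy-wait the remainder.
void preciseWaitUntil(std::chrono::steady_clock::time_point targetTime) {
    constexpr auto kSpinMargin = std::chrono::milliseconds(2);
    const auto coarse = targetTime - kSpinMargin;
    if (std::chrono::steady_clock::now() < coarse) {
        std::this_thread::sleep_until(coarse);  // cheap, but may overshoot
    }
    while (std::chrono::steady_clock::now() < targetTime) {
        std::this_thread::yield();  // spin the last ~2 ms for precision
    }
}
```

The trade-off is CPU burn during the spin window, which is why production schedulers prefer VSYNC callbacks where available.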
Why Sleep is Insufficient
Calling std::this_thread::sleep_for is not a precision sniper rifle. The Android OS scheduler guarantees you will not wake up earlier than requested, but it makes zero guarantees about how late you might wake up.
Therefore, AVSync cannot rely solely on pausing execution. It must combine all three outcomes: Wait (sleep briefly and re-evaluate), Render (release to the Surface), and Drop (discard the late frame).
The Video-Only Fallback Clock
Media lacking an audio track cannot rely on an Audio Master Clock. The engine must seamlessly fall back to a strictly monotonic system clock.
#include <chrono>
#include <cstdint>

class VideoOnlyClock {
public:
    void startAt(int64_t firstPtsUs) {
        firstPtsUs_ = firstPtsUs;
        startTime_ = std::chrono::steady_clock::now();
    }

    int64_t nowUs() const {
        auto elapsed = std::chrono::steady_clock::now() - startTime_;
        return firstPtsUs_ + std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    }

private:
    int64_t firstPtsUs_ = 0;
    std::chrono::steady_clock::time_point startTime_;
};
Telemetry Triumphs Over Intuition
Print the synchronization telemetry matrix every second.
audioUs=10333000 videoPtsUs=10366000 driftUs=33000 decision=wait
audioUs=10400000 videoPtsUs=10366000 driftUs=-34000 decision=render
audioUs=10500000 videoPtsUs=10400000 driftUs=-100000 decision=drop
These logs are absolute proof of root causes:
Continuous WAIT: Decoder is outpacing the Audio Clock, or the Audio Clock has stalled.
Continuous DROP: Decoder is thrashing, CPU starved, or Surface is blocked.
Violent Drift Oscillations: Audio Clock logic is erratic or OS scheduling is aggressively throttling the thread.
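A sketch of a helper that emits the one-line format shown above; snprintf is chosen here to stay allocation-free if called from the render thread:

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats one telemetry line, e.g.
// "audioUs=10333000 videoPtsUs=10366000 driftUs=33000 decision=wait"
std::string formatSyncLine(int64_t audioUs, int64_t videoPtsUs, const char* decision) {
    char buf[128];
    std::snprintf(buf, sizeof(buf),
                  "audioUs=%" PRId64 " videoPtsUs=%" PRId64 " driftUs=%" PRId64 " decision=%s",
                  audioUs, videoPtsUs, videoPtsUs - audioUs, decision);
    return buf;
}
```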
Laboratory Verification
Load an MP4 containing 30fps video and AAC audio.
- First, disable the Drop logic. Force all frames to Render. Observe the drift escalate.
- Enable the Drop logic. Verify visual stutter (when late) but ensure the drift mathematically converges to zero.
- Rapidly foreground/background the app. Ensure the audio clock does not erroneously reset to 0, permanently destroying synchronization.
Track the following metrics explicitly:
driftUs mean
driftUs P95
dropCount
firstAudioUs
firstVideoPtsUs
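The drift metrics can be gathered with a small accumulator. A sketch; the P95 here sorts a copy on demand, which is fine for a test run but too slow for a hot path:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Accumulates per-frame drift samples and drop events for offline inspection.
class SyncStats {
public:
    void addDrift(int64_t driftUs) { drifts_.push_back(driftUs); }
    void addDrop() { ++dropCount_; }

    int64_t dropCount() const { return dropCount_; }

    double meanDriftUs() const {
        if (drifts_.empty()) return 0.0;
        double sum = 0.0;
        for (int64_t d : drifts_) sum += static_cast<double>(d);
        return sum / static_cast<double>(drifts_.size());
    }

    // 95th percentile of drift; sorts a copy, so call it off the hot path.
    int64_t p95DriftUs() const {
        if (drifts_.empty()) return 0;
        std::vector<int64_t> sorted = drifts_;
        std::sort(sorted.begin(), sorted.end());
        size_t idx = std::min(sorted.size() - 1, (sorted.size() * 95) / 100);
        return sorted[idx];
    }

private:
    std::vector<int64_t> drifts_;
    int64_t dropCount_ = 0;
};
```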
Engineering Risks and Observability
A failed AVSync loop rarely generates a Tombstone crash, but it violently degrades the user experience.
Telemetry endpoints:
av_drift_us
drop_frame_count
wait_frame_count
audio_clock_source
video_only_clock_enabled
If av_drift_us is permanently positive, video is arriving early. Suspect audio clock stalling or an overly aggressive sleep loop.
If av_drift_us is permanently negative, video is chronically late. Suspect hardware decoder bottlenecks, I/O latency, or CPU starvation.
If drop_frame_count spikes exponentially, hook up simpleperf immediately to identify the blocking function call.
AVSync must incorporate graceful degradation pathways:
If the Audio device dies -> Failover to System Clock.
If OS scheduling latency exceeds 100ms -> Temporarily widen drop thresholds to prevent cascading drops.
On return from Background state -> Forcibly reset the Master Clock baseline against the current PTS.
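The failover rules above can be centralized in one selector, so every consumer agrees on which clock is authoritative. A sketch; the type and field names are assumptions:

```cpp
// Which clock currently drives the sync decisions.
enum class ClockSource {
    Audio,        // normal case: AudioTrack feeds the master clock
    SystemClock,  // fallback: no audio track, or the audio device died
};

struct ClockHealth {
    bool hasAudioTrack = true;
    bool audioDeviceAlive = true;
};

// Prefer the audio clock whenever it is usable; otherwise fall back.
ClockSource selectClockSource(const ClockHealth& h) {
    if (h.hasAudioTrack && h.audioDeviceAlive) return ClockSource::Audio;
    return ClockSource::SystemClock;
}
```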
Conclusion
AVSync is not an API call; it is a rigid mathematical discipline. The audio track establishes the absolute timeline, and the video pipeline aggressively conforms—waiting, rendering, or discarding data as commanded by the time differential. Only when these two asynchronous streams are subjugated to the Master Clock does a decoder transform into a true Player.