NDK Module Production Readiness Checklist: From 'It Compiles' to 'Ship It'
The chasm between an NDK module that "technically runs" and one that is "cleared for production" is vast. A native crash is catastrophic: it takes down the entire process. Furthermore, real-time modules like video players are entangled in complex, volatile subsystems: multi-threading, Surface lifecycles, hardware codecs, direct memory, AV clocks, and third-party dependencies.
This document provides the definitive pre-release engineering checklist. It is not bureaucratic theater; it is the distillation of the architecture principles we have established, codified into an absolute release gateway.
Hardware and API Compatibility
The `minSdk` specification aligns exactly with `ANDROID_PLATFORM`, or deliberate degradation paths exist.
Every supported target ABI has a corresponding `.so` binary packaged in the final artifact.
The `arm64-v8a` binary has not been accidentally omitted.
Calls to NDK APIs introduced after `minSdk` are guarded by runtime availability checks or fallback implementations.
All native libraries pass the 16 KB page-size alignment check (Android 15 and later support devices with 16 KB pages).
All third-party `.so` files are fully documented (Source, Version, ABI, License).
The most frequent novice error is ignoring embedded `.so` payloads hidden inside third-party SDKs. If a third-party binary ships inside your APK, its failure is your failure.
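One part of the 16 KB check can be expressed directly: every `PT_LOAD` segment in a library must declare an alignment that is a multiple of 16 KB. The sketch below parses a 64-bit little-endian ELF image held in memory and verifies exactly that. The names `readField` and `loadSegmentsAre16KAligned` are hypothetical, the header offsets come from the ELF64 layout, and the sketch does not cover the separate question of how the library is aligned inside the APK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr uint64_t kPage16K = 16 * 1024;
constexpr uint32_t kPtLoad = 1;          // PT_LOAD from the ELF specification
constexpr size_t kEhdrSize = 64;         // size of the ELF64 file header
constexpr size_t kPhdrAlignOffset = 48;  // offset of p_align in an ELF64 program header

// Copies a fixed-width field out of the image; fails instead of reading out of bounds.
// Assumes a little-endian host, matching the little-endian ELF files Android ships.
template <typename T>
bool readField(const std::vector<uint8_t>& image, uint64_t offset, T* out) {
    assert(out != nullptr);
    if (offset > image.size() || image.size() - offset < sizeof(T)) return false;
    std::memcpy(out, image.data() + offset, sizeof(T));
    return true;
}

// Returns true only if the image is a parseable ELF64 file with at least one
// PT_LOAD segment and every PT_LOAD segment has p_align as a multiple of 16 KB.
bool loadSegmentsAre16KAligned(const std::vector<uint8_t>& image) {
    if (image.size() < kEhdrSize) return false;
    const bool isElf64 = image[0] == 0x7F && image[1] == 'E' && image[2] == 'L' &&
                         image[3] == 'F' && image[4] == 2;  // EI_CLASS == ELFCLASS64
    if (!isElf64) return false;

    uint64_t phoff = 0;      // e_phoff at offset 0x20
    uint16_t phentsize = 0;  // e_phentsize at offset 0x36
    uint16_t phnum = 0;      // e_phnum at offset 0x38
    if (!readField(image, 0x20, &phoff) || !readField(image, 0x36, &phentsize) ||
        !readField(image, 0x38, &phnum) || phoff > image.size()) {
        return false;
    }

    bool sawLoad = false;
    for (uint16_t i = 0; i < phnum; ++i) {
        const uint64_t entry = phoff + static_cast<uint64_t>(i) * phentsize;
        uint32_t type = 0;
        uint64_t align = 0;
        if (!readField(image, entry, &type) ||
            !readField(image, entry + kPhdrAlignOffset, &align)) {
            return false;
        }
        if (type != kPtLoad) continue;
        sawLoad = true;
        if (align < kPage16K || align % kPage16K != 0) return false;
    }
    return sawLoad;
}
```

A CI step could load each packaged `.so` (including those from third-party SDKs) into a byte vector and fail the build when this function returns false.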
The JNI Boundary Contract
JNI interface methods are strictly isolated to a minimal set of facade bridge files.
High-frequency per-frame operations NEVER traverse the JNI boundary.
`JNIEnv*` pointers are NEVER cached across thread boundaries.
Kotlin/Java objects passed to background C++ threads are always promoted to global references with `NewGlobalRef`.
Every `NewGlobalRef` is provably matched with a deterministic `DeleteGlobalRef`.
`JNI_OnLoad` initialization failures emit definitive, actionable error logs.
`RegisterNatives` signatures are verified by automated test coverage.
The JNI boundary must remain an extremely narrow, heavily guarded checkpoint. A wide, porous boundary guarantees synchronization failures.
Memory and Concurrency Integrity
All raw C API resources are encapsulated within RAII smart pointers or scope guards.
`AMediaExtractor`, `AMediaCodec`, and `ANativeWindow` allocations and destructions are perfectly paired.
The `release` operation is idempotent (safe to call multiple times).
The Render Thread NEVER attempts to draw to an `ANativeWindow` after it has been destroyed.
Background thread termination protocols are explicit and deterministic.
All internal command and packet queues enforce strict capacity boundaries (Backpressure).
Seek operations atomically flush both the packet queues and the hardware codec.
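Three of the items above (RAII ownership of C resources, an idempotent `release`, and deterministic thread termination) can be combined in one small sketch. `FakeCodec`, its create and delete functions, and `Player` are hypothetical stand-ins; on Android the deleter would wrap a call such as `AMediaCodec_delete`, and the worker loop would block on a queue rather than spin.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

// Stand-in for an opaque C resource such as AMediaCodec (hypothetical).
struct FakeCodec {};
static int g_liveCodecs = 0;  // counts live resources so a leak would be visible
FakeCodec* FakeCodec_create() { ++g_liveCodecs; return new FakeCodec{}; }
void FakeCodec_delete(FakeCodec* codec) { --g_liveCodecs; delete codec; }

// RAII: the deleter guarantees the C destroy call runs exactly once.
struct FakeCodecDeleter {
    void operator()(FakeCodec* codec) const { FakeCodec_delete(codec); }
};
using CodecPtr = std::unique_ptr<FakeCodec, FakeCodecDeleter>;

class Player {
public:
    Player() : codec_(FakeCodec_create()), worker_([this] { run(); }) {
        assert(codec_ != nullptr);
    }
    ~Player() { release(); }

    // Idempotent: only the first call performs teardown; later calls return at once.
    // Note that a concurrent second caller may return before the first has finished.
    void release() {
        if (released_.exchange(true)) return;
        stop_.store(true);                       // 1. ask the worker to exit
        if (worker_.joinable()) worker_.join();  // 2. wait until it really has
        codec_.reset();                          // 3. only then free what it used
    }

    bool released() const { return released_.load(); }

private:
    void run() {
        while (!stop_.load()) std::this_thread::yield();  // placeholder work loop
    }

    std::atomic<bool> released_{false};
    std::atomic<bool> stop_{false};
    CodecPtr codec_;
    std::thread worker_;  // declared last so every other member exists before it starts
};
```

The ordering inside `release` is the point of the sketch: the thread is joined before the resource it touches is destroyed, so no worker can observe a dangling handle.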
At a minimum, the module must survive the following barrage before release:
Rapidly entering and exiting the playback UI context.
Spamming extreme seek operations.
Violent device rotation (Surface teardown and recreation).
Rapid backgrounding and foregrounding of the application.
Injection of corrupted packets or simulated network latency.
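Seek spam and corrupted-input tests mostly exercise two checklist items: queue capacity limits and the seek flush. Below is a minimal sketch of a capacity-bounded queue whose `flush` also advances a serial number so that stale packets from before the seek can be recognized. `BoundedQueue` and its method names are hypothetical, and a real seek would also need to flush the codec (for example with `AMediaCodec_flush`) under the same coordination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <utility>

template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

    // Backpressure: refuses the item when full so the producer must slow down.
    bool tryPush(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.size() >= capacity_) return false;
        items_.push_back(std::move(item));
        return true;
    }

    // Returns false when the queue is empty.
    bool tryPop(T* out) {
        assert(out != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        *out = std::move(items_.front());
        items_.pop_front();
        return true;
    }

    // Called on seek: drops everything queued and returns the new serial number.
    uint64_t flush() {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.clear();
        return ++serial_;
    }

    size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    const size_t capacity_;
    mutable std::mutex mutex_;
    std::deque<T> items_;
    uint64_t serial_ = 0;
};
```

A non-blocking `tryPush` is one possible design; a blocking push with a condition variable applies the same backpressure but needs an explicit wake-up path during shutdown.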
Telemetry and Crash Observability
Unstripped `.so` symbol files are successfully archived for every release candidate.
`ndk-stack` is proven capable of symbolicating a synthetic crash.
Native logs emit structured context: State, Serial ID, Thread ID, and Source Component.
Error codes explicitly bifurcate into Recoverable and Fatal taxonomies.
The production crash reporting platform can aggregate failures by Target ABI and OS Version.
If you cannot archive symbols, you cannot ship the native module. This is an absolute, non-negotiable hard stop.
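The log-context and error-taxonomy items can be sketched as follows. The error codes, their assignment to Recoverable or Fatal, and the `key=value` log format are all assumptions made for illustration; a real module would define its own set.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

enum class Severity { Recoverable, Fatal };

// Hypothetical error codes, not taken from any real player.
enum class PlayerError { NetworkTimeout, BufferUnderrun, CodecConfigureFailed, OutOfMemory };

// Every error code maps to exactly one severity class.
Severity severityOf(PlayerError error) {
    switch (error) {
        case PlayerError::NetworkTimeout:
        case PlayerError::BufferUnderrun:
            return Severity::Recoverable;
        case PlayerError::CodecConfigureFailed:
        case PlayerError::OutOfMemory:
            return Severity::Fatal;
    }
    return Severity::Fatal;  // values outside the enum are treated as fatal
}

// Builds one log line carrying component, state, serial ID, thread ID, and severity.
std::string formatLogLine(const char* component, const char* state, uint64_t serial,
                          unsigned long threadId, PlayerError error) {
    assert(component != nullptr && state != nullptr);
    std::ostringstream out;
    out << "component=" << component << " state=" << state << " serial=" << serial
        << " tid=" << threadId
        << " severity=" << (severityOf(error) == Severity::Fatal ? "fatal" : "recoverable");
    return out.str();
}
```

Keeping the severity mapping in one function means the recovery logic and the crash dashboard classify the same error the same way.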
Performance Validation
Time-to-First-Frame (TTFF) operates within established latency baselines.
Seek-Recovery latency operates within established baselines.
Continuous 30-minute playback yields zero memory leaks or queue ballooning.
`simpleperf` profiling confirms the absence of pathological CPU hotspots.
Post-optimization performance deltas are documented with before-and-after measurements.
JNI callback frequencies are tightly throttled.
Logging operations do NOT execute inside the per-frame render loop.
Performance metrics must evaluate P95 and P99 tail latencies. In media engineering, the user evaluates your player entirely by the frequency of severe stutters, not by the statistical average frame time.
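To make the difference between averages and tails concrete, here is a minimal nearest-rank percentile sketch. The function name `percentile` and the sample values in the note that follows are illustrative assumptions, and other percentile definitions (for example, with interpolation) would give slightly different numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Takes a copy so it can sort.
double percentile(std::vector<double> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const double n = static_cast<double>(samples.size());
    const size_t rank = static_cast<size_t>(std::ceil(p * n / 100.0));  // 1-based
    return samples[rank - 1];
}
```

With 97 frames at 16 ms and 3 stutters at 120 ms, the mean is about 19 ms and P95 is still 16 ms, but P99 is 120 ms. Only the tail metric exposes the stutters a user would actually notice.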
Media Pipeline Integrity
`AMediaExtractor` accurately selects and activates the correct media tracks.
Output Presentation Time Stamps (`ptsUs`) increase monotonically, and expected exceptions (a seek, or input samples reordered by B-frames) are handled explicitly.
Every output buffer acquired from `AMediaCodec` is deterministically released.
Output format changes (`AMEDIACODEC_INFO_OUTPUT_FORMAT_CHANGED`), including mid-stream resolution shifts, are handled seamlessly.
End-Of-Stream (EOS) signals are coordinated across both Audio and Video tracks.
AV sync drift remains within the project's defined tolerance (typically on the order of tens of milliseconds).
If Audio hardware is unavailable, the pipeline falls back to an independent System Clock.
Pay explicit attention to `AMediaCodec_releaseOutputBuffer`. If output buffers are not released back to the codec, it runs out of buffers to fill and stalls, freezing video playback.
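One way to make the release of output buffers deterministic is a scope guard that releases the buffer on every exit path. The sketch below uses a hypothetical `StubCodec` in place of the real codec so that it stays self-contained; on Android the guard would call `AMediaCodec_releaseOutputBuffer(codec, index, render)` instead of the stub.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in that only counts buffers, so leaks are observable.
struct StubCodec {
    int outstanding = 0;  // buffers dequeued but not yet released
    int rendered = 0;
    int dropped = 0;
};

size_t stubDequeueOutputBuffer(StubCodec& codec) {
    ++codec.outstanding;
    return 0;  // buffer index; the stub always hands out index 0
}

void stubReleaseOutputBuffer(StubCodec& codec, size_t /*index*/, bool render) {
    assert(codec.outstanding > 0);
    --codec.outstanding;
    if (render) ++codec.rendered; else ++codec.dropped;
}

// Releases the buffer exactly once: rendered if render() was called,
// otherwise dropped when the guard goes out of scope.
class OutputBufferGuard {
public:
    OutputBufferGuard(StubCodec& codec, size_t index) : codec_(codec), index_(index) {}
    OutputBufferGuard(const OutputBufferGuard&) = delete;
    OutputBufferGuard& operator=(const OutputBufferGuard&) = delete;
    ~OutputBufferGuard() { release(false); }

    void render() { release(true); }

private:
    void release(bool render) {
        if (released_) return;
        released_ = true;
        stubReleaseOutputBuffer(codec_, index_, render);
    }

    StubCodec& codec_;
    size_t index_;
    bool released_ = false;
};

// Returns true if the frame was rendered, false if it was dropped as late.
bool presentFrame(StubCodec& codec, bool frameIsLate) {
    OutputBufferGuard buffer(codec, stubDequeueOutputBuffer(codec));
    if (frameIsLate) return false;  // early return still releases the buffer
    buffer.render();
    return true;
}
```

The early return in `presentFrame` is the case that usually leaks in hand-written code; with the guard, the drop path and the render path both hand the buffer back.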
The Triage Matrix: Blockers vs Observations
Audit findings must be triaged into three strict classifications:
Blocking Failure: Deployment is halted until the defect is resolved.
Observational Risk: Deployment may proceed (via staged rollout), but requires explicit telemetry monitoring.
Clearance: The module meets or exceeds the production standard.
Examples of Blocking Failures:
The `arm64-v8a` binary is missing from the artifact.
The CI system fails to archive unstripped symbol files.
The 16KB page alignment verification fails.
A background thread refuses to terminate following a `release` command.
The Render Thread crashes upon Surface destruction.
Examples of Observational Risks:
P95 Seek Latency spikes slightly on a specific subset of low-tier hardware.
A third-party SDK exports a massive symbol table, though no immediate conflicts exist.
GWP-ASan memory sampling reports an isolated, unconfirmed anomaly.
The Post-Deployment 7-Day Protocol
Deployment is the beginning of the engineering lifecycle, not the end.
Monitor the absolute Native Crash Rate.
Monitor the Application Not Responding (ANR) Rate.
Track P95 Time-to-First-Frame telemetry.
Track P95 Seek Recovery telemetry.
Track the frequency of Fatal Error state transitions.
Track the frequency of Buffer Starvation events.
Analyze the distribution of ABIs and Device Models.
Correlate any crash spikes against newly integrated third-party SDKs.
Native modules require aggressive phased rollouts (canary deployments) and the engineering discipline to execute an immediate rollback when telemetry dictates.
Phased Utilization for Initiates
Do not interpret this checklist as an insurmountable monolith. Apply it chronologically throughout the development cycle.
Development Phase: Focus aggressively on JNI boundaries, RAII resource management, and Thread termination.
Integration Phase: Focus on State Machine integrity, Seek mechanics, and Surface lifecycle handling.
Testing Phase: Deploy Sanitizers (ASan/HWASan), execute `simpleperf` baselines, and audit hardware compatibility.
Release Phase: Verify ABI targets, symbol archival, 16KB alignment, and third-party SDK licensing.
Deployment Phase: Monitor crash rates, P95 tail latencies, and define hard rollback conditions.
At every phase, resolve the most critical unknown risk.
The Minimum Viable Release Gateway
If a team is deploying their inaugural NDK module, enforce this absolute minimum gateway:
Compilation succeeds for both `arm64-v8a` and `x86_64`.
Smoke tests execute without failure.
The Player handles Create, Play, Pause, Seek, and Destroy flawlessly.
Thread destruction is proven deterministic upon `release`.
`ndk-stack` successfully symbolicates a synthetic crash.
All `.so` binaries pass 16KB page alignment validation.
Unstripped symbol files are archived.
As engineering maturity grows, introduce Sanitizer coverage, `simpleperf` regressions, and strict dependency auditing.
The Rollback Contract
Native modules demand predefined, immutable rollback parameters.
Rollback if the Native Crash Rate exceeds the historical baseline by 2x.
Rollback if crash telemetry aggressively spikes on a single target ABI.
Rollback if P95 Time-to-First-Frame degrades by >30%.
Rollback if Seek-induced Fatal Errors breach the acceptable threshold.
Rollback if third-party SDK initialization failures cascade.
A rollback is not a systemic failure; it is the correct execution of a defensive engineering protocol. The prerequisite to shipping code is the absolute certainty of knowing exactly when to pull it back.
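The two conditions above that carry explicit numbers (a crash rate above 2x baseline and a P95 Time-to-First-Frame degradation above 30%) can be encoded so the decision is mechanical rather than debated during an incident. This is a minimal sketch; the struct and function names are hypothetical, and the remaining conditions are left out because the text does not give thresholds for them.

```cpp
#include <cassert>

// Metrics for one release, as reported by telemetry (field names are assumptions).
struct ReleaseMetrics {
    double nativeCrashRate;  // e.g. crashes per session
    double ttffP95Ms;        // P95 Time-to-First-Frame in milliseconds
};

// Thresholds taken from the rollback contract above.
struct RollbackPolicy {
    double crashRateMultiplier = 2.0;   // roll back above 2x baseline
    double ttffDegradationLimit = 0.30; // roll back above +30%
};

bool shouldRollback(const ReleaseMetrics& baseline, const ReleaseMetrics& current,
                    const RollbackPolicy& policy = RollbackPolicy{}) {
    assert(baseline.nativeCrashRate >= 0.0 && baseline.ttffP95Ms > 0.0);
    const bool crashSpike =
        current.nativeCrashRate > baseline.nativeCrashRate * policy.crashRateMultiplier;
    const bool ttffRegressed =
        current.ttffP95Ms > baseline.ttffP95Ms * (1.0 + policy.ttffDegradationLimit);
    return crashSpike || ttffRegressed;
}
```

Fixing the policy values before the rollout starts keeps the thresholds from being renegotiated while telemetry is already degrading.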
Accountability and Ownership
Checklists require defined ownership to function.
Build Engineer: Accountable for ABIs, Symbol Archival, and 16KB Alignment.
Client Architecture Engineer: Accountable for JNI, Lifecycles, and State Machine integrity.
QA/Test Engineer: Accountable for Stress Testing, Compatibility, and Regression matrices.
Release Engineer: Accountable for Phased Rollout, Telemetry Observation, and Rollback execution.
Explicit accountability prevents diagnostic paralysis when a P0 incident occurs.
Conclusion
Production readiness for an NDK module is not defined by "does it play a video." It is defined by control over compatibility, JNI boundary safety, memory and thread lifetimes, crash observability, pipeline performance, and predictable delivery. A rigorous checklist converts chaotic deployment into a verifiable engineering process.