ONNX Runtime Mobile: Quantization vs Latency

Why ONNX Runtime on Mobile

When shipping machine learning models in mobile applications, developers face a trilemma: model accuracy, inference latency, and binary size. ONNX Runtime has emerged as a pragmatic solution for cross-platform deployment, offering a consistent runtime across iOS and Android without framework lock-in. Unlike Core ML, which runs only on Apple platforms, or TensorFlow Lite, which ties the pipeline to TensorFlow's converter and model format, ONNX Runtime provides a unified API surface and consistent optimization paths for models exported from most major training frameworks. This matters when your model pipeline spans computer vision, NLP, and audio processing in a single application.

The runtime supports multiple execution providers: the default CPU provider, XNNPACK for optimized CPU kernels, and platform accelerator providers such as Core ML on iOS and NNAPI on Android, which can dispatch work to the GPU or NPU. For most production mobile apps, CPU execution with aggressive quantization delivers the best balance of compatibility, battery life, and predictable latency. GPU acceleration introduces thermal throttling concerns and inconsistent performance across device generations. NPU support remains fragmented and often requires vendor-specific tuning.

Quantization Fundamentals

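Quantization reduces model precision from 32-bit floating point (FP32) to lower bit widths, typically 16-bit float (FP16) or 8-bit integer (INT8). The math is straightforward: an INT8 weight takes 1 byte instead of 4, so INT8 models consume 75% less weight memory than FP32, and each SIMD register on an ARM CPU holds four times as many values. Modern mobile SoCs also include dedicated INT8 dot-product instructions (SDOT and UDOT, available from Armv8.2-A) that can execute 4× to 8× faster than equivalent FP32 operations in isolated kernels. End-to-end gains are smaller, because memory traffic and operators that stay in floating point still take time.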
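
The size arithmetic can be checked with a few lines of Python. This is a minimal sketch that counts weight bytes only; it ignores the small overhead of stored scales and zero points, and the 5.4M parameter count is just an illustrative input.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e6


def savings_vs_fp32(precision: str) -> float:
    """Fraction of weight storage saved relative to FP32."""
    return 1 - BYTES_PER_WEIGHT[precision] / BYTES_PER_WEIGHT["fp32"]


# Illustrative 5.4M-parameter model: 21.6 MB (FP32), 10.8 MB (FP16), 5.4 MB (INT8).
sizes = {p: weight_size_mb(5_400_000, p) for p in BYTES_PER_WEIGHT}
```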

The challenge lies in maintaining accuracy. Quantization introduces rounding errors that compound through deep networks. Dynamic quantization computes activation scales at runtime from each input; static quantization pre-computes them during conversion. For static quantization, activation ranges must be calibrated using representative data. Skip this step and you'll see catastrophic accuracy drops. Static quantization is faster but requires a calibration dataset that mirrors production input distributions.
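
To make the difference concrete, here is a minimal pure-Python sketch of asymmetric (affine) INT8 quantization. It is not ONNX Runtime's implementation, and the function names are illustrative. The only difference between the static and dynamic paths is where the range comes from: calibration batches collected ahead of time, or the tensor seen at inference time.

```python
def compute_qparams(lo: float, hi: float, qmin: int = -128, qmax: int = 127):
    """Return (scale, zero_point) mapping the real range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain 0 so that 0.0 is exact
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero data
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Round each value to its nearest INT8 code, clamping out-of-range values."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in x]


def dequantize(q, scale, zero_point):
    """Map INT8 codes back to approximate real values."""
    return [(v - zero_point) * scale for v in q]


def calibrate_static(batches):
    """Static: fix the range once, from representative calibration batches."""
    values = [v for batch in batches for v in batch]
    return compute_qparams(min(values), max(values))


def calibrate_dynamic(x):
    """Dynamic: derive the range from the tensor seen at inference time."""
    return compute_qparams(min(x), max(x))
```

With a static range calibrated on values in [0, 1], an input of 2.0 is clamped to the largest code and its magnitude is lost. This is the failure mode that unrepresentative calibration data produces.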

Per-Channel vs Per-Tensor Quantization

Per-tensor quantization applies a single scale factor to an entire weight tensor. Per-channel quantization uses independent scales for each output channel. Per-channel adds minimal overhead—scales are cached and reused—but preserves more information in convolutional layers where channel magnitudes vary significantly. In a production object detection model tested across 10,000 images, per-channel INT8 quantization maintained 97.2% of FP32 mAP compared to 94.1% with per-tensor quantization. Inference latency was identical at 43ms per frame on an iPhone 12.
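
The effect is easy to reproduce on a toy weight matrix. The sketch below uses symmetric INT8 quantization in pure Python; the weights are made-up values chosen so that the two output channels differ in magnitude by about 100×, and the function names are illustrative.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to qmax."""
    peak = max(abs(v) for v in values)
    return peak / qmax if peak else 1.0


def fake_quantize(values, scale, qmax=127):
    """Quantize to INT8 and immediately dequantize, to expose the rounding error."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_abs_error(weights, per_channel):
    """Worst-case reconstruction error per channel for a [channels][weights] matrix."""
    shared = symmetric_scale([v for row in weights for v in row])
    errors = []
    for row in weights:
        scale = symmetric_scale(row) if per_channel else shared
        restored = fake_quantize(row, scale)
        errors.append(max(abs(a - b) for a, b in zip(row, restored)))
    return errors


# Two hypothetical output channels whose magnitudes differ by roughly 100x.
weights = [
    [0.90, -1.20, 0.45, 1.27],
    [0.011, -0.004, 0.009, -0.0127],
]
per_tensor = max_abs_error(weights, per_channel=False)
per_channel = max_abs_error(weights, per_channel=True)
```

With a shared scale, the small channel is squeezed into one or two integer levels, so its worst-case error is a large fraction of its own magnitude. With its own scale, the same channel uses the full INT8 range.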

Real-World Benchmarks

Testing was conducted on three representative models: a MobileNetV3 image classifier (5.4M parameters), a BERT-base sequence labeler (110M parameters), and a WaveNet audio generator (2.1M parameters). Hardware spanned iPhone 11 through iPhone 14 Pro and Samsung Galaxy S21 through S23 Ultra. Each configuration ran 1,000 inferences with cold-start and warm-cache scenarios.
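
A measurement loop for this kind of protocol might look like the sketch below. It is not the harness used for these numbers: `make_session` and `run_once` are hypothetical callables supplied by the caller, and an on-device version would be written in Swift or Kotlin with platform timers. The point is the structure, which times the first inference separately from the warm runs.

```python
import statistics
import time


def benchmark(make_session, run_once, runs=1000):
    """Time one cold start plus `runs` warm inferences, in milliseconds.

    `make_session` builds an inference session and `run_once` runs a single
    inference on it; both are hypothetical callables supplied by the caller.
    """
    start = time.perf_counter()
    session = make_session()
    run_once(session)  # first inference includes model load and initialization
    cold_ms = (time.perf_counter() - start) * 1000

    warm_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once(session)
        warm_ms.append((time.perf_counter() - start) * 1000)

    warm_ms.sort()
    return {
        "cold_ms": cold_ms,
        "mean_ms": statistics.fmean(warm_ms),
        "p95_ms": warm_ms[int(0.95 * (len(warm_ms) - 1))],
    }


def speedup_percent(baseline_ms, optimized_ms):
    """Latency reduction relative to the baseline, as a percentage."""
    return 100 * (baseline_ms - optimized_ms) / baseline_ms
```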

MobileNetV3 Image Classification

FP32 baseline: 28ms mean latency, 21MB model size. FP16 quantization reduced latency to 19ms (32% faster) and size to 11MB with zero measured accuracy loss on ImageNet validation. INT8 dynamic quantization achieved 14ms latency (50% faster) and 5.4MB size, but top-1 accuracy dropped 1.8 percentage points. INT8 static quantization with 5,000 calibration images recovered most accuracy—only 0.4 points below FP32—while maintaining the 14ms latency.

Memory pressure matters more than raw speed in production. The FP32 model triggered iOS memory warnings on devices with 2GB RAM when running alongside camera preview and UI rendering. The INT8 model eliminated these warnings entirely. Peak resident memory dropped from 340MB to 180MB.

BERT Sequence Labeling

FP32 baseline: 180ms for 128-token sequences, 438MB model. FP16 delivered 140ms (22% faster) and 220MB. INT8 static quantization achieved 95ms (47% faster) and 110MB. Accuracy on a named entity recognition task dropped from 94.3% F1 to 93.7% F1—acceptable for most applications. The latency improvement enabled real-time processing of transcribed speech in a production app, where FP32 introduced noticeable lag.

BERT's attention mechanism is particularly sensitive to quantization. Attention scores are unevenly distributed: after softmax, most values cluster near zero while a few dominate, and a single INT8 scale leaves too few levels to resolve the small ones. Quantization-aware training, which retrains the model with simulated quantization noise, improved INT8 F1 to 94.0%. This required only 2 epochs of fine-tuning on 50,000 labeled examples.

WaveNet Audio Synthesis

FP32 baseline: 420ms to generate 1 second of 16kHz audio, 8.4MB model. FP16 reduced latency to 310ms with imperceptible quality loss (PESQ score 4.41 vs 4.43). INT8 quantization failed catastrophically—output was unrecognizable noise. Audio models have wide dynamic ranges and accumulate quantization errors rapidly through autoregressive generation. Hybrid quantization—FP16 for recurrent layers, INT8 for convolutional layers—achieved 290ms latency and maintained PESQ 4.38.

Implementation Patterns

ONNX Runtime's C++ API requires careful memory management. On iOS, wrapping the runtime in Swift with automatic lifetime management prevents common crash patterns. Android's JNI boundary introduces overhead—batch multiple inferences when possible to amortize JNI costs. Pre-allocate input and output tensors rather than creating them per inference; this eliminates 8-12ms of allocation overhead per call.

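// iOS Swift wrapper pattern, sketched against the ONNX Runtime Objective-C bindings.
// Tensor names and shapes are placeholders; verify API names against your installed version.
import Foundation
import onnxruntime_objc

final class ONNXInferenceSession {
    private let env: ORTEnv
    private let session: ORTSession
    private let inputBuffer: NSMutableData    // allocated once, reused on every call
    private let outputBuffer: NSMutableData
    private let inputTensor: ORTValue         // wraps inputBuffer without copying
    private let outputTensor: ORTValue        // wraps outputBuffer without copying

    init(modelPath: String, inputShape: [NSNumber], outputShape: [NSNumber]) throws {
        func bytes(_ shape: [NSNumber]) -> Int { shape.reduce(MemoryLayout<Float>.stride) { $0 * $1.intValue } }
        env = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
        session = try ORTSession(env: env, modelPath: modelPath, sessionOptions: nil)
        inputBuffer = NSMutableData(length: bytes(inputShape))!
        outputBuffer = NSMutableData(length: bytes(outputShape))!
        inputTensor = try ORTValue(tensorData: inputBuffer, elementType: ORTTensorElementDataType.float, shape: inputShape)
        outputTensor = try ORTValue(tensorData: outputBuffer, elementType: ORTTensorElementDataType.float, shape: outputShape)
    }

    func infer(data: [Float]) throws -> [Float] {
        precondition(data.count * MemoryLayout<Float>.stride == inputBuffer.length, "input size mismatch")
        // Overwrite the pre-allocated buffer instead of building a new tensor per call.
        data.withUnsafeBytes { inputBuffer.replaceBytes(in: NSRange(location: 0, length: $0.count), withBytes: $0.baseAddress!) }
        try session.run(withInputs: ["input": inputTensor], outputs: ["output": outputTensor], runOptions: nil)
        return (outputBuffer as Data).withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
    }
}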

Quantized models require matching input preprocessing. If calibration used normalized images in [0,1] range, production inputs must match exactly—switching to [-1,1] normalization will break INT8 inference. Store preprocessing parameters alongside the model file and validate them at load time.
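
One way to do this is a JSON sidecar written by the conversion script and checked when the model is loaded. The sketch below shows the idea in Python; the file naming scheme and the parameter keys are assumptions made for illustration, and the load-time check in a shipping app would be ported to Swift or Kotlin.

```python
import json
from pathlib import Path


def save_preprocessing(model_path, params):
    """Write preprocessing parameters to a JSON sidecar next to the model file."""
    sidecar = Path(model_path).with_suffix(".preproc.json")
    sidecar.write_text(json.dumps(params, sort_keys=True))
    return sidecar


def load_and_validate(model_path, expected):
    """Load the sidecar and fail fast if it disagrees with what the app will apply."""
    sidecar = Path(model_path).with_suffix(".preproc.json")
    stored = json.loads(sidecar.read_text())
    mismatched = {k: (stored.get(k), v) for k, v in expected.items() if stored.get(k) != v}
    if mismatched:
        raise ValueError(f"preprocessing mismatch (stored, expected): {mismatched}")
    return stored
```

Calling `load_and_validate("model.onnx", {"input_range": [-1.0, 1.0]})` against a sidecar that recorded `[0.0, 1.0]` raises immediately, which is preferable to silently degraded predictions.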

Choosing Your Quantization Strategy

Start with FP16 for all models. It's a safe default that rarely degrades accuracy and provides meaningful size and speed improvements. Profile your app under realistic conditions—background processes, thermal throttling, low battery mode. If latency remains problematic, move to INT8 with static quantization.

Invest in calibration data quality. Use production data samples, not validation sets. Calibration with 1,000 representative examples often outperforms 10,000 random samples. Monitor accuracy in production with shadow deployments—run quantized and full-precision models in parallel on a sample of traffic.
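
A shadow comparison can be as simple as the sketch below. The two model arguments are hypothetical callables that map an input to a predicted label; a real deployment would also need to handle logging, privacy, and the extra compute cost of running the full-precision model.

```python
import random


def shadow_compare(inputs, full_precision, quantized, sample_rate=0.05, seed=0):
    """Run both models on a random sample of traffic and report how often they agree.

    `full_precision` and `quantized` are hypothetical callables mapping an input
    to a predicted label; only the quantized result would be served to the user.
    """
    rng = random.Random(seed)
    sampled = agreed = 0
    for x in inputs:
        served = quantized(x)  # always computed: this is the production path
        if rng.random() < sample_rate:
            sampled += 1
            agreed += served == full_precision(x)
    return {"sampled": sampled, "agreement": agreed / sampled if sampled else None}
```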

For models under 10MB, quantization's size benefits may not justify the engineering overhead. For models over 50MB, quantization becomes essential for acceptable download sizes and on-device storage. A 200MB FP32 model will trigger App Store warnings and user complaints; a 50MB INT8 version ships cleanly.

Beyond Standard Quantization

Mixed-precision quantization keeps critical layers in FP16 while quantizing less sensitive layers to INT8. ONNX Runtime's quantization tooling supports this at node granularity: the nodes_to_quantize and nodes_to_exclude options select which nodes are converted. Identify sensitive layers through ablation studies: quantize one layer at a time and measure accuracy impact. In practice, early convolutional layers and final classification layers tolerate INT8 well; middle attention layers often require FP16.
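
The ablation loop can be sketched as follows. The `evaluate` callable is hypothetical: it stands in for whatever quantizes only the named layers, runs the evaluation set, and returns accuracy. The selection step assumes that per-layer accuracy drops add up, which is only a rough approximation.

```python
def rank_layer_sensitivity(layer_names, evaluate, baseline_accuracy):
    """Quantize one layer at a time and rank layers by the accuracy they lose.

    `evaluate(quantized_layers)` is a hypothetical callable that quantizes only
    the named layers, runs the evaluation set, and returns the resulting accuracy.
    """
    drops = {name: baseline_accuracy - evaluate({name}) for name in layer_names}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


def choose_fp16_layers(ranked, max_total_drop):
    """Keep the most sensitive layers in FP16 until the remaining drop fits the budget.

    Assumes per-layer drops are additive, which is only a rough approximation.
    """
    remaining = sum(max(drop, 0.0) for _, drop in ranked)
    keep_fp16 = []
    for name, drop in ranked:
        if remaining <= max_total_drop:
            break
        keep_fp16.append(name)
        remaining -= max(drop, 0.0)
    return keep_fp16
```

The resulting list is the set of nodes to leave out of INT8 conversion. Because the additivity assumption is crude, the final mixed-precision model should be evaluated as a whole before shipping.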

Dynamic quantization works well when activation ranges vary at inference time. If your model processes variable-length sequences, dynamic quantization recomputes activation scales for each input. Static quantization applies the scales fixed at calibration, so inputs that differ from the calibration set are clipped or lose resolution. The latency penalty of dynamic quantization, typically 10-15%, is often worth the flexibility.

Quantization-aware training remains the gold standard for accuracy preservation but requires access to training data and infrastructure. Fine-tuning a pre-trained model for 2-5 epochs with quantization simulation can recover most accuracy loss. Libraries like PyTorch's quantization toolkit and the TensorFlow Model Optimization Toolkit provide built-in support for quantization-aware training workflows.