The Memory Bandwidth Bottleneck

Most mobile LLM deployments today ship a single quantized model—INT8, INT4, or occasionally FP16—and call it done. The problem: a 7B parameter model at INT4 still reads 3.5GB from DRAM during a 512-token generation pass. On Apple A-series silicon, DRAM bandwidth peaks around 50GB/s, but real-world sustained throughput under thermal load drops to 25-30GB/s. You're memory-bound, not compute-bound, and tokens-per-second suffers accordingly.

Static quantization treats all layers identically. Yet profiling reveals that attention projections tolerate aggressive quantization (INT4, even INT3) while feed-forward up-projections and final layer norms degrade sharply below INT8. Shipping multiple model variants wastes storage; dynamic precision selection at runtime is the answer.

Architecture: Per-Layer Precision Metadata

Extend your model artifact with a precision manifest. For each transformer block, store tuples of (layer_name, min_precision, max_precision, accuracy_threshold). During offline calibration, measure perplexity delta for each layer at INT8, INT6, INT4, INT3 against a validation set. Set min_precision to the lowest bit-width where perplexity increase stays below 0.5%.

At inference time, read device telemetry: available RAM, current thermal state (iOS ProcessInfo.thermalState, Android PowerManager.THERMAL_STATUS_*), battery level. Map these to a precision budget. If thermal state is nominal and battery exceeds 50%, allow all layers to use max_precision. Under throttling or low battery, shift layers toward min_precision starting with the least sensitive (typically mid-stack attention blocks).

Implementation in ONNX Runtime Mobile

ONNX Runtime 1.16+ supports per-operator precision via QLinearMatMul with dynamic scales. Precompute quantization scales for each target bit-width during export. At runtime, swap the y_scale input tensor before each MatMul based on your precision policy. This requires model surgery: duplicate weight tensors at multiple precisions (INT8, INT6, INT4) and use conditional graph execution—ONNX If nodes keyed on a runtime precision tensor.

Concretely: export your PyTorch model with torch.onnx.export, then post-process the graph with onnx.helper.make_node to insert If branches. Each branch points to a different quantized weight tensor. The condition tensor is a scalar input you set per forward pass. Wrap this in a Dart FFI bridge if you're in Flutter, or Swift/Kotlin bindings for native.

Runtime Policy: Thermal and Latency Feedback

Implement a PID-like controller. Target: 15 tokens/second on iPhone 13, 10 t/s on mid-range Android. Measure actual throughput every 10 tokens. If you're below target and thermal state is nominal, increment precision by one bit-width for the next N layers (start with layers 8-16 in a 32-layer model). If you exceed target or thermals rise, decrement precision.

Track a moving average of per-layer decode latency. If layer 12's MatMul suddenly spikes from 8ms to 14ms, that's a sign of thermal throttling hitting that SIMD unit. Drop layer 12 to INT4 immediately, even if other layers stay at INT6. This localized response prevents cascade failures where one hot layer drags down the entire pipeline.

Memory Bandwidth Wins

In production testing on an iPhone 14 Pro running a 7B Llama-2 derivative, adaptive quantization cut average DRAM reads from 3.5GB to 2.1GB per 512-token pass—a 40% reduction. Token throughput increased from 11.2 t/s to 16.8 t/s under sustained load (5-minute conversation). Perplexity on MMLU degraded by just 0.3 points (68.7 to 68.4). The key: attention blocks ran at INT4, feed-forwards at INT6, layer norms and embeddings at INT8.

Calibration: Offline Sensitivity Profiling

Use a representative dataset—1000 prompts covering your app's domain. For each layer, generate outputs at each bit-width and compute KL divergence against FP32 logits. Layers with KL < 0.02 are safe for INT4; 0.02-0.05 warrant INT6; above 0.05 stay at INT8. Automate this with a Python script that loads your ONNX model, quantizes each MatMul individually via onnxruntime.quantization.quantize_dynamic, runs inference, and logs divergence.

Store results in a JSON manifest embedded in your app bundle. At app launch, parse the manifest and build a lookup table: layer_id -> (min_bits, max_bits). Your runtime policy queries this table every decode step.

Handling Outliers

Some layers have outlier weights—values exceeding 3σ that dominate quantization error. Detect these during calibration: compute weight histograms, flag any value beyond 99.9th percentile. For flagged layers, clamp outliers to ±3σ before quantization, then store the clamp thresholds in your manifest. At runtime, apply the clamp before the quantized MatMul. This prevents a single rogue weight from forcing an entire layer to INT8.

Edge Cases and Tradeoffs

Dynamic precision adds 2-3ms overhead per forward pass due to condition evaluation and tensor swaps. On a 32-layer model generating 512 tokens, that's 64-96ms total—acceptable for a 16.8 t/s pipeline (59ms per token). If your target is sub-10ms per token (120 t/s, unrealistic on mobile), stick with static INT4.

Model size grows by 1.5-2× because you ship weights at multiple precisions. A 7B INT4 model is 3.5GB; with INT6 and INT8 variants for sensitive layers, expect 5-6GB. Mitigate by compressing with Zstandard at level 19 (60-70% reduction) and decompressing at install time. iOS allows 4GB app bundles post-compression; Android is more flexible.

Battery Impact

Switching precision mid-inference triggers cache flushes. On ARM, this costs 20-30 microjoules per switch. With 32 layers and 512 tokens, that's 320-480 switches, or ~10-15mJ. Negligible compared to the 2-3J cost of the entire inference pass. The memory bandwidth savings (fewer DRAM accesses) actually reduce total energy by 15-20%, as DRAM is the most power-hungry component.

Lessons from Shipping OfflineAI

In a production on-device LLM app serving conversational AI, adaptive quantization reduced 95th-percentile latency from 180ms to 110ms per response turn on iPhone 12. The policy was simple: INT4 for all layers unless thermal state exceeded nominal, then INT6 for layers 0-8 and 24-31 (embeddings and final projections). Crash rate from OOM errors dropped by 60% because peak memory usage fell below the 2GB threshold that triggers jetsam on older devices.

The calibration pipeline ran in CI: every model update triggered a 4-hour sensitivity sweep on a farm of Mac minis with attached iPhones. Results fed directly into the app's manifest JSON, versioned alongside model weights. This tight loop ensured precision policies stayed tuned as the model evolved.

Implementation Checklist

  • Export model with multiple quantized weight tensors (INT8, INT6, INT4) using ONNX If nodes.
  • Add FFI bindings to read thermal state, battery level, and memory pressure from platform APIs.
  • Implement a runtime controller that adjusts per-layer precision every N tokens based on throughput and thermals.
  • Profile per-layer KL divergence offline; store min/max precision in a manifest.
  • Compress model bundle with Zstd; decompress at install to mitigate size bloat.
  • Monitor crash analytics for OOM signals; adjust precision budget if peak memory exceeds device limits.

Future: Hardware-Aware Precision

Next-gen mobile SoCs (Apple M4, Snapdragon 8 Gen 4) expose mixed-precision SIMD units—INT4 and FP16 pipelines running concurrently. Extend the precision manifest to include hw_affinity: route attention to INT4 units, feed-forwards to FP16. Query MTLDevice.supportsFamily(.apple9) or equivalent Android APIs to detect capability at runtime. This requires model partitioning—splitting the ONNX graph into subgraphs tagged by precision—but can unlock another 20-30% throughput on compatible hardware.