The Autoregressive Bottleneck
Modern mobile LLMs such as Llama 3.2, Gemini Nano, and Phi-3 generate text one token at a time. Each forward pass through a 3B-parameter model takes 40–80ms on an A17 Pro, so a 50-token response requires 2–4 seconds of pure inference time. The problem is architectural: autoregressive decoding is inherently sequential. You cannot compute token n+1 until you have token n.
Speculative decoding relaxes this constraint. The core insight: use a small, fast draft model to propose several tokens cheaply, then verify all of them in parallel in a single forward pass of the larger target model. When the draft matches what the target would have produced, which happens 60–80% of the time for coherent text, you get roughly 2–4 tokens per target model invocation instead of one. Latency drops by 40–60%, and with exact verification the output is identical to what the target model alone would generate.
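To see how an acceptance rate translates into tokens per verification pass, the sketch below uses the closed form (1 - a^(k+1)) / (1 - a) from the speculative decoding literature. That formula is not stated in this article, and it assumes every draft token is accepted independently with the same probability a. Real acceptance is correlated across positions, so treat the result as a rough estimate. The function name is illustrative.

```python
def expected_tokens_per_pass(acceptance: float, k: int) -> float:
    """Expected tokens emitted per target pass, assuming i.i.d. acceptance.

    Counts the accepted drafts plus the one token the target always
    contributes (the correction on a mismatch, or the bonus after k matches).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be in [0, 1]")
    if acceptance == 1.0:
        return float(k + 1)
    return (1.0 - acceptance ** (k + 1)) / (1.0 - acceptance)


# Print the estimate for a few acceptance rates and draft lengths.
for a in (0.6, 0.7, 0.8):
    print(a, [round(expected_tokens_per_pass(a, k), 2) for k in (4, 8)])
```

Under this assumption, increasing k gives diminishing returns at a fixed acceptance rate, because later draft positions are only reached when every earlier one was accepted.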
Draft-Verify Architecture
The pipeline has three stages. First, the draft model—typically 100–300M parameters, quantized to 4-bit—generates k candidate tokens autoregressively. Common values: k=4 for chat, k=8 for code completion. This takes 10–20ms total on mobile hardware.
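Second, you construct a batch containing all k draft tokens and run a single forward pass through the target model. Because the target's logits at each position predict the following token, draft token i is checked against the logits produced at position i-1 (the first draft token is checked against the logits from the last accepted token). If the target's argmax matches the draft token, you accept it and continue. A relaxed rule, such as accepting any draft token in the target's top-5, raises the acceptance rate but no longer reproduces the target's output exactly. At the first mismatch, you reject the draft token, sample from the target's distribution at that position, and discard all subsequent drafts.
Third, you append the accepted tokens to the context and repeat. The key optimization: on-device decoding is limited by memory bandwidth, so batching the verification pass amortizes the cost of streaming the target model's weights. Instead of k sequential target model calls, each of which reads every weight once, you make one batched call with k sequence positions.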
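The three stages can be sketched with toy stand-ins for the models. This is a minimal illustration under loud assumptions: each "model" is a plain function that returns a greedy next token, the verification loop only emulates what a real target would compute in one batched pass, and all names (speculative_step, generate, toy_draft, toy_target) are hypothetical. It also emits the target's extra prediction when all k drafts match, a common refinement the text does not describe.

```python
from typing import Callable, List

# A "model" maps a token sequence to its greedy next token (toy stand-in).
NextToken = Callable[[List[int]], int]


def speculative_step(draft: NextToken, target: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-verify round and return the tokens to append."""
    # Stage 1: the draft proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # Stage 2: the target scores every prefix. A real model does this in ONE
    # batched forward pass; the list comprehension only emulates its output.
    target_choice = [target(context + proposed[:i]) for i in range(k + 1)]

    # Stage 3: keep drafts while they match the target's greedy choice.
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            # First mismatch: take the target's token, drop remaining drafts.
            accepted.append(target_choice[i])
            return accepted
        accepted.append(proposed[i])
    # All k drafts matched, so the target's extra prediction comes for free.
    accepted.append(target_choice[k])
    return accepted


def generate(draft: NextToken, target: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(draft, target, out, k))
    return out[:len(prompt) + max_new]


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except right after token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


print(generate(toy_draft, toy_target, [0], max_new=12))
```

With strict matching, the generated sequence is the same one plain greedy decoding with the target would produce; only the number of target passes changes.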
KV Cache Management
The KV cache is the memory bottleneck. For a 1B-parameter model with 16 layers, 16 heads, and head dimension 64, each token's KV cache entry is 2 × 16 × 16 × 64 × 2 bytes = 64KB (FP16). A 2048-token context requires 128MB just for the cache.
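The cache arithmetic above is easy to re-run for other model shapes. A small sketch with a hypothetical helper name follows; it takes the number of key/value heads as a parameter because models that use grouped-query attention store fewer KV heads than query heads, which shrinks the cache proportionally.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token (default 2 bytes per value = FP16)."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


per_token = kv_bytes_per_token(layers=16, kv_heads=16, head_dim=64)
total = per_token * 2048

print(per_token // 1024, "KiB per token")             # 64 KiB per token
print(total // (1024 * 1024), "MiB for 2048 tokens")  # 128 MiB for 2048 tokens
```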
Speculative decoding adds draft model cache (10–20MB) plus temporary target cache for the k candidate positions. The trick: preallocate a buffer sized for context_len + k, track how many positions are committed, and let the next round overwrite rejected drafts in-place. On iOS, use MTLHeap with explicit placement; on Android, manage ByteBuffer pools manually or use ONNX Runtime's memory arena.
During verification, you only compute queries, keys, and values for the draft tokens' positions. Those queries still attend over the full context, but the keys and values for the rest of the context are already cached. This is why batching matters: computing k positions in one pass is ~2× faster than k separate passes due to memory bandwidth and kernel launch overhead.
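The in-place overwrite of rejected drafts can be modeled with a committed-length cursor. The sketch below is a simplified assumption of how such a buffer might behave for a single layer: it stores opaque Python objects instead of tensors, and the class and method names are hypothetical rather than taken from any runtime.

```python
from typing import Any, List, Optional


class KVBuffer:
    """Preallocated cache slots for one layer, with a committed-length cursor."""

    def __init__(self, context_len: int, k: int) -> None:
        self.capacity = context_len + k
        self.slots: List[Optional[Any]] = [None] * self.capacity
        self.committed = 0  # number of positions holding accepted tokens

    def write_speculative(self, entries: List[Any]) -> None:
        # Draft positions go directly after the committed region, overwriting
        # whatever a previous rejected round left there.
        if self.committed + len(entries) > self.capacity:
            raise ValueError("not enough room for speculative entries")
        for offset, entry in enumerate(entries):
            self.slots[self.committed + offset] = entry

    def commit(self, n_accepted: int) -> None:
        # Advance the cursor past the accepted drafts only; rejected entries
        # stay in memory but are ignored and later overwritten.
        self.committed += n_accepted

    def valid(self) -> List[Any]:
        return self.slots[:self.committed]


buf = KVBuffer(context_len=4, k=3)
buf.write_speculative(["d1", "d2", "d3"])
buf.commit(2)                      # d3 was rejected
buf.write_speculative(["e1", "e2", "e3"])
buf.commit(1)
print(buf.valid())                 # ['d1', 'd2', 'e1']
```

Because rejection only moves a cursor, no memory is freed or reallocated between rounds, which is the property the fixed-size allocation is meant to provide.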
Draft Model Training
You have three options. First, distill the target model into a new, smaller architecture trained from scratch, using the target's logits as soft labels. This gives high alignment but requires a full training run. Second, use an existing small model from the same family (e.g., Llama 3.2 1B drafting for Llama 3.2 3B). Alignment is lower but you avoid training. Third, fine-tune an existing small model on the target's output distribution using KL divergence loss. Whichever you choose, the draft and target must share a tokenizer and vocabulary, because verification compares token IDs directly.
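For the first and third options, the training signal is a divergence between the target's and the draft's next-token distributions. A minimal pure-Python sketch of that per-position loss follows. A real training loop would use an autodiff framework, and the temperature parameter and function names are illustrative assumptions rather than details from this article.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    # Work in log space and subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def kl_divergence(target_logits: List[float], draft_logits: List[float],
                  temperature: float = 1.0) -> float:
    """KL(target || draft) for one position; 0 when the distributions match."""
    if len(target_logits) != len(draft_logits):
        raise ValueError("both models must share the same vocabulary size")
    log_p = log_softmax(target_logits, temperature)
    log_q = log_softmax(draft_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


print(kl_divergence([2.0, 0.5, 0.1], [2.0, 0.5, 0.1]))  # identical logits
print(kl_divergence([2.0, 0.5, 0.1], [0.1, 0.5, 2.0]))  # mismatched logits
```

Minimizing this quantity over positions pushes the draft toward the tokens the target would pick, which is what drives acceptance rate up.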
In production healthcare apps—where Omar has shipped on-device LLMs for clinical note generation—option two works well. The 1B model runs at 12ms/token on iPhone 15 Pro, drafts four tokens in 48ms, and the 3B verification pass takes 60ms. Net result: 108ms for ~3 accepted tokens versus 180ms for three sequential target calls. That's 40% faster with no quality loss.
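The arithmetic above generalizes to other timings. A small sketch, with hypothetical function and parameter names, that reproduces the 108ms versus 180ms comparison:

```python
def round_latency_ms(draft_ms_per_token: float, k: int, verify_ms: float) -> float:
    """Wall-clock cost of one draft-verify round."""
    return draft_ms_per_token * k + verify_ms


def latency_reduction(draft_ms_per_token: float, k: int, verify_ms: float,
                      target_ms_per_token: float, tokens_per_round: float) -> float:
    """Fraction of latency saved versus plain target-only decoding."""
    speculative = round_latency_ms(draft_ms_per_token, k, verify_ms)
    baseline = target_ms_per_token * tokens_per_round
    return 1.0 - speculative / baseline


# Figures from the text: 12 ms/token draft, k=4, 60 ms verify, ~3 tokens per round.
print(round_latency_ms(12, 4, 60))                    # 108
print(round(latency_reduction(12, 4, 60, 60, 3), 2))  # 0.4
```

With these timings the break-even point is 108 / 60 = 1.8 tokens per round: below that, a round costs more than generating the same tokens with the target alone.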
Acceptance Rate Tuning
Acceptance rate depends on draft quality and the verification threshold. Strict threshold: accept only if argmax(target_logits) == draft_token, which reproduces the target's greedy output exactly. Loose threshold: accept if the draft token falls inside the target's top-p nucleus, the smallest set of highest-probability tokens whose cumulative mass reaches p. A typical p=0.9 boosts acceptance from 55% to 75% with minimal quality impact, although the output is no longer guaranteed to match what the target alone would produce.
Monitor acceptance rate per request. If it drops below 40%, speculative decoding adds overhead: you're running two models for little gain. Fall back to standard decoding. In practice, acceptance correlates with prompt domain: high for in-distribution text (medical notes, code), low for creative fiction or multilingual mixing.
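The two thresholds can be written as small predicates over the target's next-token distribution. The sketch below represents a distribution as a plain dict from token ID to probability, which is an assumption for illustration. It also includes a third rule that this article does not cover: the rejection-sampling test from the speculative sampling literature, which accepts a drafted token with probability min(1, p_target / p_draft) and is the usual choice when decoding with sampling rather than argmax. All function names are hypothetical.

```python
import random
from typing import Dict

Dist = Dict[int, float]  # token id -> probability


def accept_strict(target: Dist, draft_token: int) -> bool:
    # Accept only the target's single most likely token.
    return max(target, key=target.get) == draft_token


def accept_top_p(target: Dist, draft_token: int, p: float = 0.9) -> bool:
    # Walk tokens from most to least likely; the nucleus closes once the
    # cumulative mass reaches p.
    mass = 0.0
    for tok, prob in sorted(target.items(), key=lambda kv: kv[1], reverse=True):
        if tok == draft_token:
            return True
        mass += prob
        if mass >= p:
            return False
    return False


def accept_rejection_sampling(target: Dist, draft: Dist, draft_token: int,
                              rng: random.Random) -> bool:
    # Assumption (not from the article): accept with prob min(1, p / q).
    q = draft.get(draft_token, 0.0)
    if q <= 0.0:
        raise ValueError("draft token must have nonzero draft probability")
    p = target.get(draft_token, 0.0)
    return rng.random() < min(1.0, p / q)


target = {1: 0.5, 2: 0.3, 3: 0.15, 4: 0.05}
print(accept_strict(target, 2))       # False
print(accept_top_p(target, 3, 0.9))   # True
print(accept_top_p(target, 4, 0.9))   # False
```

In the rejection-sampling scheme, a rejected position is resampled from the normalized positive part of (target - draft) instead of from the target directly; that step is omitted here for brevity.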
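One way to implement the monitoring and fallback described above is a sliding window over recent rounds. This is a sketch under assumptions: the window size is an arbitrary illustrative default, the 40% floor comes from the text, and the class name and methods are hypothetical.

```python
from collections import deque


class AcceptanceMonitor:
    """Tracks draft acceptance over the most recent rounds of one request."""

    def __init__(self, window: int = 32, min_rate: float = 0.40) -> None:
        self.history = deque(maxlen=window)  # (accepted, proposed) per round
        self.min_rate = min_rate

    def record(self, accepted: int, proposed: int) -> None:
        self.history.append((accepted, proposed))

    def rate(self) -> float:
        proposed = sum(p for _, p in self.history)
        # With no data yet, assume speculation is worthwhile.
        if proposed == 0:
            return 1.0
        return sum(a for a, _ in self.history) / proposed

    def use_speculation(self) -> bool:
        return self.rate() >= self.min_rate


monitor = AcceptanceMonitor(window=4)
for accepted in (1, 1, 2, 1):
    monitor.record(accepted, proposed=4)
print(monitor.rate(), monitor.use_speculation())  # 0.3125 False
```

Creating a fresh monitor per request keeps one difficult prompt from disabling speculation for the next one.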
Mobile Implementation Details
On iOS, use ONNX Runtime with the CoreML execution provider for the target model and the CPU execution provider for the draft. The draft model is small enough that CPU inference beats the dispatch overhead of the CoreML execution provider. Load both models at app launch; cold start is ~800ms for the pair.
Key code structure: maintain two Ort::Session instances. For each decode iteration, run the draft model in a loop accumulating tokens into a std::vector. Then construct an Ort::Value tensor with shape [1, k] containing the draft tokens, run target model inference, and take the argmax of each position's float32 logits with vDSP_maxvi, which returns the maximum value and its index using SIMD.
On Android, use ONNX Runtime's NNAPI execution provider for the target model if the device supports it (NNAPI requires Android 8.1 or later; enable it through OrtSession.SessionOptions.addNnapi()). Otherwise, fall back to XNNPACK. The draft model runs on CPU with XNNPACK. Allocate tensors in direct ByteBuffer instances to avoid JNI copy overhead.
Thread Management
Run draft generation on a background thread to avoid blocking the UI. The target verification pass should also be async, but use a higher-priority queue. In Flutter (used in several of Omar's shipped apps), spawn a worker isolate with Isolate.spawn and give it a SendPort to stream tokens back to the UI isolate; compute() returns a single result, so it suits one-shot calls rather than token streaming. For Swift, DispatchQueue.global(qos: .userInitiated) works well.
Critical: do not run speculative decoding in a tight loop without checking device state. On iOS, read ProcessInfo.processInfo.thermalState between iterations (or observe ProcessInfo.thermalStateDidChangeNotification) and reduce k or pause when it reaches .serious. On Android, check PowerManager.getCurrentThermalStatus() and reduce k if the device is throttling.
Benchmarking and Profiling
Measure three metrics: wall-clock latency (ms/token including draft and verify), acceptance rate (%), and energy per token (mJ). Use Xcode Instruments' Metal System Trace for GPU profiling and Energy Log for power. On Android, adb shell dumpsys batterystats gives per-app energy; use Android GPU Inspector for kernel timing.
Typical numbers on iPhone 15 Pro with Llama 3.2 1B→3B: standard decoding averages 62ms/token, speculative decoding averages 38ms/token (39% faster), acceptance rate 68%, energy cost increases 12% due to running two models. The energy trade-off is acceptable for latency-sensitive apps like real-time chat or voice assistants.
For offline batch jobs (e.g., summarizing multiple documents, processed one request at a time), speculative decoding shines: acceptance rate rises to 75–80% because the draft model sees more in-distribution text. Latency drops to 32ms/token. Energy cost is offset by shorter total runtime.
When to Skip Speculative Decoding
Three scenarios where it hurts. First, extremely short responses (under 10 tokens): the overhead of loading and running the draft model exceeds the benefit. Use a token count heuristic to decide at runtime. Second, highly creative sampling (temperature >1.2): draft acceptance rate plummets because the target model's distribution is flatter. Third, memory-constrained devices (under 4GB RAM): the combined model footprint plus dual KV caches can trigger system kills.
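The three skip conditions can be combined into one runtime gate. A minimal sketch follows; the thresholds are the ones given above, while the function name, the parameter names, and the idea of estimating response length ahead of time (for example from the request's token cap) are assumptions for illustration.

```python
def should_speculate(expected_tokens: int, temperature: float,
                     device_ram_gb: float) -> bool:
    """Return True when speculative decoding is likely to pay off."""
    if expected_tokens < 10:
        return False  # too short to amortize the draft overhead
    if temperature > 1.2:
        return False  # flat target distribution, drafts rarely accepted
    if device_ram_gb < 4:
        return False  # two models plus two KV caches risk a system kill
    return True


print(should_speculate(expected_tokens=200, temperature=0.7, device_ram_gb=8))  # True
print(should_speculate(expected_tokens=5, temperature=0.7, device_ram_gb=8))    # False
```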
In offline-first mobile apps—where Omar has implemented sync patterns for clinical and e-commerce use cases—speculative decoding pairs well with background prefetching. Preload the draft model during idle time, run speculative decoding when the user requests generation, and fall back to standard decoding if memory pressure rises.
Future Directions
Medusa decoding extends speculative decoding by attaching extra decoding heads to the target model itself, so it proposes a tree of candidate continuations in parallel instead of a single linear sequence from a separate draft model. This boosts acceptance rate to 85–90% but requires custom model architecture and training for the extra heads. Another approach: layer-skipping (self-speculative decoding), where you run only a subset of the transformer layers, for example every other one, for the draft pass. This avoids loading a separate model but acceptance rate is lower (~50%).
For mobile, the most promising direction is adaptive k: dynamically adjust the number of draft tokens based on recent acceptance rate and device thermal state. Start with k=6, drop to k=3 if acceptance falls below 50%, increase to k=8 if it exceeds 80%. This requires only a simple threshold controller, with hysteresis between the two bounds, tracking acceptance rate over a sliding window.
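The adaptive-k rule above can be sketched as a small threshold controller. The k values and acceptance bounds are taken from the text. The window size, the throttled flag standing in for device thermal state, and the class name are assumptions made for illustration.

```python
from collections import deque


class AdaptiveK:
    """Chooses the draft length from recent acceptance and thermal state."""

    def __init__(self, start_k: int = 6, low_k: int = 3, high_k: int = 8,
                 low_rate: float = 0.50, high_rate: float = 0.80,
                 window: int = 16) -> None:
        self.k = start_k
        self.low_k, self.high_k = low_k, high_k
        self.low_rate, self.high_rate = low_rate, high_rate
        self.history = deque(maxlen=window)  # (accepted, proposed) per round

    def update(self, accepted: int, proposed: int, throttled: bool = False) -> int:
        """Record one round and return the k to use for the next one."""
        self.history.append((accepted, proposed))
        total = sum(p for _, p in self.history)
        rate = sum(a for a, _ in self.history) / total if total else 0.0
        if throttled or rate < self.low_rate:
            self.k = self.low_k
        elif rate > self.high_rate:
            self.k = self.high_k
        # Between the two thresholds, keep the current k (hysteresis).
        return self.k


controller = AdaptiveK(window=2)
print(controller.update(3, 6))  # 6
print(controller.update(6, 6))  # 6
print(controller.update(6, 6))  # 8
print(controller.update(0, 8))  # 3
```

Keeping k unchanged between the two bounds avoids oscillating on every round when acceptance hovers near a single threshold.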