Autoregressive LLM inference on mobile devices suffers from sequential token generation: each token requires a full forward pass through the model. With 7B parameter models now shipping on flagship phones, users expect sub-500ms time-to-first-token and smooth streaming. Speculative decoding, where a small draft model proposes multiple tokens that a larger target model accepts or rejects in parallel, can deliver 2-3× faster generation while leaving the target model's output distribution unchanged. Here's how to implement it for production mobile apps.

The Sequential Bottleneck

Standard autoregressive sampling generates one token per inference pass. For a Llama-2-7B quantized to 4-bit on an iPhone 15 Pro, each token costs roughly 45ms of NPU time. A 50-token response takes 2.25 seconds of pure compute, ignoring memory bandwidth stalls. The problem: each token depends on all previous tokens through the KV cache, creating a hard sequential dependency.

Batch processing doesn't help for single-user chat—you can't parallelize a sequence that hasn't been generated yet. Speculative decoding breaks this by guessing multiple future tokens, then verifying them in one pass.

Draft-Then-Verify Architecture

The pattern uses two models: a fast draft model (typically 1-2B parameters, quantized to 4-bit) and the target model (7B+). The draft model generates k candidate tokens autoregressively. The target model then evaluates all k candidates in a single forward pass using parallel attention, accepting a prefix of correct tokens and rejecting the rest.

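Key insight: the target model can score k drafted tokens in one forward pass for roughly the cost of generating a single token, so speculation pays off when the draft model is 3-4× faster than the target. Acceptance rate determines net speedup. Because verification stops at the first rejection, the accepted tokens always form a prefix of the draft. If each draft token is accepted independently with probability α, one target call yields (1 - α^(k+1)) / (1 - α) tokens on average, counting the one token the target itself contributes. With α = 0.6 and k = 5, that is about 2.4 tokens per target model call instead of 1.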
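To get a feel for how acceptance rate and draft cost interact, here is a small calculator. It is a sketch under a simplifying assumption: each draft token is accepted independently with the same probability, and one verification pass costs about the same as one ordinary target step. The function names and the `costRatio` parameter are illustrative, not taken from any library.

```swift
import Foundation

/// Expected tokens committed per target-model call when each draft token is
/// accepted independently with probability `alpha` and `k` tokens are drafted.
/// Includes the one token the target always contributes (correction or bonus).
func expectedTokensPerCall(alpha: Double, k: Int) -> Double {
    precondition(alpha >= 0 && alpha <= 1 && k >= 0)
    if alpha == 1 { return Double(k + 1) }
    return (1 - pow(alpha, Double(k + 1))) / (1 - alpha)
}

/// Estimated speedup over plain decoding. `costRatio` (hypothetical parameter)
/// is the draft model's per-token latency divided by the target's.
func estimatedSpeedup(alpha: Double, k: Int, costRatio: Double) -> Double {
    // One round costs k draft steps plus one target pass, in target-pass units.
    let roundCost = Double(k) * costRatio + 1
    return expectedTokensPerCall(alpha: alpha, k: k) / roundCost
}

let tokensPerCall = expectedTokensPerCall(alpha: 0.6, k: 5)
print(String(format: "%.2f tokens per target call", tokensPerCall))
```

Sweeping `costRatio` and `k` through `estimatedSpeedup` before committing to a draft model is a cheap sanity check, since the estimate is sensitive to both.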

Token Acceptance Logic

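For each draft token t_i, in order, compare the target model's probability P_target(t_i) with the draft's P_draft(t_i). If P_target(t_i) >= P_draft(t_i), accept the token; otherwise accept it with probability P_target(t_i) / P_draft(t_i). Stop at the first rejection, so the accepted tokens always form a prefix. On rejection, sample a replacement from the residual distribution max(0, P_target - P_draft), renormalized to sum to 1. If all k tokens are accepted, sample one extra token from the target's distribution at position k+1. Together these steps make the output follow the target distribution exactly, rather than merely staying plausible under it.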

In practice, temperature and top-p sampling complicate this. For mobile, we quantize both distributions to 8-bit and use SIMD comparisons on ARM NEON. Rejection sampling from the residual adds 2-3ms overhead but preserves statistical properties.
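The per-position check can be sketched as follows. This is a minimal floating-point version of the standard speculative-sampling rule (accept with probability min(1, P_target / P_draft), otherwise resample from the normalized residual), not the quantized NEON path described above. The `Verdict` type and `verify` function are illustrative names, and the random numbers are passed in so the logic can be tested deterministically.

```swift
/// Outcome of verifying one draft token.
enum Verdict: Equatable {
    case accept
    case reject(replacement: Int)
}

/// Speculative-sampling check for a single position.
/// `pTarget` and `pDraft` are full next-token distributions over the same
/// vocabulary; `u` and `v` are uniform random numbers in [0, 1).
func verify(token: Int, pTarget: [Double], pDraft: [Double],
            u: Double, v: Double) -> Verdict {
    // Accept with probability min(1, pTarget / pDraft).
    if u * pDraft[token] < pTarget[token] { return .accept }

    // Rejected: sample from the normalized residual max(0, pTarget - pDraft).
    let residual = zip(pTarget, pDraft).map { max(0, $0 - $1) }
    let total = residual.reduce(0, +)
    var threshold = v * total
    for (index, mass) in residual.enumerated() {
        threshold -= mass
        if threshold < 0 { return .reject(replacement: index) }
    }
    // Floating-point fallback: return the last token with residual mass.
    return .reject(replacement: residual.lastIndex(where: { $0 > 0 }) ?? token)
}

// Example call with fresh random numbers.
let verdict = verify(token: 1,
                     pTarget: [0.5, 0.3, 0.2],
                     pDraft: [0.2, 0.7, 0.1],
                     u: Double.random(in: 0..<1),
                     v: Double.random(in: 0..<1))
print(verdict)
```

A rejected token can never be drawn again as its own replacement, because the residual is zero wherever the draft assigned more probability than the target.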

Mobile-Specific Optimizations

Shipping this on iOS and Android requires addressing memory pressure, thermal throttling, and heterogeneous compute.

Unified KV Cache Management

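Each model keeps its own KV cache, and the two are allocated side by side in Metal or Vulkan shared memory. The caches cannot be shared, because the draft and target have different layer counts and hidden sizes; what passes between the models is the drafted token IDs. The draft model appends keys/values for k speculative tokens to its cache; the target model appends its own keys/values for the same k positions during parallel verification. On rejection at position j, we truncate both caches and continue from the last accepted token. Cache size peaks at (context_len + k) × num_layers × hidden_dim × 2 × sizeof(fp16) per model. For Llama-2-7B (32 layers, hidden size 4096) with 4096 context and k=5, that is about 2.15GB for the target cache alone, which is too much on an 8GB phone that also holds both sets of weights. A target-cache budget around 180MB corresponds to roughly 340 tokens of fp16 context, so longer contexts need a quantized cache or a model with fewer KV heads. The draft model's cache scales the same way with its own, smaller dimensions.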

Memory-map both caches to avoid allocation churn. On iOS, use MTLResourceStorageModeShared for zero-copy access between CPU and GPU. Prefault pages during model load to avoid first-token jank.
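Plugging concrete dimensions into the cache-size formula is worth doing before fixing a memory budget. The helper below is a sketch that evaluates the formula directly; the Llama-2-7B shape (32 layers, hidden size 4096, full multi-head attention) is the published one, while the 512-token context is just an illustrative comparison point.

```swift
/// Bytes needed for one model's KV cache:
/// tokens × layers × hiddenDim × 2 (keys and values) × bytes per element.
func kvCacheBytes(contextLen: Int, k: Int, numLayers: Int, hiddenDim: Int,
                  bytesPerElement: Int = 2) -> Int {
    return (contextLen + k) * numLayers * hiddenDim * 2 * bytesPerElement
}

// Llama-2-7B shape with fp16 entries and k = 5 speculative slots.
let fullContext = kvCacheBytes(contextLen: 4096, k: 5, numLayers: 32, hiddenDim: 4096)
let shortContext = kvCacheBytes(contextLen: 512, k: 5, numLayers: 32, hiddenDim: 4096)

print("4096-token context: \(fullContext / 1_000_000) MB")
print("512-token context:  \(shortContext / 1_000_000) MB")
```

Passing `bytesPerElement: 1` models an 8-bit cache, which halves both figures.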

Draft Model Selection

The draft model must share the target's tokenizer and should be architecturally similar (similar attention patterns) but smaller. Distilled models work well: Llama-2-1.5B distilled from 7B achieves 68% token acceptance on conversational prompts. Alternatively, prune the target model to 25% of its parameters using magnitude pruning and fine-tune on the same dataset.

For on-device use, quantize the draft to 4-bit grouped quantization with 32-element blocks. This fits a 1.5B model in ~750MB before per-block scale overhead, leaving headroom for the target model and OS overhead on 8GB devices. Test acceptance rates after quantization: aggressive rounding can shift the draft's predictions away from the target's, lowering acceptance and hurting net speedup.
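The footprint estimate is easy to reproduce, including the per-block metadata that grouped quantization adds. This is a rough sketch: the 16-bit scale per block is an assumption, since the exact block layout depends on the quantization format in use.

```swift
/// Approximate weight storage for block-quantized weights, in bytes.
/// `scaleBits` is the per-block metadata size (assumed, format-dependent).
func quantizedWeightBytes(parameters: Double, bitsPerWeight: Double,
                          blockSize: Double, scaleBits: Double) -> Double {
    let effectiveBits = bitsPerWeight + scaleBits / blockSize
    return parameters * effectiveBits / 8
}

// 1.5B parameters, 4-bit weights, 32-element blocks.
let rawBytes = quantizedWeightBytes(parameters: 1.5e9, bitsPerWeight: 4,
                                    blockSize: 32, scaleBits: 0)
let withScales = quantizedWeightBytes(parameters: 1.5e9, bitsPerWeight: 4,
                                      blockSize: 32, scaleBits: 16)

print("weights only: \(Int(rawBytes / 1e6)) MB")
print("with fp16 scales: \(Int(withScales / 1e6)) MB")
```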

Thermal and Power Management

Speculative decoding increases instantaneous power draw (two models active) but reduces total energy per response. On Snapdragon 8 Gen 2, a 50-token response drops from 2.1s at 4.8W to 0.9s at 6.2W—net energy falls from 10.1J to 5.6J. The shorter burst keeps SoC temperature 3-4°C lower, delaying thermal throttling.

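Implement adaptive k: start with k=6 when cool, drop to k=3 above 42°C junction temp. On Android, read zone temperatures from /sys/class/thermal/ where the device permits it. iOS does not expose junction temperature: ProcessInfo.thermalState reports only four coarse levels (nominal, fair, serious, critical), so key k to those levels instead. Lower k reduces draft model overhead when the target is already throttled.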
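One way to express the thermal policy is a pair of pure functions, kept separate from the platform calls so they can be unit tested. This is a sketch: `ThermalLevel` is a local enum mirroring the four cases of iOS's `ProcessInfo.ThermalState`, the k values for the cool and hot ends come from the text, and the values for the fair and critical levels are assumptions.

```swift
/// Coarse thermal levels mirroring the cases of ProcessInfo.ThermalState.
/// On iOS, map ProcessInfo.processInfo.thermalState onto this enum.
enum ThermalLevel {
    case nominal, fair, serious, critical
}

/// Draft length for platforms that report only coarse thermal levels.
func draftLength(for level: ThermalLevel) -> Int {
    switch level {
    case .nominal:  return 6
    case .fair:     return 4   // assumed midpoint
    case .serious:  return 3
    case .critical: return 0   // assumed: skip drafting, decode with the target only
    }
}

/// Draft length for platforms that expose a temperature reading in °C.
func draftLength(junctionCelsius: Double) -> Int {
    return junctionCelsius > 42 ? 3 : 6
}

print(draftLength(for: .serious), draftLength(junctionCelsius: 38.5))
```

Re-evaluating the level once per generation round, rather than per token, keeps the polling overhead negligible.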

Acceptance Rate Engineering

Acceptance rate determines whether you gain 1.5× or 3× speedup. Empirical results from shipping apps:

  • Conversational chat (Llama-2-7B + 1.5B draft): 64% acceptance, 2.3× end-to-end speedup
  • Code completion (CodeLlama-7B + 1B draft): 71% acceptance, 2.6× speedup
  • Summarization (Mistral-7B + 2B draft): 52% acceptance, 1.8× speedup

Code and structured outputs have higher acceptance because syntax constrains the token space. Free-form creative writing is harder because draft models struggle with stylistic nuance. For medical or legal apps where accuracy is critical, tune k conservatively (3-4): a smaller k does not change what the target model outputs, but it limits the draft work wasted on rejections.

Dynamic k Tuning

Adjust k based on running acceptance rate. Track accepts/rejects over a 10-token window. If rate drops below 50%, decrement k; if above 75%, increment. This adapts to prompt difficulty—simple Q&A benefits from k=8, complex reasoning from k=3.

Store per-conversation state: some users ask predictable questions (high acceptance), others are adversarial (low). Persist k in the chat session metadata.
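The tuning rule above can be sketched as a small value type. The thresholds (50% and 75%), the 10-token window, and the 3 to 8 range come from the text; the starting value of 5 and the choice to clear the window after each adjustment (so k moves at most once per full window) are assumptions. `DraftLengthController` is an illustrative name.

```swift
/// Adjusts the draft length k from a sliding window of accept/reject outcomes.
struct DraftLengthController {
    private(set) var k: Int
    let minK: Int
    let maxK: Int
    let windowSize: Int
    private var window: [Bool] = []

    init(initialK: Int = 5, minK: Int = 3, maxK: Int = 8, windowSize: Int = 10) {
        self.k = initialK
        self.minK = minK
        self.maxK = maxK
        self.windowSize = windowSize
    }

    /// Record whether one draft token was accepted, then adjust k if needed.
    mutating func record(accepted: Bool) {
        window.append(accepted)
        if window.count > windowSize { window.removeFirst() }
        guard window.count == windowSize else { return }

        let rate = Double(window.filter { $0 }.count) / Double(windowSize)
        if rate < 0.5, k > minK {
            k -= 1
            window.removeAll()   // wait for a fresh window before adjusting again
        } else if rate > 0.75, k < maxK {
            k += 1
            window.removeAll()
        }
    }
}

var controller = DraftLengthController()
for _ in 0..<10 { controller.record(accepted: true) }
print(controller.k)
```

To carry the setting across sessions, it is enough to save `controller.k` in the chat metadata and pass it back as `initialK` when the conversation resumes.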

Implementation Sketch

Here's the core loop in Swift for iOS with CoreML models:

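// Sketch: KVCache (holding each model's cache), draftNext, targetDistributions, and
// sampleNext are hypothetical app-defined wrappers around MLModel.prediction(from:).
func speculativeGenerate(
  draft: MLModel,
  target: MLModel,
  prompt: [Int],
  k: Int,
  maxTokens: Int
) -> [Int] {
  var tokens = prompt
  var kvCache = KVCache()
  
  while tokens.count < maxTokens {
    // Draft k tokens, keeping the draft distribution at each position
    var draftTokens: [Int] = [], draftDists: [[Double]] = []
    for _ in 0..<k {
      let (token, dist) = draftNext(draft, tokens + draftTokens, &kvCache)
      draftTokens.append(token); draftDists.append(dist)
    }
    // Verify in one target pass: k + 1 distributions, one per position
    let targetDists = targetDistributions(target, tokens, draftTokens, &kvCache)
    var accepted = 0
    for (i, token) in draftTokens.enumerated() {
      // Accept with probability min(1, targetProb / draftProb)
      let targetProb = targetDists[i][token], draftProb = draftDists[i][token]
      if targetProb >= draftProb * Double.random(in: 0..<1) {
        accepted += 1
      } else { break }
    }
    // Commit accepted, sample residual on reject (draft is nil after a full
    // accept, in which case sampleNext draws a bonus token from the target)
    tokens += draftTokens[..<accepted]
    tokens.append(sampleNext(target: targetDists[accepted],
                             draft: accepted < k ? draftDists[accepted] : nil))
    kvCache.truncate(to: tokens.count)
  }
  return Array(tokens.prefix(maxTokens))
}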