The Problem: Autoregressive Bottleneck
Every token generated by a transformer language model depends on all previous tokens. This sequential dependency rules out generating tokens in parallel during inference: you must wait for token t before computing token t+1. On mobile devices with limited memory bandwidth and thermal constraints, this autoregressive bottleneck becomes painful: a 7B parameter model on iPhone 15 Pro generates roughly 12 tokens per second, or about 83ms per token, most of it spent waiting on memory transfers.
Standard optimizations—quantization, KV cache reuse, Flash Attention—address computation and memory, but none break the fundamental serial chain. Speculative execution does.
How Speculative Decoding Works
The core insight: use a small, fast draft model to propose several tokens cheaply, then verify them all with the full target model in a single forward pass. The draft still generates one token at a time; the parallelism comes from the target scoring every proposed position at once. If the draft guesses correctly, you've generated multiple tokens at the cost of one target-model invocation. If not, you discard the incorrect suffix and continue from the last valid token.
The Draft-Verify Cycle
For each iteration:
- Draft model autoregressively generates k candidate tokens (typically 4-6)
- Target model processes all candidates in one batched forward pass
- Compare draft and target logits at each position
- Accept tokens while probabilities align within threshold τ
- Resample the first divergent position from target distribution
- Discard remaining draft tokens, repeat
The target model sees the entire draft sequence as input, enabling parallel verification. A 160M draft model generating 5 guesses takes ~15ms; verifying them with the 7B target takes 85ms—but yields 3-4 valid tokens on average, amortizing the cost.
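The cycle above can be sketched in a few lines of Python. This is a minimal illustration, not the production implementation: `speculative_step`, `draft_dist`, and `target_dists` are hypothetical names, distributions are plain dicts, drafting is greedy, and "resampling" is simplified to taking the target's most likely token.

```python
from typing import Callable, Dict, List

Dist = Dict[int, float]  # token id -> probability


def speculative_step(
    context: List[int],
    draft_dist: Callable[[List[int]], Dist],
    target_dists: Callable[[List[int], List[int]], List[Dist]],
    k: int = 5,
    tau: float = 0.15,
) -> List[int]:
    """Run one draft-verify cycle and return the tokens to append."""
    # Step 1: the draft model proposes k tokens, one at a time (greedy here).
    drafts: List[int] = []
    draft_probs: List[float] = []
    for _ in range(k):
        dist = draft_dist(context + drafts)
        token = max(dist, key=dist.get)
        drafts.append(token)
        draft_probs.append(dist[token])

    # Step 2: a single target pass returns one distribution per draft position.
    targets = target_dists(context, drafts)

    # Steps 3-4: accept tokens while draft and target probabilities stay within tau.
    out: List[int] = []
    for i, token in enumerate(drafts):
        if abs(draft_probs[i] - targets[i].get(token, 0.0)) < tau:
            out.append(token)
            continue
        # Step 5: replace the first divergent token using the target distribution
        # (argmax stands in for sampling in this sketch).
        out.append(max(targets[i], key=targets[i].get))
        # Step 6: everything after the divergence is discarded.
        break
    return out
```

The caller appends the returned tokens to the context and repeats. A real implementation would sample rather than take the argmax, and would read `targets` from the logits of one batched forward pass.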
Choosing the Draft Model
The draft model must be architecturally compatible (same vocabulary, similar distribution) but dramatically smaller. Three practical approaches:
Distilled sibling: Train a 2-4 layer transformer on the target model's output distribution. Works well when you control both models. In production LLM apps like OfflineAI, we distilled Llama-7B into a 160M draft model with a 6.2 perplexity gap, small enough that 68% of 5-token drafts had ≥3 tokens accepted.
Early-exit layers: Use the target model's first N layers as the draft. Zero additional parameters, but requires architectural support for intermediate exits. Adds 12-18ms overhead for logit computation at layer 8 of a 32-layer model.
Quantized target: Run a 4-bit quantized version as draft, full-precision as target. Simpler pipeline, but draft quality suffers—acceptance rate drops to ~45% in our tests with Llama-2-7B.
Memory Overhead
The draft model's weights add 160-400MB depending on size. KV cache for draft tokens is negligible: 5 tokens × 4096 dims × 2 bytes × 32 layers comes to about 1.3MB for the keys alone, or roughly 2.6MB once the values are stored as well. Total overhead: ~200MB for a distilled draft, easily fitting in the memory budget of a 4GB+ device.
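The KV cache arithmetic is easy to reproduce. The helper below is a sketch with a hypothetical name (`kv_cache_bytes`); it assumes fp16 storage (2 bytes per value) and one vector of width `hidden_dim` per token per layer. The `include_values` flag makes explicit whether the count covers keys only or both keys and values.

```python
def kv_cache_bytes(
    tokens: int,
    hidden_dim: int,
    layers: int,
    bytes_per_value: int = 2,
    include_values: bool = True,
) -> int:
    """Bytes needed to cache `tokens` positions across all layers."""
    # One tensor (keys or values) across every layer.
    per_tensor = tokens * hidden_dim * bytes_per_value * layers
    # A cache that stores both keys and values needs twice that.
    return per_tensor * (2 if include_values else 1)


keys_only = kv_cache_bytes(5, 4096, 32, include_values=False)
keys_and_values = kv_cache_bytes(5, 4096, 32)
print(f"{keys_only / 1e6:.2f} MB keys only, {keys_and_values / 1e6:.2f} MB keys + values")
```

Either way the draft-token cache is a rounding error next to the draft weights.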
Acceptance Rate and Alignment
Not all draft tokens survive verification. Acceptance depends on distribution alignment between draft and target. We use the absolute difference between the normalized probabilities the two models assign to each drafted token:
accept_i = |p_draft(token_i) - p_target(token_i)| < τ
Setting τ = 0.15 balances acceptance rate and output quality. Lower thresholds preserve target distribution but accept fewer tokens; higher thresholds boost speed at the cost of subtle drift.
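To see how τ trades acceptance for fidelity, the sketch below counts how many leading draft tokens pass the test at different thresholds. The function name and the probability values are made up for illustration; they are not measurements.

```python
from typing import Sequence


def accepted_prefix_len(
    p_draft: Sequence[float], p_target: Sequence[float], tau: float
) -> int:
    """Number of leading draft tokens whose probabilities differ by less than tau."""
    n = 0
    for pd, pt in zip(p_draft, p_target):
        if abs(pd - pt) >= tau:
            break  # first divergent position ends the accepted prefix
        n += 1
    return n


# Illustrative per-token probabilities for one 5-token draft.
p_draft = [0.90, 0.80, 0.70, 0.60, 0.50]
p_target = [0.88, 0.72, 0.58, 0.35, 0.45]

for tau in (0.05, 0.15, 0.30):
    print(tau, accepted_prefix_len(p_draft, p_target, tau))
```

With these numbers the tightest threshold keeps one token, τ = 0.15 keeps three, and the loosest keeps all five, including a token the target rated far lower than the draft did.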
Empirical results from OfflineAI's on-device chat (iPhone 14 Pro, Llama-7B target, 160M draft, 5-token lookahead):
- Average acceptance: 3.2 tokens per cycle
- Wall-clock speedup: 2.1× (12 tok/s → 25 tok/s)
- 95th percentile latency: 140ms (down from 290ms)
- Perplexity increase: 0.03 (negligible)
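As a rough sanity check, these results can be compared with the per-cycle timings quoted earlier (~15ms to draft, 85ms to verify). The helpers below are a back-of-the-envelope sketch with hypothetical names, and they assume those timings apply to the same device and model pair as the measurements.

```python
def tokens_per_second(
    tokens_per_cycle: float, draft_ms: float, verify_ms: float, overhead_ms: float = 0.0
) -> float:
    """Throughput when each cycle yields `tokens_per_cycle` tokens."""
    cycle_ms = draft_ms + verify_ms + overhead_ms
    return 1000.0 * tokens_per_cycle / cycle_ms


def implied_overhead_ms(
    measured_tps: float, tokens_per_cycle: float, draft_ms: float, verify_ms: float
) -> float:
    """Per-cycle time not explained by drafting and verification."""
    measured_cycle_ms = 1000.0 * tokens_per_cycle / measured_tps
    return measured_cycle_ms - draft_ms - verify_ms


ideal = tokens_per_second(3.2, 15, 85)
overhead = implied_overhead_ms(25, 3.2, 15, 85)
print(f"ideal: {ideal:.1f} tok/s, implied overhead: {overhead:.0f} ms per cycle")
```

Under those assumptions the ideal rate is about 32 tok/s, so the measured 25 tok/s suggests roughly 28ms per cycle goes to work outside the two forward passes, such as cache management, dispatch, and sampling.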
Implementation Details for Mobile
Metal Shader Batching
Verifying 5 draft tokens requires a batched matmul with sequence length 6 (5 drafts + 1 context). Standard Metal Performance Shaders handle this, but custom kernels reduce overhead. We fused the embedding lookup and first attention layer into a single dispatch, cutting 8ms from the critical path.
Thermal Management
Speculative execution increases GPU utilization from 65% to 88%, raising SoC temperature by 4-6°C. On devices with aggressive thermal throttling (iPhone 13 series), this triggers frequency scaling after 90 seconds of continuous generation. Mitigation: adaptive lookahead—drop from 5 tokens to 3 when core temp exceeds 42°C, maintaining 1.6× speedup without throttling.
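A minimal sketch of the adaptive lookahead policy follows. The 5-token and 3-token settings and the 42°C trigger come from the text; the class name and the `resume_c` hysteresis threshold are assumptions added so the draft length does not flap around the trigger point. How the temperature reading is obtained is platform-specific and not shown.

```python
class AdaptiveLookahead:
    """Pick the draft length from the most recent temperature reading."""

    def __init__(
        self,
        normal_k: int = 5,
        reduced_k: int = 3,
        hot_c: float = 42.0,
        resume_c: float = 40.0,
    ) -> None:
        self.normal_k = normal_k
        self.reduced_k = reduced_k
        self.hot_c = hot_c        # from the text: shorten drafts above 42°C
        self.resume_c = resume_c  # hypothetical: restore only once below this
        self.reduced = False

    def update(self, core_temp_c: float) -> int:
        """Record a temperature reading and return the lookahead to use."""
        if core_temp_c > self.hot_c:
            self.reduced = True
        elif core_temp_c < self.resume_c:
            self.reduced = False
        # Between resume_c and hot_c the previous decision is kept.
        return self.reduced_k if self.reduced else self.normal_k


policy = AdaptiveLookahead()
for temp in (38.0, 43.0, 41.0, 39.0):
    print(temp, policy.update(temp))
```

Call `update` once per cycle, before drafting, and pass the result as the draft length.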
KV Cache Reuse
Accepted draft tokens must update the target model's KV cache. Since the target model processed them in parallel, their keys and values are already computed—store them directly. For rejected tokens, truncate the cache at the divergence point. Cache management adds ~2ms per cycle but eliminates redundant computation.
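The bookkeeping can be sketched with a toy single-layer cache. `KVCache` and `commit_verified` are hypothetical names, and keys and values are opaque Python objects here rather than GPU tensors; a real cache would hold one buffer per layer and truncate by adjusting a length counter instead of freeing memory.

```python
from typing import Any, List


class KVCache:
    """Toy cache holding one key and one value per cached position."""

    def __init__(self) -> None:
        self.keys: List[Any] = []
        self.values: List[Any] = []

    def __len__(self) -> int:
        return len(self.keys)

    def extend(self, keys: List[Any], values: List[Any]) -> None:
        self.keys.extend(keys)
        self.values.extend(values)

    def truncate(self, length: int) -> None:
        """Drop every position at index >= length."""
        del self.keys[length:]
        del self.values[length:]


def commit_verified(
    cache: KVCache, draft_keys: List[Any], draft_values: List[Any], n_accepted: int
) -> int:
    """Keep K/V for accepted drafts, discard the rest, return the new length."""
    base = len(cache)
    # The parallel verify pass already produced K/V for every draft position.
    cache.extend(draft_keys, draft_values)
    # Cut the cache back to the divergence point.
    cache.truncate(base + n_accepted)
    # The resampled token has no K/V yet; the next target pass computes it.
    return len(cache)
```

For example, with 10 cached positions and 3 of 5 drafts accepted, the cache ends the cycle at length 13.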
When Speculative Execution Fails
Not all workloads benefit equally:
Creative generation: High temperature (>0.9) and top-p sampling produce diverse outputs. Draft model alignment degrades—acceptance drops to 1.8 tokens per cycle, yielding only 1.3× speedup. Speculative execution shines at temperature ≤0.7.
Code completion: Extremely predictable. Acceptance rates hit 4.6 tokens per cycle with a well-tuned draft, achieving 2.8× speedup. Code was also among the early benchmarks for the technique: DeepMind's 2023 speculative sampling paper evaluated on HumanEval.
Multilingual models: Draft models trained primarily on English struggle with low-resource languages. Arabic text in our tests saw acceptance drop to 2.1 tokens—still a 1.5× win, but less dramatic.
Production Considerations
Deploying speculative execution in a mobile LLM app requires careful tuning:
Lookahead length: Start at 4 tokens. Longer drafts increase memory traffic and lower the fraction of drafted tokens that get accepted. Profile your specific model pair; the sweet spot is usually 4-6.
Fallback logic: If three consecutive cycles accept ≤1 token, disable speculation for 50 tokens. Pathological inputs (random noise, encoding errors) waste compute on useless drafts.
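The fallback rule maps onto a small state machine. The sketch below uses the thresholds from the text (three consecutive cycles with ≤1 accepted token, then a 50-token pause); the class and method names are hypothetical.

```python
class SpeculationFallback:
    """Pause speculation after repeated low-acceptance cycles."""

    def __init__(
        self, max_bad_cycles: int = 3, min_accepted: int = 1, cooldown_tokens: int = 50
    ) -> None:
        self.max_bad_cycles = max_bad_cycles
        self.min_accepted = min_accepted
        self.cooldown_tokens = cooldown_tokens
        self._bad_streak = 0
        self._cooldown_left = 0

    @property
    def enabled(self) -> bool:
        return self._cooldown_left == 0

    def record_cycle(self, accepted: int) -> None:
        """Call after each speculative cycle with the number of accepted tokens."""
        if accepted <= self.min_accepted:
            self._bad_streak += 1
        else:
            self._bad_streak = 0  # any good cycle resets the streak
        if self._bad_streak >= self.max_bad_cycles:
            self._cooldown_left = self.cooldown_tokens
            self._bad_streak = 0

    def record_plain_token(self) -> None:
        """Call after each token generated with speculation disabled."""
        if self._cooldown_left > 0:
            self._cooldown_left -= 1
```

The generation loop checks `enabled` before each step, runs a speculative cycle or a plain target step accordingly, and reports back through the matching `record_*` method.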
Battery impact: 88% GPU utilization drains battery 1.4× faster than standard inference. For battery-sensitive contexts, expose a user toggle or auto-disable below 20% charge.
Future Directions
Cascade drafting uses multiple draft models of increasing size—a 50M model generates 8 coarse guesses, a 160M model refines 4, then the target verifies 2. Early experiments show 2.6× speedup but require 3× memory for draft weights.
Tree-structured speculation explores multiple token branches in parallel, accepting the longest valid path. Increases acceptance to 4.1 tokens but requires custom attention masking—non-trivial on mobile GPUs.
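The custom masking for tree-structured speculation amounts to letting each draft token attend only to itself and its ancestors in the tree. A minimal sketch, assuming the tree is given as a parent-index list in which every parent appears before its children (`tree_attention_mask` is a hypothetical name; attention to the shared context is not shown):

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or -1 when node i
    hangs directly off the end of the context.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        # Walk from the node up to the root, marking each ancestor.
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# Node 0 is the first draft token; nodes 1 and 2 are alternative
# continuations of it; node 3 continues node 1.
for row in tree_attention_mask([-1, 0, 0, 1]):
    print(["x" if allowed else "." for allowed in row])
```

In the example, node 3 sees nodes 0, 1, and itself but not the sibling branch at node 2, which is what lets several candidate paths share one forward pass.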
Adaptive draft switching selects draft models based on input domain (code vs prose vs chat) using a 5ms classifier. Boosts average acceptance by 0.4 tokens across mixed workloads.
Measuring Success
Track these metrics in production:
- Acceptance rate: tokens accepted per cycle (target: >3.0)
- Wall-clock speedup: end-to-end latency improvement (target: >1.8×)
- Distribution drift: KL divergence between speculative and standard output (target: […] 120s)
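Distribution drift can be estimated per position from the next-token distributions of the two decoding modes. The sketch below computes KL divergence for one position; the function name, the argument order (standard output as the reference distribution), and the `eps` floor are assumptions for illustration.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            # Floor q to avoid division by zero when q assigns no mass.
            total += pi * math.log(pi / max(qi, eps))
    return total


standard = [0.70, 0.20, 0.10]
speculative = [0.65, 0.25, 0.10]
print(f"{kl_divergence(standard, speculative):.4f} nats")
```

Averaging this value over positions in a held-out prompt set gives a single drift number to track alongside acceptance rate and speedup.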
Speculative execution is not a silver bullet—it trades memory and power for latency. But for interactive mobile LLM apps where every 50ms matters, doubling throughput with negligible quality loss is transformative. The technique works today on any device with 4GB+ RAM and a Metal or Vulkan GPU, using off-the-shelf draft models or quick distillation runs.