Autoregressive Beam Search: Mobile ASR Decoding

On-device automatic speech recognition (ASR) demands a delicate balance: models must decode audio streams in real-time while running within the tight memory and compute budgets of mobile hardware. Greedy decoding—selecting the highest-probability token at each step—offers minimal overhead but often produces suboptimal transcriptions. Beam search, the standard remedy, explores multiple hypotheses in parallel but quickly exhausts mobile RAM when naively implemented. This article dissects a production-grade autoregressive beam search tailored for streaming ASR on iOS and Android, achieving 12-18% word error rate (WER) improvement over greedy decode while staying under 40MB incremental memory.

The Greedy Baseline and Its Limits

Most mobile ASR systems start with an encoder-decoder architecture: an acoustic encoder processes Mel spectrograms into frame-level embeddings, and an autoregressive decoder emits tokens one at a time, conditioned on prior outputs. Greedy decoding simply picks argmax(softmax(logits)) at each step. For a 16kHz audio stream chunked into 20ms frames, this means ~50 decode steps per second—manageable on modern ARM processors.

The failure mode appears in ambiguous acoustic contexts. Consider the phrase "recognize speech." If early frames suggest "wreck a nice," greedy decode commits immediately; the model never backtracks. Beam search mitigates this by maintaining k parallel hypotheses (the beam width), scoring each continuation and pruning low-probability branches.

Naive Beam Search: Memory Explosion

A straightforward beam search implementation duplicates decoder state for each hypothesis. With k=10 and a 128MB ONNX model holding ~4MB of mutable KV cache per sequence, you instantly need 40MB just for state tensors—before accounting for logit buffers, score heaps, or token histories. On a device running background apps and system services, this triggers memory warnings within seconds.

The second pitfall is branching factor. At each step, each of the k beams generates V candidate continuations (where V is vocab size, typically 1,000-5,000 for subword models). Scoring all k × V candidates and selecting the top k requires either a full sort (O(kV log kV)) or a min-heap (O(kV log k)). For k=10 and V=3,000, that's 30,000 comparisons per frame—at 50 frames/second, you're burning CPU on bookkeeping rather than inference.

Width-Constrained Streaming Architecture

The production solution imposes three constraints: fixed beam width, prefix sharing, and lazy expansion.

Fixed Beam Width with Circular State Pool

Instead of allocating KV cache dynamically, pre-allocate a pool of exactly k state tensors at initialization. Each tensor is reused across frames; when a hypothesis is pruned, its slot is immediately available for a new candidate. This caps memory at k × (model state size), independent of sequence length. For k=8 and 4MB per state, total overhead is 32MB—acceptable on devices with 4GB+ RAM.

Hypothesis scores and token sequences are stored separately in a compact struct: { score: f32, tokens: Vec<u16>, state_idx: u8 }. The state_idx field points into the circular pool. When a hypothesis branches, we copy its parent's state tensor to a free slot; when pruned, we mark the slot free. A simple free-list managed with a u8 bitmask tracks availability in O(1).

Prefix Sharing via Trie

Many hypotheses share long common prefixes—especially in the first few frames when the model is uncertain. Storing full token sequences for each beam wastes memory and complicates length normalization. A prefix trie collapses shared histories: each node holds a single token and a parent pointer. Hypotheses store only a leaf node ID; reconstructing the full sequence is a backward walk.

This cuts token storage from k × T (where T is sequence length) to roughly k + T unique nodes in the trie. For a 5-second utterance (~250 tokens) with k=8, this reduces token memory from 2KB to ~520 bytes. The trie also simplifies length-normalized scoring: you can cache cumulative log-probabilities at each node.

Lazy Expansion: Top-p Filtering Before Scoring

Rather than score all kV candidates, apply nucleus sampling (top-p) to each beam's logits before expansion. Sort the vocabulary by probability, accumulate until cumulative mass exceeds p=0.95, and discard the tail. This typically reduces candidates per beam from 3,000 to 20-50. Now you're scoring 8×40=320 candidates instead of 24,000—a 75× reduction in comparisons.

The top-k selection across all beams then uses a simple min-heap of size k. For each of the ~320 candidates, if its score exceeds the heap's minimum, pop the min and push the new candidate. Total complexity: O(kp̄ log k), where p̄ is the average top-p candidate count (~40). For k=8, this is ~1,280 comparisons per frame—manageable even on mid-tier ARM cores.

Length Normalization and EOS Handling

Raw beam search favors shorter hypotheses because log-probabilities are negative and accumulate. Length normalization divides cumulative score by sequence length (or length^α for tunable α). The trie makes this efficient: each node caches its depth, so normalization is a single division.

End-of-sequence (EOS) tokens require special handling. When a hypothesis emits EOS, it stops expanding but remains in the beam to compete with ongoing sequences. Maintain a separate completed heap; when a hypothesis emits EOS, move it there and free its state slot. At the end of the utterance (detected via silence or explicit stop signal), return the highest-scoring completed hypothesis. If the beam empties before EOS (all hypotheses pruned), fall back to the longest partial hypothesis—this handles truncated audio gracefully.

Streaming and Chunk Boundaries

Real-time ASR processes audio in ~500ms chunks to balance latency and context. Beam search complicates this: hypotheses span chunk boundaries, and you can't simply reset state every 500ms. Instead, maintain beam state across chunks. When a chunk completes, emit a partial transcript from the highest-scoring hypothesis, then continue decoding the next chunk with the same beam.

To prevent runaway memory from infinitely long streams, implement a sliding window: when the trie exceeds 2,000 nodes, prune the oldest 500 nodes (those farthest from any active hypothesis). This bounds memory growth to O(k + window size) while preserving recent context for accurate decoding.

Production Numbers

Deployed in a voice-enabled clinical documentation app, this architecture delivers:

WER improvement: 12-18% relative reduction vs. greedy decode on LibriSpeech test-clean (6.2% → 5.1%)
Latency: 480ms chunk latency + 60ms beam search overhead = 540ms end-to-end
Memory: 32MB incremental (8-beam, 4MB state each) + 2.1MB trie (avg. 1,800 nodes)
CPU: 18% single-core utilization on iPhone 13 (A15), 31% on Pixel 6 (Tensor G1)

The beam width is tunable: k=4 cuts memory to 16MB with only 8% WER improvement; k=16 reaches 20% improvement but doubles CPU load. In practice, k=8 hits the sweet spot for mobile deployment.

Failure Modes and Mitigations

Beam search can still fail when all hypotheses converge prematurely—e.g., if the first few frames strongly suggest a common word, all beams might commit to it, eliminating alternatives. Diversity-promoting techniques like sibling rivalry (penalizing hypotheses that differ only in the last token) help, but add complexity.

Another edge case: homophones. "Their" and "there" sound identical; beam search can't disambiguate without language model context. A shallow n-gram LM (3-5MB) can be fused into scoring with minimal overhead, boosting accuracy another 3-4% on conversational speech.

Takeaways for Mobile ML

Autoregressive beam search demonstrates a recurring mobile ML pattern: pre-allocate fixed budgets, share structure aggressively, and prune early. The circular state pool and prefix trie are general-purpose tools applicable to any autoregressive model—machine translation, code generation, even on-device LLM chat. The key insight is that beam search isn't inherently expensive; naive implementations are. With careful memory layout and lazy expansion, you can deploy sophisticated decoding strategies on resource-constrained devices without compromise.