Video streaming solved adaptive delivery more than a decade ago: HLS and DASH encode content at several bitrates (a bitrate ladder) and split each rendition into short segments, letting clients switch quality mid-stream as bandwidth fluctuates. Large language model inference on mobile devices faces a parallel challenge: token generation must adapt to available compute, thermal state, and battery headroom. Yet most mobile LLM implementations run at a fixed configuration until they crash or throttle. This article explores bitrate ladder patterns for token streaming: dynamically adjusting batch size, quantization precision, and context window to maintain target latency under changing conditions.
The Mobile Inference Envelope
A modern smartphone can sustain roughly 8–12 tokens per second from a 3B parameter model at 4-bit quantization before thermal limits kick in. Push harder and the SoC throttles within 90 seconds, dropping throughput by 40%. Meanwhile, battery drain scales superlinearly with clock frequency—running at peak performance costs 3× the energy per token compared to a moderate cadence. The naive approach treats inference as a binary on/off switch. The adaptive approach recognizes a continuum of operating points.
Consider a chat application where the user sends a 200-token prompt. The model must generate a 400-token response. At 10 tok/s, that's 40 seconds of continuous inference. If the device is already warm from prior use, thermal throttling will degrade the tail of the response. If battery is below 20%, aggressive inference drains the last reserves. A bitrate ladder strategy monitors these signals and adjusts generation parameters in real time.
Ladder Rung Definition
A bitrate ladder for LLM inference defines discrete operating modes, each with a quality-speed-power tradeoff. A three-rung ladder might look like:
- High: 4-bit quantization, 2048 context window, batch size 1, no speculative decoding. Target 10 tok/s, 2.8W sustained.
- Medium: 3-bit quantization, 1024 context window, batch size 1, temperature 0.8. Target 14 tok/s, 2.2W sustained.
- Low: 2-bit quantization, 512 context window, greedy decoding, pruned vocabulary (top 20K tokens). Target 20 tok/s, 1.6W sustained.
Each rung trades output quality for throughput. The high rung preserves model fidelity but risks thermal throttling. The low rung sacrifices nuance but keeps the UI responsive even on aging hardware or low battery. The key insight: users tolerate slightly less eloquent responses if they arrive without stutter.
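The ladder above can be written down as plain data. The following is a minimal sketch, not a prescribed schema: the `Rung` class, its field names, and the two helper functions are hypothetical, and the numbers are simply the example figures from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    name: str
    weight_bits: int         # quantization precision
    context_tokens: int      # context window size
    target_tok_s: float      # target throughput
    sustained_watts: float   # sustained power draw


# Ordered from highest quality to lowest (example figures from the text).
LADDER = (
    Rung("high", 4, 2048, 10.0, 2.8),
    Rung("medium", 3, 1024, 14.0, 2.2),
    Rung("low", 2, 512, 20.0, 1.6),
)


def generation_seconds(rung: Rung, tokens: int) -> float:
    """Time to generate `tokens` if the rung holds its target throughput."""
    return tokens / rung.target_tok_s


def energy_joules(rung: Rung, tokens: int) -> float:
    """Energy for the same generation at the rung's sustained power draw."""
    return rung.sustained_watts * generation_seconds(rung, tokens)


for rung in LADDER:
    print(f"{rung.name}: {generation_seconds(rung, 400):.1f} s, "
          f"{energy_joules(rung, 400):.1f} J for 400 tokens")
```

For the 400-token response discussed earlier, these targets work out to roughly 40 s and 112 J on the high rung, 28.6 s and 62.9 J on medium, and 20 s and 32 J on low, assuming each rung actually holds its target.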
Switching Triggers
Switching between rungs requires telemetry. On iOS, the ProcessInfo.processInfo.thermalState API exposes four states: nominal, fair, serious, critical. Android's PowerManager exposes a comparable thermal status through getCurrentThermalStatus() (API level 29 and later). A simple state machine monitors thermal state, battery level, and recent token latency:
    if thermalState >= .serious || batteryLevel < 15%:
        switch to Low
    elif averageLatency > targetLatency * 1.3:
        step down one rung
    elif averageLatency < targetLatency * 0.7 && thermalState == .nominal:
        step up one rung
Hysteresis prevents thrashing: a rung change requires three consecutive measurements crossing the threshold. The window size for averageLatency is 20 tokens: short enough to react, long enough to filter noise.
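The switching rules and the hysteresis requirement can be combined into one small controller. This is a sketch under stated assumptions: `RungController`, its method names, and the string thermal states are hypothetical stand-ins for platform values, battery level is assumed to be a fraction from 0 to 1, and the latency window is cleared after a switch (a design choice not specified in the text).

```python
from collections import deque

RUNGS = ["high", "medium", "low"]  # index 0 is the highest quality
THERMAL = {"nominal": 0, "fair": 1, "serious": 2, "critical": 3}


class RungController:
    def __init__(self, target_latency_ms, start=0, window=20, confirmations=3):
        self.target = target_latency_ms
        self.index = start
        self.latencies = deque(maxlen=window)  # rolling per-token latencies
        self.confirmations = confirmations
        self._pending = None   # rung index we are considering switching to
        self._streak = 0       # consecutive measurements that agreed

    def _desired(self, thermal, battery):
        lowest = len(RUNGS) - 1
        if THERMAL[thermal] >= THERMAL["serious"] or battery < 0.15:
            return lowest
        if not self.latencies:
            return self.index
        avg = sum(self.latencies) / len(self.latencies)
        if avg > self.target * 1.3:
            return min(self.index + 1, lowest)   # step down one rung
        if avg < self.target * 0.7 and thermal == "nominal":
            return max(self.index - 1, 0)        # step up one rung
        return self.index

    def observe(self, latency_ms, thermal, battery):
        """Feed one measurement; returns the name of the active rung."""
        self.latencies.append(latency_ms)
        desired = self._desired(thermal, battery)
        if desired == self.index:
            self._pending, self._streak = None, 0
        elif desired == self._pending:
            self._streak += 1
        else:
            self._pending, self._streak = desired, 1
        if self._pending is not None and self._streak >= self.confirmations:
            self.index = self._pending
            self._pending, self._streak = None, 0
            self.latencies.clear()  # old samples describe the previous rung
        return RUNGS[self.index]
```

With a 100 ms target, two slow measurements in a row leave the controller on its current rung; only the third consecutive one triggers the step down.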
Mid-Stream Quantization Swap
Switching quantization precision mid-generation is non-trivial. The naive approach re-loads the entire model at the new precision, costing 1–2 seconds and breaking output continuity. A better pattern: keep two model instances memory-mapped (4-bit and 2-bit), give them the same KV cache layout, and transfer only the cached attention states when switching. On a device with 6GB available RAM, a 3B model at 4-bit occupies ~1.8GB and the same model at 2-bit occupies ~1.1GB. Keeping both resident costs 2.9GB, which is acceptable if the app pre-warms both at launch.
When stepping down from high to low, the runtime copies the last 512 tokens of KV cache from the 4-bit instance to the 2-bit instance, then resumes generation. The model "forgets" earlier context but maintains coherence for the immediate conversation turn. Users perceive this as the model becoming slightly more terse, not as a hard reset.
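The tail copy can be sketched with ordinary lists standing in for tensors. Everything here is illustrative: `tail_of_cache` and the type aliases are hypothetical names, and the sketch assumes the two instances share a cache layout so entries can be moved without conversion.

```python
from typing import List, Tuple

# One cached (key, value) pair per token position; real entries are tensors.
KVEntry = Tuple[List[float], List[float]]
LayerCache = List[KVEntry]


def tail_of_cache(cache: List[LayerCache], keep_tokens: int = 512) -> List[LayerCache]:
    """Copy only the most recent `keep_tokens` positions of every layer."""
    if keep_tokens <= 0:
        return [[] for _ in cache]
    return [list(layer[-keep_tokens:]) for layer in cache]


# Example: a 2-layer cache holding 600 positions, trimmed for the 2-bit instance.
source_cache = [[([float(pos)], [float(pos)]) for pos in range(600)] for _ in range(2)]
target_cache = tail_of_cache(source_cache, keep_tokens=512)
print(len(target_cache[0]))  # number of positions handed to the 2-bit instance
```

A cache shorter than the limit is copied whole, so a switch early in a conversation loses nothing.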
Context Window Pruning
Reducing context window from 2048 to 512 tokens cuts KV cache size, and the memory traffic needed to read it for each new token, by 75%, and speeds attention computation roughly in proportion. The challenge: which tokens to discard? A sliding window keeps the most recent 512 tokens, losing long-range dependencies. A smarter heuristic preserves the system prompt (first 128 tokens) and the most recent user-assistant exchange (last 384 tokens), discarding the middle. This "bookend" strategy maintains instruction adherence while trimming filler.
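The bookend heuristic is easy to express over a list of token ids. The function name `bookend_prune` and its parameters are hypothetical; a real runtime would also have to keep position indices consistent after the cut, which this sketch ignores.

```python
def bookend_prune(tokens, head=128, tail=384):
    """Keep the first `head` and last `tail` tokens, dropping the middle."""
    if len(tokens) <= head + tail:
        return list(tokens)  # already fits, nothing to discard
    return list(tokens[:head]) + list(tokens[len(tokens) - tail:])


full_context = list(range(2048))       # stand-in token ids
pruned = bookend_prune(full_context)
print(len(pruned), pruned[127], pruned[128])
```

Here the pruned context holds 512 tokens: ids 0 to 127 from the system prompt, then ids 1664 to 2047 from the most recent exchange.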
For multi-turn conversations, the app can persist a compressed summary of pruned turns in a separate embedding vector (generated offline or via a smaller summarization model). When the user references something from earlier in the chat, the app retrieves the summary and injects it as a synthetic turn. This pattern mirrors how video encoders use I-frames and P-frames—full context at key moments, deltas elsewhere.
Vocabulary Subsetting
A full LLM vocabulary spans 32K–128K tokens. Mobile chat apps rarely need the long tail: obscure Unicode, legacy encodings, domain-specific jargon. For a 128K vocabulary, pre-filtering to the top 20K most frequent tokens shrinks the logits tensor from 128K floats to 20K, reducing softmax cost by about 6×. The tradeoff: the model cannot generate rare words and will substitute common synonyms.
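A minimal sketch of the idea, using plain lists: pick the most frequent token ids once, then normalize only over that subset at each step. The helper names are hypothetical, and the frequency counts are assumed to come from some corpus the app has profiled. In a real runtime the larger saving would likely come from slicing the output projection so the excluded logits are never computed at all.

```python
import math


def top_k_token_ids(frequencies, k):
    """Ids of the k most frequent tokens, given a count per token id."""
    ranked = sorted(range(len(frequencies)), key=lambda i: frequencies[i], reverse=True)
    return ranked[:k]


def restricted_softmax(logits, allowed_ids):
    """Softmax over the allowed ids only; returns {token_id: probability}."""
    peak = max(logits[i] for i in allowed_ids)            # for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in allowed_ids}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


frequencies = [50, 3, 40, 1, 30, 2]            # toy 6-token vocabulary
allowed = top_k_token_ids(frequencies, k=3)    # keeps ids 0, 2, 4
probs = restricted_softmax([1.0, 9.0, 2.0, 9.0, 3.0, 9.0], allowed)
print(sorted(probs))
```

Tokens outside the subset get no probability mass at all, which is why a rare word can never be sampled in this mode.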
In practice, this works well for conversational UI. A user asking "How do I sauté vegetables?" gets a response using "cook" instead of "sauté"—less precise but still useful. The app can flag vocabulary-limited mode with a subtle UI indicator (e.g., a "fast mode" badge) so users understand the tradeoff.
Backpressure and Token Buffering
When stepping down a rung, the model can generate tokens faster than the UI can comfortably lay them out. A naive implementation floods the main thread with layout updates, causing jank. The solution: a token buffer with backpressure. The inference thread writes tokens to a ring buffer; the UI thread reads at 60fps, rendering up to 3 tokens per frame. If the buffer fills (inference outpacing display), the inference thread blocks until space opens. This keeps the UI smooth and prevents memory bloat from unbounded queuing.
Conversely, when stepping up a rung, inference slows. The buffer drains, and the UI must handle empty reads gracefully. A placeholder animation (e.g., a pulsing cursor) signals that the model is thinking. The user perceives intentional pacing rather than a frozen app.
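Both directions can be handled by one bounded buffer. The sketch below uses Python's standard `queue.Queue` in place of a hand-written ring buffer; `TokenBuffer`, `read_frame`, and `demo` are hypothetical names, and the capacity of 4 is chosen only to make the blocking behavior visible.

```python
import queue
import threading
import time


class TokenBuffer:
    def __init__(self, capacity=64):
        self._q = queue.Queue(maxsize=capacity)

    def put(self, token):
        """Called by the inference thread; blocks while the buffer is full."""
        self._q.put(token)

    def read_frame(self, max_tokens=3):
        """Called once per UI frame; never blocks, may return an empty list."""
        frame = []
        for _ in range(max_tokens):
            try:
                frame.append(self._q.get_nowait())
            except queue.Empty:
                break
        return frame


def demo(total=10):
    buf = TokenBuffer(capacity=4)

    def produce():
        for i in range(total):
            buf.put(f"t{i}")  # stalls here whenever 4 tokens are waiting

    producer = threading.Thread(target=produce)
    producer.start()
    rendered = []
    while len(rendered) < total:
        rendered.extend(buf.read_frame())  # an empty frame means "show the cursor"
        time.sleep(0.001)                  # stand-in for the frame interval
    producer.join()
    return rendered
```

An empty list from `read_frame` is the cue to show the placeholder animation, while the blocking `put` is what keeps memory bounded when inference runs ahead.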
Telemetry and Rung Selection
Optimal rung selection depends on device capability. An iPhone 15 Pro can sustain the high rung indefinitely; an iPhone 12 mini throttles after 30 seconds. The app should profile the device on first launch: run a 10-second inference benchmark at each rung, measure sustained tok/s and power draw, then cache the results. Subsequent sessions use this profile to select the default rung and set switching thresholds.
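Once the benchmark results are cached, choosing a default can be a simple lookup. This sketch assumes a profile that maps rung names to measured sustained tok/s; the function name and the `tolerance` parameter (how far below target a device may fall and still qualify) are hypothetical.

```python
def pick_default_rung(profile, targets, tolerance=0.9):
    """Return the highest-quality rung whose measured tok/s is near its target.

    profile: {rung_name: measured sustained tok/s} from the first-launch benchmark
    targets: [(rung_name, target tok/s)] ordered from highest quality to lowest
    """
    for name, target in targets:
        if profile.get(name, 0.0) >= target * tolerance:
            return name
    return targets[-1][0]  # nothing qualifies: fall back to the lowest rung


TARGETS = [("high", 10.0), ("medium", 14.0), ("low", 20.0)]
older_phone = {"high": 6.5, "medium": 13.5, "low": 21.0}  # made-up measurements
print(pick_default_rung(older_phone, TARGETS))
```

With these made-up measurements the device misses the high target by a wide margin but reaches 13.5 of the 14 tok/s medium target, so medium becomes its default.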
Aggregate telemetry across users reveals device cohorts. A fleet of iPhone 13 devices might show that 80% can sustain medium rung for 2-minute conversations, while only 40% handle high rung without throttling. Product teams can use this data to set conservative defaults and offer a "performance mode" toggle for power users.
Lessons from Video Streaming
HLS taught us that users prefer smooth playback over peak quality. A video that stutters at 1080p is worse than one that plays fluidly at 720p. The same holds for LLM streaming: a response that arrives in 20 seconds at medium quality beats one that takes 45 seconds (with 15 seconds of thermal throttling) at high quality. The ladder pattern prioritizes consistency over peak performance.
Another lesson: segment boundaries matter. Video switches quality at keyframes to avoid artifacts. LLM streaming should switch rungs at sentence boundaries (detected via punctuation or a lightweight sentence segmenter). Mid-sentence switches create jarring tonal shifts—"The recipe requires you to carefully sauté" (high rung) → "the onions cook them until soft" (low rung, vocabulary-limited). Deferring the switch by 5–10 tokens to reach a period improves perceived quality.
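Deferring a switch to the next sentence boundary needs only a small amount of state. In this sketch the boundary test is a plain punctuation check rather than a real sentence segmenter, and `DeferredSwitch` with its 10-token cap is a hypothetical illustration of the 5–10 token deferral described above.

```python
SENTENCE_END = (".", "!", "?")


class DeferredSwitch:
    def __init__(self, max_defer_tokens=10):
        self.max_defer = max_defer_tokens
        self.pending = None   # rung waiting for a boundary
        self.waited = 0       # tokens emitted since the request

    def request(self, rung):
        """Ask for a rung change; it takes effect at the next boundary."""
        self.pending = rung
        self.waited = 0

    def on_token(self, token):
        """Call after each emitted token; returns the rung to apply now, or None."""
        if self.pending is None:
            return None
        self.waited += 1
        at_boundary = token.rstrip().endswith(SENTENCE_END)
        if at_boundary or self.waited >= self.max_defer:
            rung, self.pending = self.pending, None
            return rung
        return None


switch = DeferredSwitch()
switch.request("low")
for tok in [" the", " onions", " until", " soft.", " Then"]:
    print(repr(tok), switch.on_token(tok))
```

The cap matters for long sentences or lists: if no period arrives within the limit, the switch goes through anyway so thermal or battery pressure is not ignored.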
Real-World Impact
Implementing a three-rung ladder in a mobile chat app reduced thermal throttling incidents by 70% and improved P95 latency by 35% on older devices. Users on iPhone 11 and Pixel 5 saw response times drop from 50+ seconds to under 30 seconds for long-form answers. Battery drain per conversation decreased by 22% on average, as the app spent more time in medium/low rungs rather than thrashing at high rung until forced to throttle.
The pattern also unlocked new use cases. A healthcare app providing medication guidance could run high rung for critical queries (drug interactions) and low rung for general wellness tips, balancing accuracy with battery life during long shifts. An e-commerce assistant could step down to low rung during long, heavy-use sessions (Black Friday browsing) so the device keeps responding without overheating.
Implementation Notes
For Flutter apps, implement the ladder logic in Dart isolates to avoid blocking the UI thread. For Swift/SwiftUI, use a dedicated DispatchQueue with QoS .userInitiated. React Native apps should offload to a native module, since JavaScript cannot reliably measure thermal state or control quantization. Cross-platform runtimes like ONNX Runtime can swap models by holding a separate inference session per quantization level, but KV cache transfer requires custom bindings.
Memory-mapping both quantization levels at launch increases cold-start time by ~400ms. Mitigate this with lazy loading: start with medium rung, load high and low in the background. The first conversation runs at medium; subsequent ones can switch freely. This pattern mirrors how video players pre-buffer multiple quality levels after playback starts.
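The lazy-loading order can be sketched with a background thread. `LazyModelPool` and the `loader` callback are hypothetical: the loader stands in for whatever call actually memory-maps a model, and any rung that has not finished loading falls back to the one loaded first.

```python
import threading


class LazyModelPool:
    def __init__(self, loader, first="medium", others=("high", "low")):
        self._loader = loader
        self._first = first
        self._lock = threading.Lock()
        self._models = {first: loader(first)}  # loaded synchronously at launch
        self._thread = threading.Thread(
            target=self._load_rest, args=(others,), daemon=True
        )
        self._thread.start()

    def _load_rest(self, names):
        for name in names:
            model = self._loader(name)  # slow work happens outside the lock
            with self._lock:
                self._models[name] = model

    def get(self, name):
        """Return the requested rung's model, or the first rung if not ready."""
        with self._lock:
            return self._models.get(name, self._models[self._first])

    def wait_until_ready(self, timeout=None):
        self._thread.join(timeout)


pool = LazyModelPool(loader=lambda name: f"model:{name}")  # toy loader
pool.wait_until_ready()
print(pool.get("high"))
```

Because `get` never blocks, a switch requested before the background load finishes simply stays on the medium model instead of stalling generation.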
Future Directions
Next-generation mobile SoCs (A18, Snapdragon 8 Gen 4) include dedicated NPU blocks with dynamic voltage/frequency scaling. These chips can adjust power per layer rather than per model, enabling finer-grained ladders—e.g., run attention at high precision, FFN at low precision. The ladder abstraction extends naturally: define rungs as per-layer quantization profiles rather than monolithic model configs.
Another frontier: collaborative ladders. In a multi-device scenario (phone + watch + tablet), the phone runs high rung while offloading low-priority queries to the watch at low rung. The user sees instant responses on the phone, with background tasks (summarization, translation) completing on the watch. This mirrors CDN edge caching in video delivery—origin server (phone) handles quality, edge nodes (wearables) handle scale.