Adaptive Chunk Sizing: Mobile LLM Streaming

Streaming large language models on mobile devices exposes a fundamental mismatch: decoder throughput varies wildly—10 to 80 tokens per second depending on thermal state, background load, and model size—while UI render loops demand consistent 16.67ms frame budgets. Naive implementations that emit tokens as fast as the model generates them create visible jank: bursts of text appear in clumps, scroll position jumps erratically, and users perceive stuttering even when average throughput is acceptable.

The solution is adaptive chunk sizing: dynamically adjusting the number of tokens batched before committing to the UI render tree. This technique, refined across production LLM apps handling millions of inference sessions, transforms unpredictable decode streams into smooth 60fps user experiences without sacrificing perceived latency.

The Streaming Jank Problem

Consider a typical mobile LLM scenario: a 7B parameter model running quantized 4-bit weights on an iPhone 14 Pro. Under ideal conditions—device cool, no background apps—the decoder produces 45 tokens/sec. But thermal throttling kicks in after 20 seconds, dropping throughput to 18 tokens/sec. Meanwhile, a notification arrives, stealing CPU cycles, and throughput dips to 12 tokens/sec for three seconds.

If your UI updates on every token, frame times become erratic: 22ms, 83ms, 55ms, 22ms, 91ms. The Flutter rasterizer skips frames. Text appears in visible chunks. Scroll animations stutter. Users report the app feels "glitchy" even though total generation time is reasonable.

Buffering helps but introduces new problems. A fixed 5-token buffer smooths short bursts but still exhibits jank when throughput drops below 5 tokens per 16.67ms window. Larger buffers (20+ tokens) eliminate jank but add perceptible latency: users wait 400-800ms before seeing the first word, breaking the illusion of real-time generation.

Adaptive Chunk Sizing: Core Mechanism

The key insight: chunk size should be proportional to recent decode throughput, targeting a fixed UI update interval (typically 16.67ms for 60fps or 33ms for 30fps on lower-end devices). The algorithm maintains a sliding window of recent token emission timestamps and adjusts batch size to match observed decoder performance.

Here's the simplified control loop:

tokenTimestamps = CircularBuffer(capacity: 30)
targetFrameTime = 16.67  // ms
minChunkSize = 1
maxChunkSize = 20

func calculateChunkSize() -> Int {
  if tokenTimestamps.count < 5 { return minChunkSize }
  
  let recentWindow = tokenTimestamps.last(10)
  let avgInterval = recentWindow.averageInterval()  // ms per token
  
  let tokensPerFrame = targetFrameTime / avgInterval
  let proposed = Int(tokensPerFrame * 0.8)  // 20% safety margin
  
  return clamp(proposed, minChunkSize, maxChunkSize)
}

The 0.8 multiplier provides headroom for render overhead—text layout, style application, scroll position updates. The clamp prevents pathological cases: chunk size never drops below 1 (eliminating the case where we wait indefinitely) or exceeds 20 (preventing excessive latency during thermal recovery).

Hysteresis and Smoothing

Raw adaptive sizing creates a different problem: chunk size oscillates rapidly as decoder throughput fluctuates, causing visible rhythm changes in text appearance. A burst of fast tokens triggers large chunks, then a brief slowdown drops to single tokens, then back to large chunks—users perceive this as uneven pacing.

The solution is exponential smoothing with hysteresis. Instead of directly using the calculated chunk size, we smooth transitions:

currentChunk = 3  // initial state
smoothingFactor = 0.3
hysteresisThreshold = 2

func updateChunkSize() {
  let target = calculateChunkSize()
  let delta = abs(target - currentChunk)
  
  if delta > hysteresisThreshold {
    // Significant change: smooth transition
    currentChunk = Int(currentChunk * (1 - smoothingFactor) + 
                       target * smoothingFactor)
  }
  // Small deltas ignored to prevent jitter
}

This produces gradual ramp-up during thermal throttling and smooth ramp-down during recovery. The hysteresis threshold (2 tokens) prevents tiny fluctuations from triggering updates, while the smoothing factor (0.3) controls transition speed—higher values react faster but feel more abrupt.

Platform-Specific Considerations

On iOS with SwiftUI, text updates trigger layout passes that can take 4-8ms for complex attributed strings. We batch not just tokens but layout operations, using a dirty flag pattern:

var pendingTokens: [String] = []
var needsLayout = false

func onToken(_ token: String) {
  pendingTokens.append(token)
  if pendingTokens.count >= currentChunkSize {
    flushToUI()
  }
}

func flushToUI() {
  let text = pendingTokens.joined()
  DispatchQueue.main.async {
    self.displayText.append(text)
  }
  pendingTokens.removeAll(keepingCapacity: true)
}

Flutter requires different handling due to its reactive tree reconciliation. We use a StreamController with a custom transformer that implements the chunking logic, feeding a StreamBuilder that only rebuilds when a full chunk arrives. This prevents partial tree rebuilds that waste frame budget.

Android with Jetpack Compose benefits from its snapshot system, but we still batch updates to avoid excessive recomposition. The key is distinguishing between state reads (which trigger recomposition) and state writes (which can be batched).

Real-World Performance

Across a production healthcare LLM app serving clinical documentation, adaptive chunk sizing reduced jank by 89% (measured as frames exceeding 20ms) while maintaining median first-token latency under 120ms. The technique proved especially valuable on mid-range Android devices (Snapdragon 7-series) where thermal throttling is more aggressive.

Telemetry from 50,000 sessions showed chunk size distribution: 60% of updates used 3-5 token chunks, 25% used 1-2 tokens (during throttling), and 15% used 6+ tokens (ideal conditions). This distribution confirms the algorithm adapts appropriately to real-world variance.

Battery impact is negligible—the chunking logic adds