Variable-Rate Shaping: LLM Token Emission Control

Streaming large language models emit tokens at irregular intervals, sometimes bursting dozens of tokens in milliseconds, other times stalling for hundreds of milliseconds mid-sentence. This jitter creates a jarring user experience: words flicker onto screen in unreadable clumps, then freeze, destroying the illusion of fluid thought. Yet the model's average throughput stays roughly constant. The problem isn't compute; it's presentation.

Variable-rate shaping borrows from telecom traffic management to smooth token delivery without changing model throughput. By buffering and metering tokens between the inference engine and UI layer, we can transform erratic output into perceptually uniform streams—improving comprehension, reducing perceived latency, and enabling predictable animations. This article dissects three shaping strategies, their tradeoffs, and production implementation details from mobile LLM products handling millions of queries.

The Token Jitter Problem

Modern autoregressive LLMs generate one token per forward pass. On a Snapdragon 8 Gen 2 running a 7B parameter model via ONNX Runtime, median token latency hovers around 180ms. But variance is brutal: P50 might be 175ms, P95 reaches 420ms, and P99 spikes to 1.2 seconds when thermal throttling kicks in or the OS preempts the inference thread.
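To check whether a deployment shows a similar tail, per-token latencies can be summarized with a nearest-rank percentile. This is a minimal sketch: the `percentile` helper and the sample values are hypothetical and are not measurements from the device described above.

```typescript
// Nearest-rank percentile over a list of per-token latencies (ms).
// Helper name and sample data are hypothetical, for illustration only.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up latencies: mostly steady, with two slow outliers.
const latenciesMs = [170, 175, 180, 172, 178, 420, 176, 174, 1200, 179];

const p50 = percentile(latenciesMs, 50); // 176
const p95 = percentile(latenciesMs, 95); // 1200
```

Even in this toy sample the median barely moves while the tail is dominated by two stalls, which is the pattern naive streaming exposes to the user.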

Naive streaming—rendering each token immediately upon generation—exposes every hiccup. Users see:

Bursts of 3-5 tokens appearing simultaneously when the model catches up after a stall
Multi-second pauses mid-word if a cache miss or GC pause interrupts inference
Inconsistent reading rhythm that forces re-scanning of text

Eye-tracking measurements in a speech therapy app showed that users fixate 2.3× longer on jittery output than on shaped streams, which directly affects comprehension for children with reading delays.

Token Leaky Bucket: Constant-Rate Smoothing

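The simplest shaper is a leaky bucket: tokens flow into a FIFO queue at generation rate, then emit to the UI at a fixed target rate (e.g., one token every 120ms). Implementation needs a producer pulling from the model and a consumer pushing to the renderer, with a shared buffer between them. On native platforms these are typically two threads and a circular buffer; the TypeScript version here uses a plain array and a timer-driven tick.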

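type Token = { text: string }; // minimal shape; real tokens may carry more metadata

class TokenBucket {
  private queue: Token[] = [];
  private targetInterval = 120; // ms between emissions
  private lastEmit = 0;         // timestamp (ms) of the most recent emission

  // Producer side: called as each token arrives from the model.
  ingest(token: Token): void {
    this.queue.push(token);
  }

  // Consumer side: called on a timer or frame callback with the current time in ms.
  // Returns the next token to render, or null if nothing is due yet.
  tick(now: number): Token | null {
    if (now - this.lastEmit < this.targetInterval) return null;
    if (this.queue.length === 0) return null;
    this.lastEmit = now;
    return this.queue.shift() ?? null;
  }
}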

This approach eliminates micro-jitter but introduces head-of-line latency: the first token always waits the target interval, adding 120ms to time-to-first-token. For a chatbot responding to "What is 2+2?", users perceive a 120ms delay before seeing "The", even though the model generated it instantly.

Bucket depth matters. Too shallow (≤3 tokens) and bursts drain immediately, defeating smoothing. Too deep (≥15 tokens) and end-of-response latency balloons—users wait seconds after inference completes to see the final period. Production sweet spot: 6-8 token capacity, tuned per model speed.
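The end-of-response cost of a deep bucket is simple arithmetic: once inference finishes, whatever is still buffered leaves at one token per interval. A rough sketch of that estimate follows; the function name is hypothetical, and the 120ms interval is the example value used earlier.

```typescript
// Time needed to drain the buffer after generation has finished,
// assuming exactly one token leaves per target interval.
function tailLatencyMs(bufferedTokens: number, targetIntervalMs: number): number {
  return bufferedTokens * targetIntervalMs;
}

const shallow = tailLatencyMs(3, 120); // 360 ms
const tuned = tailLatencyMs(8, 120);   // 960 ms
const deep = tailLatencyMs(15, 120);   // 1800 ms
```

At 15 tokens the user waits almost two seconds for the final period, while an 8-token bucket keeps the tail under one second.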

Adaptive Rate Shaping: Dynamic Interval Adjustment

Instead of fixed 120ms intervals, adaptive shapers measure recent token generation rates and adjust emission speed to match. When the model accelerates (cache-hot, simple tokens), the shaper speeds up. When inference slows (complex reasoning, thermal throttle), the shaper slows down—always staying slightly behind generation to maintain a small buffer.

The control loop uses an exponential moving average of the last 8 token latencies:

targetInterval = 0.7 × EMA(generationLatency) + 0.3 × targetInterval

The 0.7 weight tracks generation speed quickly; the 0.3 damping prevents oscillation. A 5-token minimum buffer prevents underflow: if queue.length < 5, emission pauses until refilled.
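The update rule and the minimum-buffer check could be sketched as follows. This is an illustration under assumptions, not the production controller: the class and function names are hypothetical, and the EMA smoothing factor 2 / (N + 1) with N = 8 is one common convention for "the last 8 samples" that the text does not specify.

```typescript
// Sketch of the adaptive control loop. The smoothing factor 2 / (N + 1)
// with N = 8 is an assumed convention, not taken from the article.
class AdaptiveInterval {
  private ema: number | null = null;
  private readonly alpha = 2 / (8 + 1);

  constructor(public targetInterval = 120) {}

  // Call once per generated token with its generation latency in ms.
  observe(generationLatencyMs: number): number {
    this.ema = this.ema === null
      ? generationLatencyMs // seed the average with the first sample
      : this.alpha * generationLatencyMs + (1 - this.alpha) * this.ema;
    // Blend from the text: 0.7 tracks generation speed, 0.3 damps oscillation.
    this.targetInterval = 0.7 * this.ema + 0.3 * this.targetInterval;
    return this.targetInterval;
  }
}

// Emission is allowed only while the buffer holds at least minBuffer tokens.
function mayEmit(queueLength: number, minBuffer = 5): boolean {
  return queueLength >= minBuffer;
}

const ctl = new AdaptiveInterval(120);
const first = ctl.observe(200); // 0.7 * 200 + 0.3 * 120, about 176 ms
```

With a steady 200ms generation latency the interval converges toward 200ms within a handful of tokens. A real shaper would also need to release the last few buffered tokens once generation ends, since the minimum-buffer check alone would hold them back.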

This technique reduced P95 perceived latency by 34% in a clinical documentation assistant where physicians dictate notes. Adaptive shaping matched their speech cadence—fast during rote phrases, slower during complex medical terminology—creating a more natural dictation feel.

Phrase-Boundary Shaping: Semantic Chunking

Linguistic research shows humans parse text in prosodic phrases—roughly 4-8 words bounded by commas, periods, or natural pauses. Emitting tokens in phrase-aligned chunks improves readability over word-by-word drip.

A phrase-boundary shaper buffers tokens until detecting a boundary marker (comma, period, conjunction), then releases the entire phrase at once. Detection uses a small lookup table:

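const boundaries = new Set([',', '.', '!', '?', 'and', 'but', 'or']);
const MAX_PHRASE_TOKENS = 12; // hard cap on buffered tokens

// Runs after the incoming token has been pushed onto queue.
// Tokenizers often attach a leading space (" and"), so trim before the lookup.
if (boundaries.has(token.text.trim().toLowerCase()) ||
    queue.length >= MAX_PHRASE_TOKENS) {
  emitPhrase(queue);
  queue = [];
}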

The 12-token hard limit prevents runaway buffering on long sentences. Each phrase renders with a 40ms stagger between words—fast enough to feel instant, slow enough to create a reading rhythm.
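Putting the boundary check, the 12-token cap, and the 40ms stagger together, a self-contained sketch might look like this. The `PhraseShaper` class, the `staggerSchedule` helper, and the sample tokens are hypothetical, and the stagger is applied per token here as a simplification of "between words".

```typescript
type Token = { text: string };

const BOUNDARIES = new Set([",", ".", "!", "?", "and", "but", "or"]);
const MAX_PHRASE_TOKENS = 12; // hard limit from the text
const STAGGER_MS = 40;        // stagger from the text, applied per token here

class PhraseShaper {
  private queue: Token[] = [];

  // Returns a completed phrase, or null while still buffering.
  ingest(token: Token): Token[] | null {
    this.queue.push(token);
    // Assumption: tokens may carry leading whitespace, so trim before lookup.
    const key = token.text.trim().toLowerCase();
    if (BOUNDARIES.has(key) || this.queue.length >= MAX_PHRASE_TOKENS) {
      const phrase = this.queue;
      this.queue = [];
      return phrase;
    }
    return null;
  }
}

// Offsets (ms after phrase release) at which each token should render.
function staggerSchedule(phrase: Token[]): number[] {
  return phrase.map((_, i) => i * STAGGER_MS);
}

const shaper = new PhraseShaper();
const released: Token[][] = [];
for (const text of ["The", " cat", " sat", ",", " then", " slept", "."]) {
  const phrase = shaper.ingest({ text });
  if (phrase) released.push(phrase);
}
```

The sample input releases two phrases, one ending at the comma and one at the period. The renderer would then schedule each token of a phrase at the offsets returned by `staggerSchedule`.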

This approach cut comprehension errors by 18% in a kids' reading app where children with dyslexia followed along with generated stories. Phrase chunks gave clear visual boundaries, reducing the cognitive load of tracking individual word appearances.

Implementation: Mobile Platform Constraints

On iOS, token shaping lives in a dedicated DispatchQueue with QoS .userInteractive to minimize preemption. The emission loop is driven by a CADisplayLink synchronized to the screen refresh (60Hz on most devices), so tokens appear on frame boundaries rather than partway through a frame.

Android requires more care. Kotlin coroutines with Dispatchers.Main.immediate handle UI updates, but the shaper itself runs on a custom HandlerThread with THREAD_PRIORITY_DISPLAY. A Choreographer callback ensures frame sync:

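val choreographer = Choreographer.getInstance() // obtain on the main thread so the callback can touch views
choreographer.postFrameCallback(object : Choreographer.FrameCallback {
  override fun doFrame(frameTimeNanos: Long) {
    bucket.tick(frameTimeNanos / 1_000_000)?.let { textView.append(it.text) }
    choreographer.postFrameCallback(this) // frame callbacks are one-shot; re-post for the next frame
  }
})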

Battery impact is negligible: the shaper thread wakes 60 times per second but performs […]

  […] 0) {
    this.emit(this.queue.shift()!);
  }
  this.lastEmit = 0;
}

For long-running inference (>30s responses), backpressure prevents memory bloat. If the buffer exceeds 50 tokens, the shaper signals the inference thread to pause generation via a semaphore. Generation resumes when the buffer drains below 20 tokens. This caps memory at ~400KB for token strings plus metadata.
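The pause and resume thresholds form a hysteresis band, which keeps the producer from flapping on and off around a single limit. A minimal sketch of that logic follows. The text describes signaling the inference thread with a semaphore; this version only exposes a flag the producer would check, and the `BackpressureGate` name is hypothetical.

```typescript
// Hysteresis gate: pause above the high watermark, resume below the low one.
// A sketch only; a real shaper would signal the inference thread instead.
class BackpressureGate {
  private paused = false;

  constructor(private readonly high = 50, private readonly low = 20) {}

  // Call whenever the buffer size changes.
  // Returns true while generation should stay paused.
  update(bufferSize: number): boolean {
    if (!this.paused && bufferSize > this.high) this.paused = true;
    else if (this.paused && bufferSize < this.low) this.paused = false;
    return this.paused;
  }
}

const gate = new BackpressureGate();
const states = [10, 50, 51, 35, 20, 19].map((n) => gate.update(n));
// states: [false, false, true, true, true, false]
```

Generation pauses only once the buffer exceeds 50 tokens and stays paused through 35 and 20, resuming when the size drops to 19.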

Measuring Perceived Latency

Traditional metrics—time-to-first-token, tokens-per-second—don't capture shaping effectiveness. We track:

Jitter coefficient: standard deviation of inter-token intervals. Target