Incremental Tokenization: Sub-100ms LLM Input

The Tokenization Wall

Most mobile LLM implementations batch user input, tokenize it when the send button is pressed, then begin inference. For a 200-word prompt, SentencePiece tokenization can take 180-320ms on mid-range Android devices—a perceivable freeze before the first token even starts generating. Users notice this gap. In HearingAid Pro's voice command pipeline, we measured 240ms average tokenization latency for transcribed speech inputs, creating a jarring pause between voice recognition completion and LLM response start.

The solution: incremental tokenization. Tokenize as the user types, maintaining a hot token buffer that's always inference-ready. When they hit send, the model begins generating immediately from pre-computed tokens.

Architecture Overview

Incremental tokenization requires three components: a streaming tokenizer that processes partial text, a token buffer with efficient append semantics, and a synchronization mechanism to handle edits and cursor movements.

Streaming Tokenizer Design

Standard tokenizers expect complete inputs. WordPiece uses greedy longest-match, while SentencePiece's BPE and unigram models segment the whole string at once; in both cases a partial string can tokenize differently from the complete one. Consider the input "unbelievable": tokenizing incrementally might yield ["un", "bel", "iev", "able"] at intermediate steps, but the final complete string tokenizes to ["un", "believ", "able"].
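
The mismatch is easy to reproduce with a toy greedy longest-match tokenizer. The sketch below is illustrative only: the vocabulary and the `greedyTokenize` helper are made up for this example and do not correspond to any real model or library.

```javascript
// Toy greedy longest-match tokenizer (WordPiece-like, without "##" markers).
// TOY_VOCAB is a hypothetical vocabulary chosen to show the effect.
const TOY_VOCAB = new Set([
  "un", "bel", "believ", "able",
  "a", "b", "e", "i", "l", "n", "u", "v",
]);

function greedyTokenize(text, vocab = TOY_VOCAB) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    // Try the longest candidate first, shrinking until a vocab entry matches.
    let end = text.length;
    while (end > pos && !vocab.has(text.slice(pos, end))) end--;
    if (end === pos) end = pos + 1; // unknown character: emit it as-is
    tokens.push(text.slice(pos, end));
    pos = end;
  }
  return tokens;
}

// Tokens produced for a prefix are not a prefix of the final token list.
const partial = greedyTokenize("unbeli");        // ["un", "bel", "i"]
const complete = greedyTokenize("unbelievable"); // ["un", "believ", "able"]
```

Here the prefix yields "bel" where the complete word yields "believ", so tokens near the end of the typed text cannot be treated as final.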

The fix: maintain a sliding window. Keep the last N characters (typically 32-64) in a retokenization buffer. When new characters arrive, retokenize buffer + new_chars, then diff against the existing token stream. Only append truly stable tokens—those outside the retokenization window.

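class IncrementalTokenizer {
  // model must expose encode(text) -> array of token IDs.
  // Assumes encode() adds no special tokens (BOS/EOS) on each call.
  constructor(model, windowSize = 48) {
    this.model = model;
    this.window = windowSize;
    this.buffer = "";
    this.stableTokens = [];
    this.stableEnd = 0; // buffer index up to which tokens are committed
  }

  append(text) {
    this.buffer += text;
    const reTokenizeStart = Math.max(
      0,
      this.buffer.length - this.window
    );

    // Stable region: everything before the window. Commit only the text
    // that has newly left the window instead of re-encoding the whole
    // prefix on every call, and cut at a space so that a word is never
    // split across the stable/window boundary.
    const cut = this.buffer.lastIndexOf(" ", reTokenizeStart);
    if (cut > this.stableEnd) {
      const newlyStable = this.buffer.slice(this.stableEnd, cut);
      this.stableTokens.push(...this.model.encode(newlyStable));
      this.stableEnd = cut;
    }

    // Only the uncommitted tail is retokenized on each call
    const windowTokens = this.model.encode(
      this.buffer.slice(this.stableEnd)
    );
    return this.stableTokens.concat(windowTokens);
  }
}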

Buffer Management

Token buffers need O(1) append and O(1) random access for cursor edits. A simple array works for append-only scenarios, but mobile text fields support arbitrary insertion and deletion. We use a gap buffer—the same data structure Emacs uses for text editing.

A gap buffer maintains a contiguous array with a movable "gap" of unused space. Insertions at the cursor position are O(1) because they fill the gap. Moving the cursor requires shifting the gap, which is O(k) for k tokens moved, but in practice users type sequentially 90%+ of the time.
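
A minimal sketch of a gap buffer over token IDs, assuming int32 IDs stored in an `Int32Array`. The class name `TokenGapBuffer` and its methods are hypothetical and written for this example; a production version would also need range deletion and bounds checks suited to the app.

```javascript
class TokenGapBuffer {
  constructor(capacity = 64) {
    this.data = new Int32Array(capacity);
    this.gapStart = 0;        // cursor position (first free slot)
    this.gapEnd = capacity;   // one past the last free slot
  }

  get length() {
    return this.data.length - (this.gapEnd - this.gapStart);
  }

  // Move the gap so it starts at `index`; costs O(k) for k tokens shifted.
  moveCursor(index) {
    if (index < 0 || index > this.length) throw new RangeError("bad index");
    if (index < this.gapStart) {
      const count = this.gapStart - index;
      this.data.copyWithin(this.gapEnd - count, index, this.gapStart);
      this.gapStart -= count;
      this.gapEnd -= count;
    } else if (index > this.gapStart) {
      const count = index - this.gapStart;
      this.data.copyWithin(this.gapStart, this.gapEnd, this.gapEnd + count);
      this.gapStart += count;
      this.gapEnd += count;
    }
  }

  // Insert at the cursor: O(1) unless the gap is exhausted.
  insert(tokenId) {
    if (this.gapStart === this.gapEnd) this.grow();
    this.data[this.gapStart++] = tokenId;
  }

  // Delete tokens immediately before the cursor by widening the gap.
  deleteBefore(count = 1) {
    this.gapStart -= Math.min(count, this.gapStart);
  }

  // Double the capacity, keeping the gap at the cursor.
  grow() {
    const tailLen = this.data.length - this.gapEnd;
    const bigger = new Int32Array(Math.max(1, this.data.length * 2));
    bigger.set(this.data.subarray(0, this.gapStart), 0);
    bigger.set(this.data.subarray(this.gapEnd), bigger.length - tailLen);
    this.gapEnd = bigger.length - tailLen;
    this.data = bigger;
  }

  toArray() {
    return [
      ...this.data.subarray(0, this.gapStart),
      ...this.data.subarray(this.gapEnd),
    ];
  }
}
```

Sequential typing only ever calls `insert`, so the shifting cost in `moveCursor` is paid only when the user repositions the cursor.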

For the OfflineAI chat interface, the gap buffer added 12-18% overhead relative to naive array operations with splice, but it eliminated GC pressure from frequent array reallocation. On a Pixel 6, typing a 500-token message generated 340KB of garbage with arrays vs. 40KB with the gap buffer.

Edit Synchronization

When users delete text or move the cursor mid-word, the token stream must update. The naive approach: retokenize everything. The efficient approach: track dirty regions.

Maintain a dirty bit vector aligned with character positions. When a character is inserted or deleted, mark a window around that position as dirty. On the next idle callback (typically 16-50ms after the last keystroke), retokenize only dirty regions and splice the results into the token buffer.
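
One way to sketch the bookkeeping is shown below. Note the substitution: instead of a per-character bit vector, this example keeps a short list of merged dirty ranges, which carries the same information more compactly. The `DirtyRegions` class and its `radius` parameter are hypothetical, and the sketch ignores the shifting of earlier ranges when text length changes, which a real implementation would have to handle.

```javascript
class DirtyRegions {
  constructor(radius = 16) {
    this.radius = radius; // characters marked dirty on each side of an edit
    this.ranges = [];     // sorted, non-overlapping [start, end) pairs
  }

  // Mark a window around an edit at character position `pos`.
  mark(pos, textLength) {
    const start = Math.max(0, pos - this.radius);
    const end = Math.min(textLength, pos + this.radius);
    const merged = [];
    let current = [start, end];
    for (const [s, e] of this.ranges) {
      if (e < current[0]) {
        merged.push([s, e]);        // entirely before the new range
      } else if (s > current[1]) {
        merged.push(current);       // entirely after: emit and move on
        current = [s, e];
      } else {
        // Overlapping or touching: absorb into the current range.
        current = [Math.min(s, current[0]), Math.max(e, current[1])];
      }
    }
    merged.push(current);
    this.ranges = merged;
  }

  // Return the pending ranges and clear them; meant for an idle callback.
  flush() {
    const pending = this.ranges;
    this.ranges = [];
    return pending;
  }
}
```

In a browser, `flush()` could be called from `requestIdleCallback` or a short `setTimeout`, with each returned range retokenized and spliced into the token buffer.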

In KidzCare's speech therapy exercises, children often backspace and retype words. Dirty-region tracking reduced retokenization CPU from 18% to 3% of total input handling time during active editing sessions.

Latency Breakdown

On a Samsung Galaxy S21, tokenizing a 300-word essay shows these timings:

Batch tokenization: 285ms (all at once when send is pressed)
Incremental tokenization: 3-8ms per keystroke, 0ms at send time
Retokenization overhead: 12ms per edit operation (amortized)

The user-perceived tokenization latency at send time drops from 285ms to effectively zero. The LLM begins generating tokens immediately because the input is pre-tokenized.

Memory Overhead

Incremental tokenization requires holding both the text buffer and token buffer in memory. For a 2000-token input (roughly 1500 words), memory overhead is:

Text buffer: ~6KB (UTF-8)
Token buffer: ~8KB (int32 IDs)
Gap buffer metadata: ~2KB
Retokenization window: ~200 bytes

Total: ~16KB per active input field. Negligible on modern devices, but worth considering for apps with multiple simultaneous text inputs.
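
The arithmetic behind that total can be checked with a small helper. This is a sketch only: `estimateOverheadBytes` is a hypothetical function, and the text, metadata, and window sizes are simply the approximate figures from the list above.

```javascript
// Sum the per-field overhead; token IDs are int32, so 4 bytes each.
function estimateOverheadBytes({ textBytes, tokenCount, gapMetadataBytes, windowBytes }) {
  const tokenBytes = tokenCount * Int32Array.BYTES_PER_ELEMENT;
  return textBytes + tokenBytes + gapMetadataBytes + windowBytes;
}

const totalBytes = estimateOverheadBytes({
  textBytes: 6000,        // ~6KB of UTF-8 text
  tokenCount: 2000,       // 2000 tokens * 4 bytes = 8000 bytes
  gapMetadataBytes: 2000, // ~2KB
  windowBytes: 200,       // ~200 bytes
});
// totalBytes is 16200, i.e. roughly 16KB
```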

Edge Cases

Multi-Byte Characters

Emoji and non-Latin scripts require careful handling. A single emoji like 👨‍👩‍👧‍👦 is 7 code points and 11 UTF-16 code units, but tokenizes to 1-4 tokens depending on the vocabulary. Track character positions in UTF-16 (the native unit for JavaScript strings and for NSString-based text APIs on iOS) but tokenize in UTF-8 (most LLM vocabularies).
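
The difference between the units is easy to measure with the standard `TextEncoder` API. The sketch below counts the family emoji in each unit and adds a hypothetical helper, `utf16ToUtf8Offset`, for mapping a cursor position to a byte offset; it assumes the index does not fall inside a surrogate pair.

```javascript
// Family emoji: four people joined by three zero-width joiners (U+200D).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const encoder = new TextEncoder();

const utf16Units = family.length;                // 11 (4 * 2 + 3 * 1)
const codePoints = [...family].length;           // 7
const utf8Bytes = encoder.encode(family).length; // 25 (4 * 4 + 3 * 3)

// Convert a UTF-16 index (as reported by the text field) into a UTF-8
// byte offset (as used by a byte-oriented tokenizer).
// Assumes utf16Index does not split a surrogate pair; a lone surrogate
// would be encoded as the 3-byte replacement character U+FFFD.
function utf16ToUtf8Offset(text, utf16Index) {
  return encoder.encode(text.slice(0, utf16Index)).length;
}
```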

Autocorrect and IME

iOS autocorrect and Android IME can replace entire words after they're "committed." Listen to compositionend events (web) or textDidChange notifications (native) and mark the affected range as dirty. In GlucoScan AI's food logging interface, IME composition caused 8% of retokenization events.
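
For the web case, a minimal sketch of the event wiring might look like the following. `attachImeTracking` and the `markDirty(start, end)` callback are hypothetical names; the code assumes `input` is a text field exposing `selectionStart`, and it does not account for browser differences in the ordering of `input` and `compositionend` events.

```javascript
function attachImeTracking(input, markDirty) {
  let composing = false;
  let compositionStart = 0;

  input.addEventListener("compositionstart", () => {
    composing = true;
    compositionStart = input.selectionStart; // where the IME text begins
  });

  input.addEventListener("compositionend", (event) => {
    composing = false;
    const committed = event.data || "";
    // Mark the whole committed range dirty in one step.
    markDirty(compositionStart, compositionStart + committed.length);
  });

  input.addEventListener("input", () => {
    if (composing) return; // intermediate IME text: wait for compositionend
    const pos = input.selectionStart;
    markDirty(Math.max(0, pos - 1), pos);
  });
}
```

Deferring work until `compositionend` avoids retokenizing candidate text that the IME may still replace.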

Paste Operations

Pasting large blocks of text should bypass incremental tokenization and fall back to batch processing. Set a threshold (we use 800 characters) and tokenize any paste above it in one shot. Otherwise, the retokenization window will thrash.
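
The routing decision itself is small. The sketch below assumes two hypothetical callbacks, `batch` (retokenize the full text in one pass) and `incremental` (feed the inserted text through the streaming path); only the 800-character default comes from the text above.

```javascript
// Route an insertion to the batch or incremental path based on its size.
function handleInsertion(insertedText, { batch, incremental, threshold = 800 }) {
  return insertedText.length > threshold
    ? batch()                    // large paste: one full tokenization pass
    : incremental(insertedText); // normal typing or a small paste
}
```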

Integration with Inference

The token buffer feeds directly into the LLM's input pipeline. For llama.cpp-based models, wrap the token array in a llama_batch (for example with llama_batch_get_one) and pass it to llama_decode without further preprocessing. For ONNX Runtime, convert to the expected tensor shape.

One subtlety: the model's context window. If incremental tokenization produces 2048 tokens but the model supports only 2048 total (including output), there is no room left to generate, so you need to truncate or paginate. We truncate from the beginning, keeping the most recent N tokens (where N is the context size minus the tokens reserved for output), and display a warning to the user.
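
A minimal sketch of truncating from the beginning, assuming a fixed number of tokens is reserved for output. The function name `fitToContext` and the `reservedForOutput` parameter are hypothetical; the returned `truncated` flag is what would drive the user-facing warning.

```javascript
// Keep only the most recent tokens that fit alongside the output budget.
function fitToContext(tokens, contextSize, reservedForOutput) {
  const budget = contextSize - reservedForOutput;
  if (budget <= 0) throw new RangeError("no room left for input tokens");
  if (tokens.length <= budget) return { tokens, truncated: false };
  return {
    tokens: tokens.slice(tokens.length - budget), // drop the oldest tokens
    truncated: true,
  };
}
```

Dropping the oldest tokens can also discard a leading BOS token or system prompt, so a real pipeline may need to pin those before truncating.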

Speculative Continuation

An advanced optimization: while the user is typing, speculatively run inference on the current token buffer. If they pause for >500ms, start generating a completion in the background. If they resume typing, cancel the speculative run. In OfflineAI's autocomplete mode, this reduced perceived latency by another 200-400ms for users who pause mid-sentence.
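
The pause-then-cancel logic can be sketched as a debounce combined with an `AbortController`. Everything named here is an assumption for illustration: the `SpeculativeRunner` class, the `startInference(tokens, signal)` callback, and the injectable timer functions (included only so the logic can be exercised without real timers).

```javascript
class SpeculativeRunner {
  constructor(startInference, options = {}) {
    this.startInference = startInference; // (tokens, abortSignal) => void
    this.pauseMs = options.pauseMs ?? 500;
    // Wrapped in arrow functions so the global timers keep their own `this`.
    this.setTimer = options.setTimer ?? ((fn, ms) => setTimeout(fn, ms));
    this.clearTimer = options.clearTimer ?? ((id) => clearTimeout(id));
    this.timer = null;
    this.controller = null;
  }

  // Call after every keystroke with the current token buffer contents.
  onKeystroke(currentTokens) {
    this.cancel(); // typing resumed: drop any pending or running speculation
    this.timer = this.setTimer(() => {
      this.timer = null;
      this.controller = new AbortController();
      this.startInference(currentTokens, this.controller.signal);
    }, this.pauseMs);
  }

  cancel() {
    if (this.timer !== null) {
      this.clearTimer(this.timer);
      this.timer = null;
    }
    if (this.controller) {
      this.controller.abort(); // inference code is expected to honor this
      this.controller = null;
    }
  }
}
```

The inference side has to check `signal.aborted` (or listen for the `abort` event) between decoding steps for the cancellation to take effect.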

Production Metrics

Deploying incremental tokenization in OfflineAI's chat interface (15K daily active users) showed:

First-token latency: 285ms → 22ms (92% reduction)
CPU usage during typing: +2.1% average (tokenization overhead)
Memory footprint: +18KB per chat session
User-reported "snappiness" score: 6.2 → 8.4 out of 10

The tradeoff is clear: spend about 2% more CPU continuously to eliminate a roughly 285ms freeze at inference time. Users perceive the app as significantly faster even though total CPU time is slightly higher.

Implementation Checklist

To implement incremental tokenization in your mobile LLM app:

Choose a retokenization window size (32-64 chars for English, 64-128 for CJK)
Implement a gap buffer or similar structure for efficient token insertions
Track dirty regions to avoid full retokenization on edits
Handle IME composition events explicitly
Set a paste threshold to fall back to batch processing
Profile memory overhead on low-end devices (512MB RAM Android)
Test with emoji, RTL text, and multi-byte characters
Monitor CPU usage during sustained typing (should stay