Incremental Vocabulary Pruning: 200MB Smaller LLMs

Shipping a 1.2GB language model in a mobile app creates friction: users on metered data hesitate, App Store review times stretch, and cold-start initialization drags. Yet most consumer applications never exercise the full 32,000-token vocabulary that foundation models ship with. A medical transcription app doesn't need emojis; a recipe assistant rarely tokenizes Cyrillic. This mismatch presents an optimization opportunity that retraining alone cannot address efficiently.

Incremental vocabulary pruning is a post-quantization technique that removes unused embedding rows and remaps the tokenizer's output space at load time. Unlike model distillation or architecture search, it leaves the pretrained weights for the tokens you keep untouched, which limits accuracy regression in your target domain. When integrated into an on-device inference stack (llama.cpp, ONNX Runtime, or a custom Metal kernel), it shrank model binaries by 180–210MB in the three apps described below and cut first-token latency by 12–18%.

Why Vocabulary Overhead Matters

A typical 7B-parameter LLM with a 32K vocabulary allocates roughly 4% of its parameters to the input embedding and language modeling head (2 × 32,000 × 4,096 ≈ 262M of 7B). With a 4096-dimensional hidden state, the input embedding matrix alone is 32000 × 4096 × 2 bytes = 262MB at FP16. After 4-bit quantization, that drops to ~66MB, or ~131MB once an untied output head of the same shape is counted, and much of it is dead weight if 40% of tokens never appear in your domain corpus.
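The per-table arithmetic can be checked with a few lines of Swift. This is only a sketch: `embeddingBytes` is a hypothetical helper, and it ignores the per-block scales and zero-points that real quantized formats store alongside the weights.

```swift
/// Size in bytes of a vocab × hidden embedding table at a given bit-width.
/// Ignores quantization metadata (scales, zero-points), so real files are slightly larger.
func embeddingBytes(vocab: Int, hidden: Int, bitsPerWeight: Int) -> Int {
    return vocab * hidden * bitsPerWeight / 8
}

let fp16 = embeddingBytes(vocab: 32_000, hidden: 4_096, bitsPerWeight: 16)
let q4 = embeddingBytes(vocab: 32_000, hidden: 4_096, bitsPerWeight: 4)
print(fp16, q4)  // 262144000 65536000
```

Doubling the 4-bit figure for an output head of the same shape gives roughly 131MB for the two vocabulary-sized tables together.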

In production, we analyzed token distributions across three shipped apps: a clinical note transcriber (HearingSys backend), a grocery price scanner (Khosomati OCR pipeline), and a children's speech therapy tool (KidzCare). Across 2M inferences, 38–52% of vocabulary tokens had zero occurrences. Another 15–20% appeared fewer than ten times. Removing the zero-occurrence subset and quantizing the long-tail separately yielded measurable wins without touching the model architecture.

Two-Pass Frequency Analysis

The pruning workflow begins offline, before app distribution. First, collect a representative corpus—user logs (anonymized), synthetic data, or domain-specific text dumps. Tokenize it with your target model's tokenizer and count token IDs. Export a frequency map: {token_id: count}.
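A minimal sketch of the counting pass, assuming the corpus has already been tokenized into arrays of token IDs. The function name and the `[[Int32]]` input shape are illustrative, not part of any tokenizer API.

```swift
/// Counts how often each token ID occurs across a tokenized corpus.
/// IDs that never occur are simply absent from the returned map.
func tokenFrequencies(in documents: [[Int32]]) -> [Int32: Int] {
    var counts: [Int32: Int] = [:]
    for document in documents {
        for id in document {
            counts[id, default: 0] += 1
        }
    }
    return counts
}

let corpus: [[Int32]] = [[1, 5, 5, 9], [5, 9, 42]]
let frequencies = tokenFrequencies(in: corpus)
// frequencies[5] is 3, frequencies[9] is 2, and frequencies[7] is nil (never seen)
```

For a large corpus the same loop can run per shard, with the resulting maps merged by summing counts.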

Second pass: threshold selection. A naive cutoff (e.g., "remove tokens with count < 5") risks breaking rare but critical terms—medical abbreviations, brand names, technical jargon. Instead, use a two-tier strategy:

Hard prune: Tokens with zero occurrences across your entire corpus. These are safe to drop; if one does show up at runtime, the tokenizer falls back to subword decomposition or unknown-token handling.
Soft quantize: Tokens in the 1–50 occurrence range. Keep them in the vocabulary but quantize their embeddings more aggressively (e.g., 2-bit vs. 4-bit for common tokens). This preserves coverage while compressing storage.
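The two tiers can be sketched as a classification over the frequency map. The `VocabTier` enum, the `softMax` parameter, and the `protected` allow-list are assumptions made for illustration; the allow-list is there because special tokens such as BOS and EOS may never appear in raw corpus text yet still have to survive pruning.

```swift
enum VocabTier: Equatable { case hardPrune, softQuantize, keep }

/// Assigns a tier to every token ID in 0..<vocabSize.
/// `softMax` mirrors the 1–50 occurrence range; `protected` IDs are always kept.
func classify(vocabSize: Int, counts: [Int32: Int], softMax: Int = 50,
              protected: Set<Int32> = []) -> [VocabTier] {
    return (0..<vocabSize).map { (raw: Int) -> VocabTier in
        let id = Int32(raw)
        if protected.contains(id) { return .keep }
        let occurrences = counts[id] ?? 0
        if occurrences == 0 { return .hardPrune }
        return occurrences <= softMax ? .softQuantize : .keep
    }
}

let tiers = classify(vocabSize: 5, counts: [1: 3, 2: 120, 4: 50], protected: [0])
// tiers == [.keep, .softQuantize, .keep, .hardPrune, .softQuantize]
```

Rare but critical terms (medical abbreviations, brand names) can be added to the same allow-list so that a thin corpus does not push them into a lower tier.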

For KidzCare, we hard-pruned 11,200 tokens (mostly non-Latin scripts and obscure symbols) and soft-quantized another 4,800. The embedding layer shrank from 131MB to 89MB—a 32% reduction—with no measurable perplexity increase on held-out speech transcripts.

Runtime Embedding Remap

At inference time, the pruned model needs a modified tokenizer that maps surface strings to a compacted token ID space. If the original vocabulary was [0..31999] and you removed 10,000 tokens, the new space is [0..21999]. The embedding matrix shrinks accordingly, but the tokenizer's output must align.

We maintain two lookup tables in the app bundle:

original_to_pruned: Maps old token IDs to new IDs (or -1 if pruned).
fallback_decompositions: For pruned tokens, precomputed subword splits that the model does recognize. E.g., if token 15234 ("🎉") is pruned, decompose it into its UTF-8 byte tokens (0xF0, 0x9F, 0x8E, 0x89, assuming the vocabulary keeps byte-fallback tokens) or another subword sequence the model can handle.

The tokenizer intercepts each token ID before feeding the model. If original_to_pruned[id] == -1, it substitutes the fallback. This adds ~0.3ms per 100 tokens on an iPhone 14 Pro—negligible compared to inference cost.
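A sketch of how the two tables might be built and applied. The function names and the `unknownID` parameter are hypothetical, and the fallback sequences are assumed to be stored already in the compacted ID space.

```swift
/// Builds the old-ID -> new-ID table; pruned IDs map to -1.
func buildRemap(vocabSize: Int, pruned: Set<Int32>) -> [Int32] {
    var table = [Int32](repeating: -1, count: vocabSize)
    var next: Int32 = 0
    for id in 0..<vocabSize where !pruned.contains(Int32(id)) {
        table[id] = next
        next += 1
    }
    return table
}

/// Rewrites original token IDs into the compacted space, substituting the
/// precomputed fallback sequence (or the unknown token) for pruned IDs.
func remapTokens(_ ids: [Int32], table: [Int32],
                 fallbacks: [Int32: [Int32]], unknownID: Int32) -> [Int32] {
    var out: [Int32] = []
    out.reserveCapacity(ids.count)
    for id in ids {
        let mapped = table[Int(id)]
        if mapped >= 0 {
            out.append(mapped)
        } else {
            out.append(contentsOf: fallbacks[id] ?? [unknownID])
        }
    }
    return out
}

let table = buildRemap(vocabSize: 6, pruned: [1, 4])
// table == [0, -1, 1, 2, -1, 3]
let remapped = remapTokens([0, 4, 5, 1], table: table,
                           fallbacks: [4: [1, 2]], unknownID: 0)
// remapped == [0, 1, 2, 3, 0]
```

Decoding needs the inverse table (new ID to original ID) so that generated tokens can be turned back into text with the original vocabulary.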

Lazy Embedding Load

Beyond static pruning, runtime lazy loading further cuts initialization time. Instead of reading the entire embedding matrix into memory at app launch, map the file and materialize only the rows for tokens that appear in the first prompt. For a 1,000-token system prompt, this means reading at most 1000 × 4096 × 0.5 bytes = 2MB instead of 89MB.
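The row selection itself is small. A sketch, assuming packed 4-bit weights (half a byte per weight) and hypothetical helper names:

```swift
/// Distinct embedding rows a prompt touches, sorted so file reads stay sequential.
func rowsToLoad(for prompt: [Int32]) -> [Int32] {
    return Set(prompt).sorted()
}

/// Bytes read for `rows` embedding rows of width `dim` at 4 bits per weight.
func bytesToLoad(rows: Int, dim: Int) -> Int {
    return rows * dim / 2
}

let rows = rowsToLoad(for: [7, 3, 7, 3, 1])
// rows == [1, 3, 7]
let worstCase = bytesToLoad(rows: 1_000, dim: 4_096)
// worstCase == 2048000, the 2MB upper bound when every prompt token is distinct
```

Real prompts repeat tokens, so the number of distinct rows is usually well below the prompt length.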

Subsequent prompts trigger incremental loads. We use a 16MB LRU cache (8,192 rows at 2KB per 4-bit row): when a new row is needed and the cache is full, the least-recently-used row is evicted. This works because conversational apps exhibit strong temporal locality; users repeat vocabulary within a session.

Implementation in Swift with Metal buffers:

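import Metal

final class LazyEmbeddingLayer {
  let storage: UnsafeRawPointer  // base address of the mmap'd embedding file
  let device: MTLDevice
  let rowBytes: Int              // bytes per row: dim * 2 for FP16, dim / 2 for packed 4-bit
  let capacity = 8192            // max cached rows (16MB at 2KB per row)
  private var cache: [Int32: MTLBuffer] = [:]
  private var lastUsed: [Int32: UInt64] = [:]  // token ID -> logical time of last access
  private var clock: UInt64 = 0

  init(storage: UnsafeRawPointer, device: MTLDevice, rowBytes: Int) {
    self.storage = storage
    self.device = device
    self.rowBytes = rowBytes
  }

  func lookup(tokenID: Int32) -> MTLBuffer? {
    clock += 1
    if let cached = cache[tokenID] { lastUsed[tokenID] = clock; return cached }
    // Rows are not page-aligned, so makeBuffer(bytesNoCopy:) cannot wrap them; copy the row.
    let offset = Int(tokenID) * rowBytes
    guard let buffer = device.makeBuffer(bytes: storage + offset, length: rowBytes,
                                         options: .storageModeShared) else { return nil }
    if cache.count >= capacity { evictLRU() }
    cache[tokenID] = buffer
    lastUsed[tokenID] = clock
    return buffer
  }

  // Linear scan for the oldest entry; a linked-list LRU would avoid the O(n) cost.
  private func evictLRU() {
    guard let victim = lastUsed.min(by: { $0.value < $1.value })?.key else { return }
    cache[victim] = nil
    lastUsed[victim] = nil
  }
}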

Cold-start latency for a 2K-token prompt dropped from 340ms to 180ms on iPhone 13, because we skip loading 19K unused rows.

Soft Quantization Tiers

Not all kept tokens deserve equal bit-width. High-frequency tokens (the top 2,000 by occurrence) stay at 4-bit or even 8-bit to preserve quality. Mid-frequency tokens (ranks 2,001–10,000) use 3-bit asymmetric quantization. Rare tokens (rank 10,001 and beyond) drop to 2-bit with a shared scale factor per 128-token block.

This tiered approach requires a custom dequantization kernel. In Metal, we dispatch three separate compute passes per embedding lookup, each reading from a different quantization format. The overhead is ~0.8ms per forward pass for a 7B model, but the storage savings—an additional 40MB—justify it.
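The tier selection and the 2-bit format can be illustrated on the CPU. This is a sketch under stated assumptions, not the Metal kernel: the symmetric level layout (-1.5, -0.5, +0.5, +1.5 times the scale) is one plausible choice, and the function names are hypothetical.

```swift
/// Bit-width for a kept token from its frequency rank (0 = most frequent).
func bitWidth(forRank rank: Int) -> Int {
    switch rank {
    case ..<2_000: return 4
    case ..<10_000: return 3
    default: return 2
    }
}

/// Symmetric 2-bit quantization with one shared scale for the whole block.
/// Codes 0...3 stand for the levels -1.5, -0.5, +0.5, +1.5 times `scale`.
func quantize2Bit(_ values: [Float]) -> (codes: [UInt8], scale: Float) {
    let maxAbs = values.map { abs($0) }.max() ?? 0
    let scale: Float = maxAbs > 0 ? maxAbs / 1.5 : 1
    let codes = values.map { (value: Float) -> UInt8 in
        let level = (value / scale + 1.5).rounded()
        return UInt8(min(max(level, 0), 3))
    }
    return (codes, scale)
}

/// Inverse mapping from 2-bit codes back to approximate weights.
func dequantize2Bit(codes: [UInt8], scale: Float) -> [Float] {
    return codes.map { (Float($0) - 1.5) * scale }
}

let block = quantize2Bit([-3, -1, 1, 3])
// block.codes == [0, 1, 2, 3], block.scale == 2
let restored = dequantize2Bit(codes: block.codes, scale: block.scale)
// restored == [-3, -1, 1, 3]
```

In a shared-scale layout, `values` would hold the weights of all rows in one 128-token block, so a single outlier row widens the step size for its neighbors; that is the quality cost of the smaller metadata.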

Accuracy-Size Tradeoff

We validated pruned models against full-vocabulary baselines using perplexity on domain-specific test sets and task-specific F1 scores. Results for three apps:

HearingSys: Medical transcription. Pruned 14K tokens, perplexity increased 1.2% (from 8.4 to 8.5), WER unchanged at 6.1%. Binary size: -210MB.
Khosomati: Grocery OCR + price extraction. Pruned 9K tokens, perplexity +0.8%, extraction F1 dropped 0.3 points (97.2% → 96.9%). Binary: -180MB.
KidzCare: Child speech therapy prompts. Pruned 11K tokens, perplexity +1.1%, therapy script accuracy unchanged. Binary: -205MB.

In all cases, user-facing metrics (task completion time, error rates) remained within statistical noise. The size reduction translated to 18–22% faster App Store downloads and 12% fewer uninstalls during onboarding.

Deployment Considerations

Vocabulary pruning is not a silver bullet. It fails when:

Domain drift: If users input text outside the corpus you analyzed (e.g., switching languages), fallback decompositions may produce gibberish. Monitor token miss rates in production.
Multilingual apps: Pruning aggressively in one language breaks others. Solution: ship multiple pruned variants and select at runtime based on device locale.
Generative diversity: Creative writing apps need the full vocabulary. Pruning reduces output variety.
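Monitoring the miss rate mentioned under domain drift can be as simple as counting how many incoming token IDs land on a pruned entry. A minimal sketch, assuming a remap table in which pruned IDs are marked with -1; `MissRateMonitor` is a hypothetical name.

```swift
/// Tracks the fraction of incoming token IDs that hit a pruned entry.
struct MissRateMonitor {
    private(set) var total = 0
    private(set) var misses = 0

    /// `table` maps original IDs to compacted IDs, with -1 for pruned tokens.
    mutating func record(_ ids: [Int32], table: [Int32]) {
        for id in ids {
            total += 1
            if table[Int(id)] < 0 { misses += 1 }
        }
    }

    var missRate: Double {
        return total == 0 ? 0 : Double(misses) / Double(total)
    }
}

var monitor = MissRateMonitor()
monitor.record([0, 1, 2, 1], table: [0, -1, 1])
// monitor.misses == 2, monitor.missRate == 0.5
```

A sustained rise in this rate is a signal that the frequency analysis should be rerun on fresher data, or that a different pruned variant should be shipped.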

For apps with narrow, predictable input domains—transcription, form filling, domain-specific Q&A—the technique is a clear win. It's particularly effective when combined with other mobile optimizations: quantization-aware training, operator fusion, and Metal Performance Shaders for matrix ops.

Future Directions

Adaptive vocabulary selection is the next frontier. Instead of static pruning, the app could download vocabulary "packs" on-demand based on usage patterns. A 50MB base model ships with the top 5,000 tokens; if the user types Mandarin, fetch the CJK pack (80MB) in the background. This requires App Store compliance for dynamic code/data loading, but Apple's on-demand resources API supports it.

Another avenue: cross-app vocabulary sharing. If multiple LLM-powered apps on a device use the same foundation model, they could share a common pruned embedding layer via a system-level cache, similar to how iOS shares framework dylibs. This needs OS-level support, but the storage savings for users with 5+ AI apps would be substantial.

Incremental vocabulary pruning is a pragmatic optimization that respects the realities of mobile distribution: users care about download size, and developers care about startup time. By analyzing token frequency and selectively loading embeddings, we've shipped models that feel faster and lighter—without sacrificing the intelligence that makes them useful.