The Cold-Start Problem in Mobile Chat
Every on-device LLM chat session begins the same way: the model ingests a system prompt, user context, and conversation history before generating the first token. For a 3B-parameter model on an iPhone 14 Pro, this prefill phase, which computes the key-value (KV) cache for all input tokens, typically consumes 650-900ms. Users perceive anything over 300ms as sluggish. When 40% of your sessions share an identical system prompt ("You are a helpful medical assistant with knowledge of pediatric care..."), you're recomputing the same 180-token prefix 12,000 times per day.
Prefix caching solves this by persisting the KV cache for common prompt prefixes. When a new session starts with a known prefix, the model skips prefill for those tokens and jumps directly to processing the variable suffix. In production deployments across three shipping apps—a speech therapy assistant, a clinical decision support tool, and a customer service bot—we measured 4.2× faster time-to-first-token for cached prefixes, dropping median latency from 830ms to 195ms on Apple Silicon and from 1.1s to 260ms on mid-tier Android devices.
Architecture: Fingerprinting and Storage
The core challenge is identifying when an incoming prompt shares a prefix with a cached entry. Naive string comparison fails: tokenization boundaries, whitespace normalization, and Unicode variants create false negatives. Instead, we fingerprint the token sequence using a running 64-bit hash (an FNV-1a variant) that is updated token by token during tokenization. Because the hash depends on every token folded into it, the value after the last prefix token identifies the whole prefix; for a 180-token system prompt, we store the hash after token 180 as the cache key.
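The fingerprinting step can be sketched as follows. This is a minimal illustration, not the production implementation: it uses the standard 64-bit FNV-1a constants, and the choice to feed each token ID into the hash as four little-endian bytes is an assumption, since the text only says "FNV-1a variant".

```python
FNV_OFFSET_BASIS = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV_PRIME = 0x100000001B3              # standard 64-bit FNV prime
MASK_64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_update(state: int, token_id: int) -> int:
    """Fold one token ID into the running hash (assumed: 4 little-endian bytes)."""
    for byte in token_id.to_bytes(4, "little"):
        state ^= byte
        state = (state * FNV_PRIME) & MASK_64
    return state


def fingerprint(token_ids) -> int:
    """Hash a whole token sequence; equal sequences always give equal keys."""
    state = FNV_OFFSET_BASIS
    for token_id in token_ids:
        state = fnv1a_update(state, token_id)
    return state
```

Because the hash is updated one token at a time, it can be computed alongside tokenization, and the hash of a longer prompt can be continued from the hash of its prefix without rehashing.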
The KV cache itself is a tensor pair: keys [num_layers, num_tokens, hidden_dim] and values with identical shape. For a 3B model with 32 layers and 2560 hidden dimensions, a 180-token prefix consumes roughly 180 × 32 × 2560 × 2 (keys and values) × 2 bytes (FP16) ≈ 59MB. On iOS, we memory-map this into a file-backed region using mmap with MAP_SHARED, allowing the OS to page out cold cache entries under memory pressure. On Android, we use MemoryFile with explicit bounds checking to avoid OOM crashes on devices with aggressive low-memory killers.
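The size estimate is easy to check by evaluating the product directly. The helper below is a hypothetical convenience function, assuming the [num_layers, num_tokens, hidden_dim] layout described above with no padding or per-tensor metadata; for the stated dimensions it comes to about 59MB.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, hidden_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes for keys plus values, each shaped [num_layers, num_tokens, hidden_dim]."""
    per_tensor = num_layers * num_tokens * hidden_dim * bytes_per_value
    return 2 * per_tensor  # one tensor for keys, one for values


# The 3B-model example from the text: 180 tokens, 32 layers, 2560 dims, FP16.
size = kv_cache_bytes(num_tokens=180, num_layers=32, hidden_dim=2560)
print(f"{size} bytes is about {size / 1e6:.0f} MB")
```

The same function shows how quickly the footprint grows with prefix length, which matters when several granularities of the same prompt are stored.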
Granular Prefix Matching
Real-world prompts rarely match character-for-character. A medical app might have five system prompt variants (general, pediatric, geriatric, emergency, surgical), each with 90% token overlap. We store prefixes at multiple granularities: 32, 64, 128, 180 tokens. When a new prompt arrives, we hash incrementally and check for the longest matching prefix. If tokens 0-127 match an existing entry but token 128 diverges, we load the 128-token cache and resume prefill from position 128. This recovers 71% of the latency savings even on partial matches.
Hash collisions are statistically negligible (about 1 in 2^64 for any given pair of distinct sequences), but as a secondary check we validate the first 16 tokens of the cached sequence against the incoming prompt. A mismatch triggers a cache eviction and a full prefill.
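The lookup described in this section can be sketched as below. The names `CachedPrefix`, `register_prefixes`, `longest_cached_prefix`, and the `kv_path` field are hypothetical, the in-memory dict stands in for whatever index the apps actually use, and the byte encoding of token IDs is an assumption. The granularities and the 16-token validation come from the text.

```python
from dataclasses import dataclass

FNV_OFFSET_BASIS = 0xCBF29CE484222325
FNV_PRIME = 0x100000001B3
MASK_64 = 0xFFFFFFFFFFFFFFFF
GRANULARITIES = (32, 64, 128, 180)  # prefix lengths named in the text
VALIDATE_TOKENS = 16                # secondary check length from the text


@dataclass
class CachedPrefix:
    length: int   # number of tokens covered by the stored KV cache
    head: tuple   # first VALIDATE_TOKENS token IDs, kept for validation
    kv_path: str  # hypothetical location of the serialized KV tensors


def fold(state, token_id):
    """FNV-1a update over the token ID's 4 little-endian bytes (assumed encoding)."""
    for byte in token_id.to_bytes(4, "little"):
        state = ((state ^ byte) * FNV_PRIME) & MASK_64
    return state


def checkpoint_hashes(token_ids):
    """Hash the prompt once, recording the running hash at each granularity."""
    hashes, state = {}, FNV_OFFSET_BASIS
    for position, token_id in enumerate(token_ids, start=1):
        state = fold(state, token_id)
        if position in GRANULARITIES:
            hashes[position] = state
    return hashes


def register_prefixes(token_ids, cache):
    """Record an entry per granularity after a full prefill has produced the KV."""
    head = tuple(token_ids[:VALIDATE_TOKENS])
    for length, key in checkpoint_hashes(token_ids).items():
        cache[key] = CachedPrefix(length, head, f"prefix_{key:016x}.kv")


def longest_cached_prefix(token_ids, cache):
    """Return (tokens_to_skip, entry); (0, None) means run a full prefill."""
    hashes = checkpoint_hashes(token_ids)
    for length in sorted(hashes, reverse=True):  # try the longest prefix first
        key = hashes[length]
        entry = cache.get(key)
        if entry is None or entry.length != length:
            continue
        if entry.head != tuple(token_ids[:VALIDATE_TOKENS]):
            del cache[key]  # failed validation: evict and fall back
            return 0, None
        return length, entry
    return 0, None


# Two prompt variants that share their first 128 tokens (synthetic token IDs).
general = list(range(1000, 1180))
pediatric = general[:128] + list(range(5000, 5052))
cache = {}
register_prefixes(general, cache)
skip, _ = longest_cached_prefix(pediatric, cache)
print(skip)  # 128: prefill resumes at position 128
```

Recording checkpoints during a single hashing pass keeps the lookup cost linear in the prompt length, regardless of how many granularities are stored.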
Integration with llama.cpp and ONNX Runtime
Most mobile LLM runtimes (llama.cpp, MLC-LLM, ONNX Runtime) expose KV cache as mutable state passed between inference calls. In llama.cpp, the llama_context struct holds a contiguous buffer for all layers. We intercept this at the binding layer: after prefill completes for a cacheable prefix, we serialize the KV tensors to disk using a lightweight TLV (type-length-value) format—no protobuf overhead, just a 12-byte header per tensor (magic, version, shape, dtype) followed by raw bytes.
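One way a 12-byte header carrying magic, version, shape, and dtype could be laid out is sketched below. The field widths, the `KVC1` magic, and the dtype code are all assumptions made for illustration; the text does not specify them. The payload length is not stored separately here but derived from the shape and dtype.

```python
import struct

MAGIC = b"KVC1"   # hypothetical 4-byte magic
VERSION = 1       # hypothetical format version
DTYPE_FP16 = 1    # hypothetical dtype code
BYTES_PER_ELEMENT = {DTYPE_FP16: 2}

# Assumed layout: magic (4) + version (1) + dtype (1) + three uint16 dims (6) = 12.
HEADER = struct.Struct("<4sBB3H")


def pack_tensor(shape, dtype, raw: bytes) -> bytes:
    """Prefix raw tensor bytes with the 12-byte header."""
    layers, tokens, hidden = shape
    expected = layers * tokens * hidden * BYTES_PER_ELEMENT[dtype]
    if len(raw) != expected:
        raise ValueError(f"expected {expected} payload bytes, got {len(raw)}")
    return HEADER.pack(MAGIC, VERSION, dtype, layers, tokens, hidden) + raw


def unpack_tensor(buf: bytes, offset: int = 0):
    """Read one tensor at offset; return (shape, dtype, payload, next_offset)."""
    magic, version, dtype, layers, tokens, hidden = HEADER.unpack_from(buf, offset)
    if magic != MAGIC or version != VERSION or dtype not in BYTES_PER_ELEMENT:
        raise ValueError("unrecognized tensor header")
    size = layers * tokens * hidden * BYTES_PER_ELEMENT[dtype]
    start = offset + HEADER.size
    payload = bytes(buf[start:start + size])
    if len(payload) != size:
        raise ValueError("truncated tensor payload")
    return (layers, tokens, hidden), dtype, payload, start + size
```

A cache file would then be the key tensor followed by the value tensor, each read back with `unpack_tensor` by passing the `next_offset` from the previous call. With uint16 dimensions this layout caps each dimension at 65,535, which covers the shapes discussed here.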
On a cache hit, we deserialize directly into the context buffer before the first llama_decode call. The model sees a pre-warmed cache and begins sampling immediately. Total deserialization overhead: 18-24ms for the 180-token prefix on NVMe storage, 45-60ms on UFS 2.1 (typical mid-range Android). Against the median prefill times reported earlier (830ms and 1.1s), this is roughly 2-5% of the original prefill cost.
For ONNX Runtime Mobile, we use the RunWithBinding API to preallocate output tensors for past_key and past_value. After prefill, we memcpy these tensors into a persistent buffer and restore them on subsequent runs. ONNX Runtime's memory planner sometimes aliases tensors across layers; we force unique allocations by setting OrtMemType::OrtMemTypeCPUOutput explicitly.
Eviction Policy and Storage Limits
Mobile devices cannot cache indefinitely. We cap total cache storage at 512MB (iOS) and 256MB (Android, due to tighter memory constraints). Eviction follows a weighted LRU policy: each prefix has a score = (access_count × 0.6) - (recency_seconds / 3600 × 0.4), where recency_seconds is the time since the prefix was last accessed, and the lowest-scoring prefixes are evicted first. High-frequency prefixes (system prompts hit 200+ times/day) persist; one-off user queries expire within hours.
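A minimal sketch of this eviction policy follows. It treats the recency term as a penalty, subtracting hours since last access so that stale entries score lower; that is an assumption about how the recency term is meant to act. The `PrefixEntry` type and `evict_to_fit` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PrefixEntry:
    key: int            # fingerprint of the cached prefix
    size_bytes: int     # on-disk size of its KV tensors
    access_count: int   # number of cache hits so far
    last_access: float  # timestamp in seconds, same clock as `now`


def score(entry: PrefixEntry, now: float) -> float:
    """Weighted score: frequent hits raise it, idle hours lower it (assumed sign)."""
    hours_idle = (now - entry.last_access) / 3600
    return entry.access_count * 0.6 - hours_idle * 0.4


def evict_to_fit(entries, limit_bytes: int, now: float):
    """Drop lowest-scoring entries until the total fits; return (kept, evicted)."""
    kept = sorted(entries, key=lambda e: score(e, now), reverse=True)
    evicted = []
    total = sum(e.size_bytes for e in kept)
    while kept and total > limit_bytes:
        victim = kept.pop()  # the list is sorted, so the last item scores lowest
        total -= victim.size_bytes
        evicted.append(victim)
    return kept, evicted
```

Called with `limit_bytes` set to the platform cap (512MB or 256MB in the text), a system prompt with hundreds of hits outlives a one-off query that has sat idle for a few hours.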
We also track cache efficiency: prefixes that save