Prompt Caching for Mobile LLMs: 40% Latency Cut

The First-Token Bottleneck in Mobile LLM Apps

On-device LLM applications face a persistent UX challenge: the delay between user input and the first visible token. Streaming generation masks the latency of later tokens, but users perceive that initial pause as sluggishness. In production mobile LLM apps (coding assistants, writing tools, conversational interfaces), first-token latency typically ranges from 400ms to 1.2s on mid-range Android devices running quantized 3B-7B parameter models.

Most of that delay comes from one phase of inference. LLM inference has two distinct phases: prefill (processing the entire prompt to build the key-value (KV) cache) and decode (generating tokens one at a time). Prefill dominates first-token latency, accounting for 60-80% of the delay. For a 512-token prompt on a Snapdragon 8 Gen 2 with a 4-bit quantized 7B model, prefill typically takes 320-450ms, while each decode step runs 18-25ms.

Prompt caching exploits a key insight: many LLM interactions share common prefix structure. System prompts, few-shot examples, document context, and conversation history repeat across requests. By persisting the KV cache for these shared prefixes, we eliminate redundant prefill computation.

Architecture: Session-Level Cache Design

A production prompt cache operates at the session level, sitting between the application layer and the inference runtime. The cache stores serialized KV tensors keyed by prompt hash, with metadata tracking token count, creation timestamp, and access frequency.
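As a rough illustration of the entry layout described above, the following sketch keys opaque KV blobs by a hash of the token IDs and tracks the three metadata fields. The names `prompt_key`, `CacheEntry`, and `SessionCache` are hypothetical, and the serialized tensors are treated as plain bytes rather than any particular runtime's format.

```python
import hashlib
import struct
import time
from dataclasses import dataclass, field


def prompt_key(token_ids):
    """Hash a token-ID sequence into a stable hex key (illustrative scheme)."""
    packed = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return hashlib.sha256(packed).hexdigest()


@dataclass
class CacheEntry:
    kv_blob: bytes      # serialized KV tensors, opaque in this sketch
    token_count: int    # number of prompt tokens the blob covers
    created_at: float = field(default_factory=time.time)
    access_count: int = 0   # simple access-frequency counter


class SessionCache:
    """Hypothetical in-memory store sitting between the app and the runtime."""

    def __init__(self):
        self._entries = {}

    def put(self, token_ids, kv_blob):
        key = prompt_key(token_ids)
        self._entries[key] = CacheEntry(kv_blob, len(token_ids))
        return key

    def get(self, token_ids):
        entry = self._entries.get(prompt_key(token_ids))
        if entry is not None:
            entry.access_count += 1
        return entry
```

An exact-match key like this only finds prompts that were cached in full; matching shared prefixes needs the prefix-aware lookup discussed later.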

For llama.cpp-based mobile deployments, the cache intercepts llama_decode calls. When a new prompt arrives, the system computes a rolling hash over the input token IDs and checks for prefix matches. On a cache hit, the runtime loads the cached KV state directly into the model's attention layers and skips prefill for the matched tokens. Only the novel suffix requires computation.
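The rolling-hash check can be sketched as follows. This is an illustrative polynomial hash, not llama.cpp code: `BASE`, `MOD`, `prefix_hashes`, and `longest_cached_prefix` are assumed names, and the cache is modeled as a plain dict from prefix hash to the cached token tuple so that hash collisions can be ruled out by comparing tokens.

```python
MOD = (1 << 61) - 1   # large prime modulus (assumed choice)
BASE = 1_000_003      # arbitrary multiplier (assumed choice)


def prefix_hashes(token_ids):
    """Return a list where item i is the hash of token_ids[:i + 1]."""
    hashes = []
    h = 0
    for tok in token_ids:
        # Each step extends the previous hash, so all prefixes cost one pass.
        h = (h * BASE + tok + 1) % MOD
        hashes.append(h)
    return hashes


def longest_cached_prefix(token_ids, cached):
    """Return how many leading tokens are covered by a cached prefix.

    `cached` maps a prefix hash to the tuple of tokens that produced it.
    """
    hashes = prefix_hashes(token_ids)
    for length in range(len(token_ids), 0, -1):
        tokens = cached.get(hashes[length - 1])
        # Compare tokens as well, so a hash collision cannot cause a false hit.
        if tokens is not None and tokens == tuple(token_ids[:length]):
            return length
    return 0


system_prompt = [101, 7, 7, 42]
cached = {prefix_hashes(system_prompt)[-1]: tuple(system_prompt)}

request = system_prompt + [9, 9]
matched = longest_cached_prefix(request, cached)
suffix = request[matched:]   # only these tokens would still need prefill
```

Scanning from the longest prefix downward returns the best match first; the cost is one dictionary probe per candidate length.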

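Implementation requires careful memory management. For a 7B model with 32 layers, 32 KV heads, and a head dimension of 128, the KV cache at 4096 context length consumes roughly 2GB per session at FP16 precision (2 tensors, K and V, × 2 bytes × 32 layers × 32 heads × 128 head_dim × 4096 tokens), or about 0.5MB per token. Quantizing cached keys and values to INT8 halves this to about 1GB with negligible quality impact in practice. A full-context cache is therefore too large to hold several times over on a phone, but cached prefixes are usually much shorter: a 512-token prefix costs about 256MB at FP16 and 128MB at INT8. With prefixes of roughly that length, the 2-4 cached sessions that mobile apps typically maintain fit in a budget of 300-600MB, which is acceptable on devices with 6GB+ RAM.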
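The memory footprint follows directly from the model dimensions, so it is worth computing rather than estimating. The sketch below assumes a model with 32 layers, 32 KV heads, and a head dimension of 128, stores both keys and values, and ignores any per-block scale overhead that a real INT8 scheme would add. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem):
    """Size of the KV cache in bytes for a given number of cached tokens."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


MIB = 1024 ** 2

fp16_full = kv_cache_bytes(32, 32, 128, 4096, 2)    # full 4096-token context
int8_full = kv_cache_bytes(32, 32, 128, 4096, 1)
fp16_prefix = kv_cache_bytes(32, 32, 128, 512, 2)   # 512-token prefix only
int8_prefix = kv_cache_bytes(32, 32, 128, 512, 1)

print(fp16_full // MIB)     # 2048
print(int8_full // MIB)     # 1024
print(fp16_prefix // MIB)   # 256
print(int8_prefix // MIB)   # 128
```

Models that use grouped-query attention have fewer KV heads than attention heads, which shrinks these figures proportionally; pass the KV head count, not the query head count.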

Eviction Policy: LRU with Pinning

Cache eviction balances memory pressure against hit rate. A pure LRU policy works for general use, but production apps benefit from pinning high-value entries. System prompts and core few-shot examples remain pinned, while user-specific context follows LRU. When memory use exceeds a threshold (typically 70% of the allocated budget), the system evicts unpinned entries, least recently used first.

Pinning requires application-level hints. In a coding assistant, the base system prompt instructing the model to generate Swift code might be pinned, while project-specific context rotates based on recency. This hybrid approach maintains 85-90% hit rates in real usage patterns versus 60-65% for pure LRU.
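A minimal sketch of this hybrid policy, assuming entries are tracked only by size and a pinned flag. The class name `PinnedLRUCache` and its parameters are hypothetical; the 70% threshold is the figure quoted above.

```python
from collections import OrderedDict


class PinnedLRUCache:
    """LRU eviction that never removes pinned entries (illustrative only)."""

    def __init__(self, budget_bytes, pressure_threshold=0.70):
        self.budget_bytes = budget_bytes
        self.pressure_threshold = pressure_threshold
        # key -> (size_bytes, pinned); least recently used entries come first.
        self._entries = OrderedDict()
        self.used_bytes = 0

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return self._entries[key]

    def put(self, key, size_bytes, pinned=False):
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[0]
        self._entries[key] = (size_bytes, pinned)
        self.used_bytes += size_bytes
        return self._evict()

    def _evict(self):
        """Drop unpinned entries, least recently used first, until under the limit."""
        limit = self.budget_bytes * self.pressure_threshold
        evicted = []
        for key in list(self._entries):
            if self.used_bytes <= limit:
                break
            size, pinned = self._entries[key]
            if pinned:
                continue
            del self._entries[key]
            self.used_bytes -= size
            evicted.append(key)
        return evicted


cache = PinnedLRUCache(budget_bytes=1000)
cache.put("system_prompt", 300, pinned=True)
cache.put("project_a", 150)
cache.put("project_b", 150)
cache.get("project_a")                   # project_a is now more recent than project_b
evicted = cache.put("project_c", 200)    # pushes usage to 800, above the 700 limit
print(evicted)                           # ['project_b']
```

If pinned entries alone exceed the threshold, this sketch simply stops evicting; a real app would need to decide whether to unpin or refuse the insert.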

Prefix Matching: Trie-Based Lookup

Efficient prefix matching requires more than simple hash comparison. Token sequences form a natural trie structure where common prefixes share nodes. The cache builds a token-level trie where each node stores a pointer to its KV state snapshot.

On lookup, the system traverses the trie with incoming tokens, identifying the longest matching prefix. If the match covers 400 of 520 input tokens, the cache returns the KV state at that node, and prefill processes only the remaining 120 tokens. This partial hit still delivers 70-75% of the full cache benefit.
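The lookup described above can be sketched with a dictionary-based trie. As a simplifying assumption, this version attaches a snapshot only to nodes where a prefix was explicitly inserted, rather than to every node; `TrieNode`, `PrefixTrie`, and the string placeholders standing in for KV state are all hypothetical.

```python
class TrieNode:
    __slots__ = ("children", "kv_snapshot")

    def __init__(self):
        self.children = {}        # token ID -> child TrieNode
        self.kv_snapshot = None   # reference to cached KV state, if any


class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids, kv_snapshot):
        """Record a KV snapshot for the exact token sequence given."""
        node = self.root
        for tok in token_ids:
            node = node.children.setdefault(tok, TrieNode())
        node.kv_snapshot = kv_snapshot

    def longest_prefix(self, token_ids):
        """Return (matched_token_count, snapshot) for the deepest cached prefix."""
        node = self.root
        best_len, best_snapshot = 0, None
        for depth, tok in enumerate(token_ids, start=1):
            node = node.children.get(tok)
            if node is None:
                break
            if node.kv_snapshot is not None:
                best_len, best_snapshot = depth, node.kv_snapshot
        return best_len, best_snapshot


trie = PrefixTrie()
shared_prefix = list(range(400))                       # stand-in for 400 cached tokens
trie.insert(shared_prefix, "kv-state-400")

prompt = shared_prefix + [9000 + i for i in range(120)]   # 520 tokens in total
matched, snapshot = trie.longest_prefix(prompt)
remaining = len(prompt) - matched                      # tokens that still need prefill
print(matched, remaining)                              # 400 120
```

Traversal stops at the first token with no matching child, so lookup cost is bounded by the length of the matched prefix rather than the size of the cache.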

Trie depth matters for performance. A shallow trie (depth 3-5) reduces lookup overhead but misses fine-grained reuse. A deep trie (depth 15-20) captures more sharing but increases traversal cost. Empirical testing across conversational and document-grounded workloads suggests that depth 8-12 provides the best balance, with P99 lookup latency under 2ms.

Concurrent Access Patterns

Mobile apps rarely run multiple inference sessions simultaneously, but background pre-warming complicates cache access. When the app anticipates a user interaction (e.g., opening a chat screen), it may populate cache entries on a background thread. Concurrent reads are safe as long as KV snapshots are immutable, but writes require coordination.
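One common way to provide that coordination is a readers-writer lock. Python's standard library does not ship one, so the sketch below builds a small one on `threading.Condition`. The class name `ReadWriteLock`, the `prewarm` helper, and the dict standing in for the cache are hypothetical, and this version makes no attempt to prevent writer starvation.

```python
import threading


class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # wake a writer waiting for readers to drain

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()       # wake waiting readers and writers


lock = ReadWriteLock()
kv_store = {}   # stand-in for the prompt cache


def prewarm(key, snapshot):
    """Background insertion: takes the exclusive lock while mutating the store."""
    lock.acquire_write()
    try:
        kv_store[key] = snapshot
    finally:
        lock.release_write()


worker = threading.Thread(target=prewarm, args=("system_prompt", ("kv-bytes",)))
worker.start()
worker.join()

lock.acquire_read()     # foreground lookup: shared access
try:
    hit = kv_store.get("system_prompt")
finally:
    lock.release_read()
```

Storing snapshots as immutable values (a tuple here) means a reader can keep using a snapshot after releasing the lock, even if the entry is later replaced.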

A simple read-write lock suffices for most mobile use cases. Write operations (cache insertion after prefill) acquire exclusive access, while reads (lookup during decode) use shared locks. Lock contention remains negligible because writes occur only during prefill, which already dominates latency. In practice, lock acquisition adds