Every mobile LLM chat app faces the same brutal constraint: users expect instant replies, but autoregressive inference is inherently sequential. Each token depends on all previous tokens, and on-device hardware, even flagship Apple Silicon or Snapdragon, can't match cloud GPU throughput. The standard solution is a KV cache: store the key and value tensors the attention mechanism has already computed so you don't reprocess the entire prompt for every generated token. But most implementations treat each user message as a fresh start, throwing away perfectly reusable computation.

Prefix sharing changes that. Conversation history is immutable (system prompts, prior exchanges, and context don't change between turns), so we can retain and reuse the KV cache across inference calls. In our production measurements, this cut later-turn latency by 50–70% and reduced peak memory by about a third. The technique is simple in concept but requires careful engineering around cache eviction, memory mapping, and concurrency.

Why Standard KV Cache Isn't Enough

A typical transformer decoder with causal attention computes queries, keys, and values for every token. For a prompt of length N, that's O(N²) attention operations per layer. KV cache eliminates redundant work by storing computed key and value tensors. On turn two of a conversation, you only compute QKV for the new tokens, then concatenate with cached KV from turn one.
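To make the reuse concrete, here is a minimal sketch of one attention head with a cache. It is a toy under loud assumptions: each token is a single scalar, and the `project_*` functions are made-up stand-ins for learned weight matrices. The point is only that appending one key/value pair per new token gives the same result as rebuilding keys and values for the whole sequence.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Toy cache: one scalar key and value per token position.
struct ToyKvCache {
    std::vector<double> keys;
    std::vector<double> values;
};

// Hypothetical fixed "projections" standing in for learned weight matrices.
inline double project_q(double x) { return 0.5 * x; }
inline double project_k(double x) { return 0.3 * x + 0.1; }
inline double project_v(double x) { return 2.0 * x; }

// Softmax-weighted sum over every cached position, for a single query.
inline double attend_last(double q, const ToyKvCache& cache) {
    double max_score = -INFINITY;
    for (double k : cache.keys) max_score = std::fmax(max_score, q * k);
    double denom = 0.0, out = 0.0;
    for (std::size_t i = 0; i < cache.keys.size(); ++i) {
        double w = std::exp(q * cache.keys[i] - max_score);
        denom += w;
        out += w * cache.values[i];
    }
    return out / denom;
}

// Incremental step: project only the new token, append its K/V, then attend.
inline double decode_step(double token, ToyKvCache& cache) {
    cache.keys.push_back(project_k(token));
    cache.values.push_back(project_v(token));
    return attend_last(project_q(token), cache);
}

// Baseline: rebuild K/V for the whole (non-empty) sequence, then attend from
// the last token.
inline double recompute_all(const std::vector<double>& tokens) {
    ToyKvCache fresh;
    for (double t : tokens) {
        fresh.keys.push_back(project_k(t));
        fresh.values.push_back(project_v(t));
    }
    return attend_last(project_q(tokens.back()), fresh);
}
```

Calling `decode_step` once per token and calling `recompute_all` on the full sequence produce the same output for the final token; the cached path just skips the projections for everything already seen.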

But here's the catch: mobile inference runtimes such as llama.cpp, ONNX Runtime, and MLC-LLM are usually driven with a fresh session per request. Even if your app keeps the model weights in memory, the KV cache lives in session state and gets discarded. For a 7B parameter model with 32 layers and a hidden dimension of 4096, the cache costs about 512KB per token in FP16 (2 tensors × 32 layers × 4096 values × 2 bytes), so a 512-token cache consumes roughly 256MB and a full 4096-token context reaches 2GB. Recreating that every turn means 2–4 seconds of wasted computation on mid-range hardware.
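Cache sizes are easy to sanity-check for any model shape. The helper below is a sketch that assumes standard multi-head attention, where keys and values are as wide as the hidden dimension; models using grouped-query attention need less. The function name is illustrative.

```cpp
#include <cstdint>

// Bytes needed to hold keys and values for every layer and token.
// Assumes K and V width equals the hidden size (no grouped-query attention).
constexpr std::uint64_t kv_cache_bytes(std::uint64_t n_layers,
                                       std::uint64_t hidden_dim,
                                       std::uint64_t n_tokens,
                                       std::uint64_t bytes_per_element) {
    return 2 /* K and V */ * n_layers * hidden_dim * n_tokens * bytes_per_element;
}
```

For 32 layers, a hidden dimension of 4096, and 2-byte FP16 elements, `kv_cache_bytes(32, 4096, 1, 2)` is 524,288 bytes per token, which gives 256MiB at 512 tokens and 2GiB at 4096 tokens.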

Measuring the Baseline

In a reference implementation using llama.cpp with Llama 2 7B quantized to Q4_K_M (3.8GB on disk), a 200-token system prompt plus 50-token user message takes 3.2 seconds to first token on iPhone 14 Pro. Turn two, with 300 tokens of history, jumps to 4.1 seconds if you naively re-encode everything. Users perceive anything over 1 second as sluggish. Prefix sharing brings turn two down to 1.8 seconds by reusing the first 250 tokens of cached KV.

Architecture: Persistent Cache Manager

The core idea is to decouple KV cache lifecycle from inference session lifecycle. Instead of letting the runtime manage cache internally, we externalize it into a persistent buffer that survives across turns. Three components:

  1. Cache Store: Memory-mapped file or shared memory region holding KV tensors. On iOS, use mmap with MAP_SHARED so the buffer persists even if the app backgrounds. Android requires careful handling of process death; consider serializing to a temp file and reloading on warm start.
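  2. Prefix Tracker: Metadata structure recording which token ranges are cached and their positions. A simple array of (start_idx, end_idx, hash) tuples works. The hash is a cheap checksum that verifies the prefix hasn't changed since it was cached. Comparing only the first and last token IDs is the cheapest option, but it misses edits in the middle; a fast hash over every token ID in the range is more robust.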
  3. Eviction Policy: When the cache exceeds a threshold (say, 4096 tokens, or a byte budget set well below the OS memory limit), decide what to drop. Strategies include FIFO, LRU, or semantic chunking (keep system prompt and recent N turns, evict middle history).
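As an illustration of the Cache Store component, the sketch below maps a file with MAP_SHARED using plain POSIX calls. The `MappedCache` type and function names are hypothetical, and a real store would add a header, versioning, and error reporting.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical file-backed buffer for KV tensors. MAP_SHARED means writes reach
// the file, so a later mapping of the same path sees them.
struct MappedCache {
    void* data = nullptr;
    std::size_t size = 0;
};

// Opens (or creates) `path`, sizes it to `size` bytes, and maps it read-write.
// Returns a MappedCache with data == nullptr on failure.
inline MappedCache map_cache_file(const char* path, std::size_t size) {
    MappedCache out;
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return out;
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) { close(fd); return out; }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (p == MAP_FAILED) return out;
    out.data = p;
    out.size = size;
    return out;
}

// Flushes dirty pages to the file and releases the mapping.
inline void unmap_cache_file(MappedCache& c) {
    if (c.data == nullptr) return;
    msync(c.data, c.size, MS_SYNC);
    munmap(c.data, c.size);
    c = MappedCache{};
}
```

Mapping the same path again after `unmap_cache_file` returns the bytes written earlier, which is what lets the buffer outlive a single inference session.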

When a new user message arrives, the app computes a prefix hash of the conversation history. If it matches a cached prefix, the inference engine loads the cached KV tensors and appends only the new tokens. If not, it starts fresh and writes the new KV cache back to the store.
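The lookup can be sketched as follows. FNV-1a is used here only as an example of a cheap hash that covers every token; the `CachedPrefix` struct and `reusable_tokens` function are illustrative names, and the sketch tracks a single prefix that starts at token 0.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using TokenId = std::int32_t;

// FNV-1a over the first `n` token IDs, byte by byte.
inline std::uint64_t prefix_hash(const std::vector<TokenId>& tokens, std::size_t n) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < n && i < tokens.size(); ++i) {
        std::uint32_t t = static_cast<std::uint32_t>(tokens[i]);
        for (int b = 0; b < 4; ++b) {
            h ^= (t >> (8 * b)) & 0xFFu;
            h *= 1099511628211ULL;  // FNV prime
        }
    }
    return h;
}

// Hypothetical tracker entry: the cache holds KV for tokens [0, n_tokens).
struct CachedPrefix {
    std::size_t n_tokens;
    std::uint64_t hash;
};

// Returns how many leading tokens of `history` can be served from the cache:
// the full cached length if the hashes agree, otherwise 0 (start fresh).
inline std::size_t reusable_tokens(const CachedPrefix& cached,
                                   const std::vector<TokenId>& history) {
    if (cached.n_tokens == 0 || cached.n_tokens > history.size()) return 0;
    return prefix_hash(history, cached.n_tokens) == cached.hash ? cached.n_tokens : 0;
}
```

The engine would then run the model only on `history[reusable_tokens(...)..]` and update the tracker entry once the new KV tensors are written.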

Implementation in llama.cpp

llama.cpp can serialize session state through llama_copy_state_data and llama_set_state_data (renamed llama_state_get_data and llama_state_set_data in newer releases), but these are heavyweight (several seconds for large caches). A faster approach is to directly manipulate the llama_kv_cache struct. Patch the library to accept an external buffer pointer and size, then manage allocation yourself:

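struct external_kv_cache {
  void*  k_data;        // keys, FP16 or FP32; the caller tracks the element type
  void*  v_data;        // values, same layout as k_data
  size_t n_tokens;      // number of token positions currently cached
  size_t layer_stride;  // bytes between consecutive layers in each buffer
};

// On session init (patched fork only; kv_cache is not reachable through the
// public llama.h API, where llama_context is an opaque type):
llama_context* ctx = llama_init_from_file(...);
ctx->kv_cache.k = external_cache.k_data;
ctx->kv_cache.v = external_cache.v_data;
ctx->kv_cache.n = external_cache.n_tokens;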

This requires a fork or build-time patch, but the latency win is dramatic: attaching a cache that is already in mmap'd memory takes under 100ms, versus 3+ seconds for full recomputation.

Eviction Policies and Memory Pressure

Mobile devices don't have infinite RAM. iOS kills apps at ~2GB footprint (varies by device); Android is more forgiving but users notice lag. A naive "never evict" policy causes OOM crashes. In production, we use a hybrid approach:

  • System prompt pinning: The first 100–200 tokens (instructions, persona, guidelines) are marked immutable and never evicted. These are computed once per session and reused for hours.
  • Sliding window: Keep the most recent 1024 tokens of conversation. Older turns are evicted FIFO. This balances coherence (model sees recent context) with memory.
  • Topic-shift invalidation: If the app detects a topic shift (e.g., user says "let's talk about something else"), invalidate the cache and start fresh. Detecting shifts is non-trivial; a simple heuristic is cosine similarity of turn embeddings dropping below 0.6.
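The pinning and sliding-window rules combine into a small planning step. The sketch below only decides which token ranges survive; the names are illustrative, and copying or compacting the actual tensors is left out.

```cpp
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of token positions in the cache.
struct TokenRange {
    std::size_t begin;
    std::size_t end;
};

// Keeps the pinned system prompt plus the most recent `window` tokens and
// returns the ranges to retain, oldest first. Everything in between is evicted.
inline std::vector<TokenRange> ranges_to_keep(std::size_t total_tokens,
                                              std::size_t pinned_tokens,
                                              std::size_t window) {
    if (pinned_tokens > total_tokens) pinned_tokens = total_tokens;
    std::size_t tail_len = total_tokens - pinned_tokens;
    if (tail_len <= window) {
        return {TokenRange{0, total_tokens}};  // nothing to evict yet
    }
    std::vector<TokenRange> keep;
    if (pinned_tokens > 0) keep.push_back(TokenRange{0, pinned_tokens});
    if (window > 0) keep.push_back(TokenRange{total_tokens - window, total_tokens});
    return keep;
}
```

With a 200-token pinned prompt and a 1024-token window, a 3000-token cache keeps positions [0, 200) and [1976, 3000), while a 900-token cache is kept whole.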

One subtlety: evicting the middle of a cache creates a discontinuity, because the cached keys were computed for contiguous token positions. To fix this, we either (a) re-encode the surviving tokens with adjusted positions, or (b) rely on rotary position embeddings (RoPE), where attention depends only on relative offsets, so surviving keys can be rotated to their new positions instead of being recomputed.
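The property that makes position repair cheap with rotary embeddings is that rotations compose: a key rotated for position p can be moved to position p' by one extra rotation of (p' - p) steps. The sketch below shows this for a single feature pair and a single frequency `theta`; it illustrates the math and is not any particular runtime's implementation.

```cpp
#include <cmath>
#include <utility>

// Rotates one (even, odd) feature pair by `pos * theta`, as RoPE does for a
// single frequency. Real implementations repeat this per pair with a
// different theta for each.
inline std::pair<double, double> rope_rotate(std::pair<double, double> x,
                                             double pos, double theta) {
    double c = std::cos(pos * theta), s = std::sin(pos * theta);
    return {x.first * c - x.second * s, x.first * s + x.second * c};
}

// Moves an already-rotated cached key from `old_pos` to `new_pos` by applying
// only the difference in rotation; the unrotated key is not needed.
inline std::pair<double, double> rope_shift(std::pair<double, double> cached_key,
                                            double old_pos, double new_pos,
                                            double theta) {
    return rope_rotate(cached_key, new_pos - old_pos, theta);
}
```

After evicting a middle block of 1024 tokens, each surviving later key would be shifted back by 1024 positions this way, which matches (up to floating-point error) what encoding it at the new position from scratch would give.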

Real-World Numbers

In KidzCare, a speech therapy app shipping to 15,000 users, we implemented prefix sharing for the conversational AI tutor. Before optimization, median turn-two latency was 4.7 seconds. After:

  • Turn 1 (cold): 3.1s (no cache to reuse)
  • Turn 2: 1.9s (250 tokens cached, 60% speedup)
  • Turn 5: 1.6s (800 tokens cached, 66% speedup)
  • Peak memory: 4.2GB → 2.8GB (33% reduction)

Battery impact was measurable but acceptable: 8% per hour of active chat, versus 11% without caching. The CPU saved by not recomputing attention outweighs the cost of mmap I/O.

Concurrency and Thread Safety

If your app supports background inference or multiple chat threads, cache management gets tricky. Key issues:

  • Race conditions: Two threads writing to the same cache buffer corrupt data. Use a mutex or actor pattern. In Swift, wrap the cache store in an actor so all access is serialized.
  • Partial writes: If the app crashes mid-inference, the cache file may be incomplete. Write to a temp file, then atomically rename on success (rename(2) is atomic on POSIX).
  • Cache invalidation: If the model weights change (e.g., user downloads a fine-tuned version), invalidate all caches. Store a model version hash in the cache metadata.
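The partial-write and invalidation points can be handled together: write to a temp file, rename on success, and store the model hash in a header that the loader checks. The sketch below assumes a POSIX filesystem (where renaming over an existing file is atomic) and uses a hypothetical `CacheHeader` layout; a production version would also fsync before renaming and bound `payload_bytes` before allocating.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical on-disk header; the model hash lets a loader reject stale caches.
struct CacheHeader {
    std::uint64_t model_hash;
    std::uint64_t payload_bytes;
};

// Writes header + payload to "<path>.tmp", then renames over `path`, so a crash
// mid-write never leaves a truncated file at `path`.
inline bool save_cache_atomic(const std::string& path, std::uint64_t model_hash,
                              const std::vector<unsigned char>& payload) {
    const std::string tmp = path + ".tmp";
    std::FILE* f = std::fopen(tmp.c_str(), "wb");
    if (!f) return false;
    CacheHeader h{model_hash, static_cast<std::uint64_t>(payload.size())};
    bool ok = std::fwrite(&h, sizeof h, 1, f) == 1 &&
              (payload.empty() ||
               std::fwrite(payload.data(), 1, payload.size(), f) == payload.size());
    ok = (std::fclose(f) == 0) && ok;
    if (!ok) { std::remove(tmp.c_str()); return false; }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Loads the payload only if the stored model hash matches the current model.
inline bool load_cache(const std::string& path, std::uint64_t model_hash,
                       std::vector<unsigned char>& payload) {
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return false;
    CacheHeader h{};
    bool ok = std::fread(&h, sizeof h, 1, f) == 1 && h.model_hash == model_hash;
    if (ok) {
        payload.resize(h.payload_bytes);
        ok = payload.empty() ||
             std::fread(payload.data(), 1, payload.size(), f) == payload.size();
    }
    std::fclose(f);
    return ok;
}
```

A load with a different `model_hash` returns false, so the caller falls back to a cold start instead of feeding the new weights a cache built by the old ones.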

In OfflineAI, our on-device LLM SDK, we use a lock-free ring buffer for small caches (under 1024 tokens) and a read-write lock for large caches. The tradeoff: ring buffers have bounded size but zero allocation overhead; rwlocks scale better but add 10–20µs latency per access.
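For readers unfamiliar with the ring-buffer side of that tradeoff, here is a generic single-producer, single-consumer ring buffer. It is a textbook sketch, not the OfflineAI implementation, whose design is not described here; it shows where the bounded size and the absence of allocation come from.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer, single-consumer ring buffer with fixed storage for N slots.
// One slot is left unused to distinguish "full" from "empty", so it holds at
// most N - 1 items. No memory is allocated after construction.
template <typename T, std::size_t N>
class SpscRing {
public:
    // Producer thread only. Returns false when the buffer is full.
    bool push(const T& item) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) return false;
        slots_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false when the buffer is empty.
    bool pop(T& out) {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = slots_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

The fixed capacity is the cost: once the buffer is full, `push` fails and the caller has to evict or fall back to a larger, lock-protected store.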

Cross-Platform Considerations

iOS and Android diverge in memory management. iOS uses jetsam, which kills apps exceeding memory limits without warning. Android's low-memory killer is more gradual but still brutal under pressure. Best practices:

  • iOS: Use os_proc_available_memory() to monitor available RAM. If below 500MB, aggressively evict cache. Register for UIApplication.didReceiveMemoryWarningNotification and drop all but system prompt.
  • Android: Listen for onTrimMemory(TRIM_MEMORY_RUNNING_LOW). Serialize cache to disk and release in-memory buffers. On next inference, reload from disk (adds 200–400ms latency but prevents kill).
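The platform callbacks differ, but the decision they feed can live in shared native code. The sketch below is a hypothetical, platform-neutral policy: `available_bytes` would come from os_proc_available_memory() on iOS, and `os_warning` would be set by the memory-warning notification or the onTrimMemory callback. The enum and function names are assumptions for illustration.

```cpp
#include <cstdint>

enum class PressureAction { None, EvictToSystemPrompt, SpillToDisk };

// Chooses a response to memory pressure. The 500MB low watermark follows the
// iOS guidance above; tune it per device class.
inline PressureAction choose_action(std::uint64_t available_bytes, bool os_warning,
                                    bool can_spill_to_disk) {
    constexpr std::uint64_t kLowWatermark = 500ull * 1024 * 1024;
    if (!os_warning && available_bytes >= kLowWatermark) return PressureAction::None;
    return can_spill_to_disk ? PressureAction::SpillToDisk
                             : PressureAction::EvictToSystemPrompt;
}
```

Keeping the policy in one function makes it straightforward to call from Swift, Kotlin, or a Flutter platform channel and to unit-test without a device.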

Flutter apps benefit from platform channels to call native cache management code. React Native can use JSI for zero-copy buffer sharing, but setup is complex.

Debugging and Observability

Prefix sharing failures are silent: the model simply recomputes everything, and you lose the speedup. Instrument your cache layer:

  • Log cache hit/miss rates per turn. Target >80% hit rate after turn two.
  • Measure time spent in cache lookup versus inference. If lookup exceeds 50ms, your hash function is too slow.
  • Track memory high-water mark. If it grows unbounded, your eviction policy is broken.

Expose metrics in a debug UI: "Cache size: 1823 tokens, 3.2GB. Hits: 47, Misses: 3. Evictions: 12." This saved hours of debugging in production when users reported "slow responses": it turned out the cache was thrashing due to overly aggressive eviction.
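A counter struct along these lines is enough to drive such a display. The field and method names are illustrative, and the byte size is omitted because it depends on how the store reports memory.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical counters for a cache layer, updated once per lookup.
struct CacheMetrics {
    std::size_t hits = 0;
    std::size_t misses = 0;
    std::size_t evictions = 0;
    std::size_t cached_tokens = 0;

    void record_lookup(bool hit) { hit ? ++hits : ++misses; }

    // Fraction of lookups served from cache; 0 when nothing was recorded.
    double hit_rate() const {
        std::size_t total = hits + misses;
        return total == 0 ? 0.0
                          : static_cast<double>(hits) / static_cast<double>(total);
    }

    // One-line summary suitable for a debug overlay.
    std::string summary() const {
        char buf[128];
        std::snprintf(buf, sizeof buf,
                      "Cache size: %zu tokens. Hits: %zu, Misses: %zu. Evictions: %zu.",
                      cached_tokens, hits, misses, evictions);
        return buf;
    }
};
```

A rising eviction count alongside a falling `hit_rate()` is the thrashing signature worth alerting on.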

When Not to Use Prefix Sharing

This technique shines in chat apps with multi-turn conversations. It's overkill for:

  • Single-shot inference: Code completion, one-off Q&A, translation. No reusable prefix.
  • Highly dynamic prompts: If every turn changes the system instructions, cache hit rate plummets.
  • Memory-constrained devices: Budget Android phones with 2GB of RAM have no headroom for a large KV cache on top of the model weights. Fall back to smaller models or cloud inference.

Also consider prompt compression techniques (e.g., Gist tokens, AutoCompressors) for very long conversations. These reduce the prefix size itself, making caching less critical.

Future Directions

Emerging research on sparse attention and linear transformers may make KV caching obsolete by making attention sub-quadratic. Models like RWKV and RetNet use recurrent structures that carry a fixed-size state instead of the full history. But until those hit production mobile runtimes, prefix sharing remains the best tool for low-latency chat.

Another frontier: shared caches across users. If a thousand users share the same system prompt, why compute it a thousand times? A centralized cache service (even on-device, using shared memory) could amortize that cost. Privacy implications are non-trivial, but the latency gains are compelling.