Multi-turn conversational AI on mobile devices faces a brutal tradeoff: cache the entire dialogue history for coherent responses, or discard context to stay within RAM budgets. Most production apps choose neither extreme, instead implementing naive sliding windows that truncate at arbitrary token counts. The result? Chatbots that forget critical context mid-conversation, or apps that crash after six exchanges.
A dual-stream key-value cache architecture solves this by treating conversation state as two distinct data structures with different lifecycle policies. The persistent stream holds immutable system prompts, user preferences, and long-term facts. The ephemeral stream contains recent dialogue turns, candidate responses, and speculative branches. This separation enables aggressive pruning of transient data while preserving semantic anchors that keep the model grounded.
The Memory Wall in Mobile LLM Inference
On-device language models generate tokens autoregressively, computing attention over all previous tokens at each step. The key-value cache stores these intermediate attention states to avoid redundant computation. For a 7B parameter model with 32 layers and 32 attention heads of 128 dimensions each, every token consumes roughly 512KB of fp16 KV cache (2 tensors × 32 layers × 32 heads × 128 dimensions × 2 bytes), assuming no grouped-query attention. A 2048-token conversation therefore requires about 1GB just for cached attention, before accounting for model weights, activations, or the OS.
An iPhone 15 Pro has 8GB of RAM, but an app can use only part of it before the OS triggers memory warnings. A quantized 4-bit 7B model takes about 3.5GB, so with an app budget of around 5GB you have perhaps 1.5GB for runtime state. With naive full-context caching at 512KB per token, you hit that limit after roughly 3,000 tokens, which a detailed technical discussion with long answers can reach within a handful of turns. The app either terminates or starts paging model weights to disk, destroying inference latency.
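The memory arithmetic in this section can be checked with a few lines of C. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the caller supplies the layer count, number of KV heads, head dimension, and element size, none of which are fixed by the text beyond 32 layers and 128-dimensional heads.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of KV cache per token: one key vector and one value vector
 * per layer per KV head. All parameters are caller-supplied assumptions. */
static uint64_t kv_bytes_per_token(uint32_t layers, uint32_t kv_heads,
                                   uint32_t head_dim, uint32_t bytes_per_elem)
{
    assert(layers && kv_heads && head_dim && bytes_per_elem);
    return 2ull * layers * kv_heads * head_dim * bytes_per_elem;
}

/* Number of whole tokens whose KV state fits in a runtime budget. */
static uint64_t tokens_that_fit(uint64_t budget_bytes, uint64_t per_token)
{
    assert(per_token > 0);
    return budget_bytes / per_token;
}
```

With 32 layers, 32 KV heads of dimension 128, and 2-byte elements, `kv_bytes_per_token` returns 524,288 bytes. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.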
Persistent Stream: Semantic Anchors
The persistent stream holds tokens that define the model's operational context but rarely change. This includes the system prompt, user profile data, and extracted facts from earlier conversation phases. These tokens are marked immutable at creation and stored in a contiguous memory-mapped buffer.
When initializing a new conversation, the runtime pre-fills the persistent cache with these anchors. For a medical chatbot, this might be 200 tokens describing the app's clinical constraints, patient age and conditions, and medication history. The model attends to these tokens on every generation step, but their KV pairs never change. By mmap-ing this buffer from read-only storage, we avoid duplicating it in heap and can share it across conversation threads.
Key design decision: the persistent stream is append-only during a session. If the user updates their profile, we don't mutate existing entries. Instead, we append a new fact and mark the old one superseded via a tombstone bit. The attention mechanism learns to weight recent facts more heavily through position embeddings. Because existing entries are never rewritten in place, this append-only pattern prevents the cache invalidation cascades that would otherwise force recomputing attention for the entire history.
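The append-and-tombstone bookkeeping might look like the sketch below. It tracks only the index of facts, not their KV data, and every name in it (`Fact`, `PersistentIndex`, the `key` field identifying which attribute a fact describes, the 64-entry limit) is a hypothetical choice made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FACTS 64u   /* illustrative limit, not from the text */

typedef struct {
    uint32_t key;          /* which attribute this fact describes (hypothetical id) */
    uint32_t first_token;  /* start of the fact's token range in the stream */
    uint32_t n_tokens;
    bool     superseded;   /* tombstone bit */
} Fact;

typedef struct {
    Fact     facts[MAX_FACTS];
    uint32_t count;
    uint32_t next_token;   /* end of the append-only token stream */
} PersistentIndex;

/* Append a fact and tombstone any live fact with the same key.
 * Returns the index of the new entry, or -1 if the index is full. */
static int persistent_append(PersistentIndex *p, uint32_t key, uint32_t n_tokens)
{
    assert(p != NULL);
    if (p->count == MAX_FACTS) return -1;
    for (uint32_t i = 0; i < p->count; i++)
        if (p->facts[i].key == key && !p->facts[i].superseded)
            p->facts[i].superseded = true;   /* flip the bit, never rewrite the range */
    Fact *f = &p->facts[p->count];
    f->key = key;
    f->first_token = p->next_token;
    f->n_tokens = n_tokens;
    f->superseded = false;
    p->next_token += n_tokens;
    return (int)p->count++;
}

/* Tokens belonging to facts that have not been superseded. */
static uint32_t persistent_live_tokens(const PersistentIndex *p)
{
    uint32_t live = 0;
    for (uint32_t i = 0; i < p->count; i++)
        if (!p->facts[i].superseded) live += p->facts[i].n_tokens;
    return live;
}
```

Existing token ranges never move, so KV entries computed for earlier facts stay valid after an update. Enforcing the stream's token cap is left out of this sketch.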
Capacity and Eviction Policy
We cap the persistent stream at 512 tokens, enforced at session start. If the combined system prompt and user context exceeds this, we run an extractive summarization pass to condense biographical details and preferences. The summarizer is a separate, tiny 1B model (200MB) optimized for fact compression, not generation quality. It runs once during session init, producing a dense 300-token profile that captures 90% of the semantic information.
During long sessions, if new persistent facts are appended (e.g., user mentions a new allergy), we trigger incremental summarization. The 1B model re-encodes the persistent stream, merging redundant facts and pruning low-salience details. This happens asynchronously on a background thread while the main LLM continues generating from the old cache. When summarization completes, we atomically swap the persistent buffer and invalidate downstream attention caches. The user sees a brief 80ms stutter as the model re-attends, but context remains intact.
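One way to express the atomic swap is a pointer exchange plus a generation counter, sketched here with C11 atomics. The type and function names are assumptions, and safe reclamation of the old buffer (waiting until no reader still uses it) is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const void *kv;        /* KV data for the persistent stream */
    uint32_t    n_tokens;
} PersistentBuffer;

typedef struct {
    _Atomic(PersistentBuffer *) current;     /* read by the inference thread */
    _Atomic uint32_t            generation;  /* bumped on every swap */
} PersistentSlot;

/* Background thread: publish a freshly summarized buffer.
 * Returns the previous buffer so the caller can release it later. */
static PersistentBuffer *persistent_publish(PersistentSlot *s, PersistentBuffer *fresh)
{
    assert(s != NULL && fresh != NULL);
    PersistentBuffer *old =
        atomic_exchange_explicit(&s->current, fresh, memory_order_acq_rel);
    atomic_fetch_add_explicit(&s->generation, 1, memory_order_release);
    return old;
}

/* Inference thread: true when ephemeral KV computed against
 * *seen_generation is stale and must be recomputed. */
static bool persistent_changed(PersistentSlot *s, uint32_t *seen_generation)
{
    uint32_t g = atomic_load_explicit(&s->generation, memory_order_acquire);
    if (g == *seen_generation) return false;
    *seen_generation = g;
    return true;
}
```

In this sketch the inference thread would cache the buffer pointer and reload it only when `persistent_changed` returns true, so a generation step never mixes the new persistent buffer with ephemeral state built against the old one.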
Ephemeral Stream: Dialogue Turns and Speculation
The ephemeral stream holds everything else: user messages, assistant responses, and speculative decode candidates. This cache is hot—tokens are added on every generation step and pruned aggressively based on recency and relevance.
We implement a circular buffer with a 1536-token capacity. When full, we evict the oldest complete dialogue turn (user message + assistant response pair). Crucially, we never split a turn: either both sides are in cache or neither is. This preserves conversational coherence better than token-level sliding windows, which can orphan a question or answer fragment.
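Turn-level eviction can be sketched as a ring of turn records layered over the token ring. The struct layout and names below are assumptions; only the 1536-token capacity comes from the text, and the 64-turn limit mirrors the size of the turn array in the cache struct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EPHEMERAL_CAPACITY 1536u   /* tokens, from the text */
#define MAX_TURNS 64u

typedef struct { uint32_t start; uint32_t n_tokens; } Turn;

typedef struct {
    Turn     turns[MAX_TURNS];  /* ring of turns, oldest at index `first` */
    uint32_t first, count;
    uint32_t head;              /* next write slot in the token ring */
    uint32_t used;              /* tokens currently cached */
} TurnRing;

/* Reserve room for one complete turn (user message + assistant reply),
 * evicting whole turns from the oldest end until it fits.
 * Returns the number of turns evicted, or -1 if the turn can never fit. */
static int turn_ring_push(TurnRing *r, uint32_t n_tokens)
{
    assert(r != NULL);
    if (n_tokens == 0 || n_tokens > EPHEMERAL_CAPACITY) return -1;
    int evicted = 0;
    while (r->used + n_tokens > EPHEMERAL_CAPACITY || r->count == MAX_TURNS) {
        r->used -= r->turns[r->first].n_tokens;   /* drop the whole turn */
        r->first = (r->first + 1) % MAX_TURNS;
        r->count--;
        evicted++;
    }
    Turn *t = &r->turns[(r->first + r->count) % MAX_TURNS];
    t->start = r->head;
    t->n_tokens = n_tokens;
    r->head = (r->head + n_tokens) % EPHEMERAL_CAPACITY;
    r->used += n_tokens;
    r->count++;
    return evicted;
}
```

Because eviction always removes a whole `Turn`, the cache can hold fewer tokens than its capacity right after an eviction, which is the price of never orphaning half an exchange.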
Turn-Level Attention Masks
Each turn in the ephemeral stream carries metadata: token range, timestamp, and a relevance score computed by a lightweight BERT-style encoder (50MB). When generating a new response, we compute cosine similarity between the current user query embedding and all cached turn embeddings. Turns below a 0.6 similarity threshold are masked out of attention, even if still in the buffer.
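This dynamic masking means the model can "reach back" to relevant earlier turns even if newer, less relevant exchanges have pushed them deeper into the cache. In practice, a 12-turn conversation might have only 6 turns active in attention at any moment, roughly halving the number of cached positions each step attends over without losing coherence. Masked turns still occupy the buffer until they are evicted, so the saving is in attention compute and memory bandwidth rather than in resident cache size.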
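The relevance filter can be sketched as follows. This example assumes the turn embeddings and the query embedding are already L2-normalized, so cosine similarity reduces to a dot product; the function names and the flat row-major embedding layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two vectors; equals cosine similarity when both
 * inputs are L2-normalized (an assumption of this sketch). */
static float dot(const float *a, const float *b, size_t dim)
{
    float acc = 0.0f;
    for (size_t i = 0; i < dim; i++) acc += a[i] * b[i];
    return acc;
}

/* Returns a bitmask with bit i set when turn i meets the similarity
 * threshold and should take part in attention. Supports up to 64 turns.
 * turn_embs holds n_turns rows of dim floats each. */
static uint64_t relevant_turns(const float *query, const float *turn_embs,
                               size_t n_turns, size_t dim, float threshold)
{
    assert(query != NULL && turn_embs != NULL && n_turns <= 64);
    uint64_t mask = 0;
    for (size_t i = 0; i < n_turns; i++)
        if (dot(query, turn_embs + i * dim, dim) >= threshold)
            mask |= (uint64_t)1 << i;
    return mask;
}
```

Calling `relevant_turns(query, embs, n, dim, 0.6f)` applies the 0.6 threshold described above. The resulting turn mask still has to be expanded to per-position bits before the attention kernel can use it.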
Speculative Branches and Pruning
Speculative decoding uses a small draft model to propose several candidate tokens, then verifies them in parallel with the main model. Rejected candidates must be discarded, but their KV states have already been computed. In a naive implementation, these dead branches pollute the cache, consuming memory for tokens that will never appear in the final output.
Our dual-stream design treats speculative branches as ephemeral sub-streams. Each branch gets its own circular buffer, forked from the main ephemeral stream at the speculation point. When a branch is validated, we promote its tokens to the main buffer. When rejected, we simply drop the entire sub-buffer—no need to scan and remove individual entries.
This branching structure is implemented as a lock-free tree of circular buffers, where each node holds a 128-token sub-cache. Speculation typically produces 3-5 branches, so we pre-allocate a pool of 8 sub-caches to avoid malloc overhead. In the common case where speculation depth is 4 tokens and 2 branches are explored, we hold KV state for only 8 extra tokens at a time, and rejected candidates never accumulate in the main history as they would in a naive cache.
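The fork, promote, and drop lifecycle can be illustrated with a small single-threaded sketch. It models only the bookkeeping for a flat pool of branches, not the lock-free tree or the KV data itself, and all type and function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE  8u     /* pre-allocated sub-caches, from the text */
#define SUB_TOKENS 128u   /* tokens per sub-cache, from the text */

typedef struct {
    uint32_t fork_pos;   /* main-stream length when the branch forked */
    uint32_t n_tokens;   /* speculative tokens written so far */
    uint8_t  in_use;
} Branch;

typedef struct {
    Branch   pool[POOL_SIZE];
    uint32_t main_len;   /* tokens committed to the main ephemeral stream */
} SpecState;

/* Fork a branch at the current end of the main stream.
 * Returns the pool slot, or -1 if the pool is exhausted. */
static int spec_fork(SpecState *s)
{
    assert(s != NULL);
    for (uint32_t i = 0; i < POOL_SIZE; i++) {
        if (!s->pool[i].in_use) {
            s->pool[i].in_use = 1;
            s->pool[i].fork_pos = s->main_len;
            s->pool[i].n_tokens = 0;
            return (int)i;
        }
    }
    return -1;
}

/* Record one more speculative token in a branch; -1 if full or unused. */
static int spec_append(SpecState *s, int slot)
{
    Branch *b = &s->pool[slot];
    if (!b->in_use || b->n_tokens == SUB_TOKENS) return -1;
    b->n_tokens++;
    return 0;
}

/* Reject a branch: release the whole sub-cache in O(1). */
static void spec_drop(SpecState *s, int slot) { s->pool[slot].in_use = 0; }

/* Accept a branch: its tokens join the main stream. Every slot is then
 * released, since sibling branches forked from a now-stale position. */
static uint32_t spec_promote(SpecState *s, int slot)
{
    Branch *b = &s->pool[slot];
    assert(b->in_use && b->fork_pos == s->main_len);
    s->main_len += b->n_tokens;
    for (uint32_t i = 0; i < POOL_SIZE; i++) s->pool[i].in_use = 0;
    return s->main_len;
}
```

Dropping a branch touches one flag rather than scanning the main buffer for dead entries, which is the property the design relies on.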
Implementation: Memory Layout and Threading
Both streams are backed by memory-mapped files on iOS and Android. The persistent stream maps a read-only file created at session start. The ephemeral stream maps a read-write file with MAP_SHARED semantics, allowing the main inference thread and the background summarization thread to coordinate via atomic flags in the header.
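On a POSIX system the two mappings could be created roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the read-only mapping uses `MAP_PRIVATE` (the text does not say which flag the persistent stream uses), and error handling is reduced to returning `NULL`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `len` bytes of `path`.
 * writable == 0: read-only private mapping (persistent stream).
 * writable != 0: read-write MAP_SHARED mapping (ephemeral stream);
 *                the file is created and sized if needed.
 * Returns NULL on failure. */
static void *map_stream(const char *path, size_t len, int writable)
{
    assert(path != NULL && len > 0);
    int fd = open(path, writable ? (O_RDWR | O_CREAT) : O_RDONLY, 0600);
    if (fd < 0) return NULL;
    if (writable && ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, len,
                   writable ? (PROT_READ | PROT_WRITE) : PROT_READ,
                   writable ? MAP_SHARED : MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping created by map_stream. Returns 0 on success. */
static int unmap_stream(void *p, size_t len)
{
    return munmap(p, len);
}
```

A header at the start of the shared mapping would hold the atomic flags that the inference and summarization threads use to coordinate.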
The KV cache itself is a struct-of-arrays layout: separate contiguous buffers for keys and values, each a float16 array. This enables vectorized attention computation via NEON intrinsics on ARM. The circular buffer indices are 32-bit atomics, updated via compare-and-swap to avoid locks on the hot path.
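```c
#include <arm_neon.h>    /* float16_t on ARM targets */
#include <stdatomic.h>
#include <stdint.h>

/* Per-turn bookkeeping; field names are illustrative. */
typedef struct { uint32_t first_token, last_token; uint64_t timestamp_ms; float relevance; } TurnMetadata;

struct EphemeralCache {
    float16_t *keys;              /* [1536][32][128] */
    float16_t *values;            /* [1536][32][128] */
    _Atomic uint32_t head;        /* circular buffer index, updated via CAS */
    _Atomic uint32_t tail;        /* circular buffer index, updated via CAS */
    TurnMetadata turns[64];
    _Atomic uint32_t turn_count;
};
```

Attention kernels read directly from these buffers without copying. The model's attention function takes a bitmask indicating which positions are valid (combining turn relevance and buffer occupancy). This mask is recomputed on every generation step, adding ~0.3ms of overhead per step (small against a 60fps frame budget) but enabling precise control over context windows.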
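Combining occupancy and relevance into a per-position bitmask might look like this sketch. The `TurnRange` type, the function names, and the representation of the mask as 24 64-bit words are assumptions made for illustration; the 1536-position capacity comes from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAPACITY   1536u
#define MASK_WORDS (CAPACITY / 64u)   /* 24 words of 64 bits */

typedef struct { uint32_t start, n_tokens; } TurnRange;

/* Set a bit for every cached position that belongs to a relevant turn.
 * Positions outside any live turn stay 0, which covers buffer occupancy.
 * `relevant` has bit t set when turn t passed the similarity threshold. */
static void build_position_mask(const TurnRange *turns, uint32_t n_turns,
                                uint64_t relevant, uint64_t mask[MASK_WORDS])
{
    assert(n_turns <= 64);
    memset(mask, 0, MASK_WORDS * sizeof(uint64_t));
    for (uint32_t t = 0; t < n_turns; t++) {
        if (!((relevant >> t) & 1u)) continue;
        for (uint32_t k = 0; k < turns[t].n_tokens; k++) {
            uint32_t pos = (turns[t].start + k) % CAPACITY;  /* ranges may wrap */
            mask[pos / 64] |= (uint64_t)1 << (pos % 64);
        }
    }
}

/* Non-zero when the attention kernel may read position `pos`. */
static int position_valid(const uint64_t mask[MASK_WORDS], uint32_t pos)
{
    return (int)((mask[pos / 64] >> (pos % 64)) & 1u);
}
```

A production kernel would likely set whole words at a time instead of looping per token, but the per-token loop keeps the wrap-around handling easy to follow.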
Real-World Performance: KidzCare and SafeChat
We deployed this architecture in two production apps. KidzCare, a speech therapy chatbot for children, runs a 3B parameter model on iPhone 14. Before dual-stream caching, sessions crashed after 8-10 minutes of continuous use (roughly 15 turns). With the new design, sessions run 45+ minutes without memory warnings, and the persistent stream's immutable profile data enables personalized feedback based on the child's speech patterns from earlier in the session.
SafeChat, a peer-to-peer encrypted messaging app with optional LLM assistance, uses dual-stream caching to maintain conversation context across multiple chat threads. The persistent stream holds the user's communication style preferences (formality, emoji usage, language), while ephemeral streams are per-thread. Switching between threads takes 12ms to swap ephemeral caches, compared to 400ms for full reinitialization in the previous architecture.
Both apps target 60fps UI responsiveness, which leaves about 16.7ms per frame for token generation. By keeping the working set (persistent + active ephemeral) under 600 tokens, we sustain 18ms average latency on A16 Bionic, with 95th percentile at 24ms, so generation runs slightly over a single frame on average. The dual-stream design's memory efficiency means we can allocate more RAM to model weights, enabling us to run a 4B model instead of 3B and improve output quality without sacrificing interactivity.
Tradeoffs and Future Directions
The primary cost of dual-stream caching is complexity. Developers must reason about two distinct memory regions with different semantics, and bugs in the circular buffer logic can cause silent context corruption. We mitigate this with extensive fuzz testing: a harness that generates random conversation patterns and validates that the model's attention masks match expected context windows.
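A harness of this kind can start small. The sketch below fuzzes only the turn-level circular buffer bookkeeping, which is narrower than validating attention masks against expected context windows; the ring implementation, the invariants chosen, and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define FUZZ_CAP       1536u
#define FUZZ_MAX_TURNS 64u

typedef struct { uint32_t start, n_tokens; } FuzzTurn;
typedef struct {
    FuzzTurn turns[FUZZ_MAX_TURNS];
    uint32_t first, count, head, used;
} FuzzRing;

/* System under test: append a turn, evicting whole turns from the oldest end. */
static void ring_push(FuzzRing *r, uint32_t n)
{
    assert(n > 0 && n <= FUZZ_CAP);
    while (r->used + n > FUZZ_CAP || r->count == FUZZ_MAX_TURNS) {
        r->used -= r->turns[r->first].n_tokens;
        r->first = (r->first + 1) % FUZZ_MAX_TURNS;
        r->count--;
    }
    FuzzTurn *t = &r->turns[(r->first + r->count) % FUZZ_MAX_TURNS];
    t->start = r->head;
    t->n_tokens = n;
    r->head = (r->head + n) % FUZZ_CAP;
    r->used += n;
    r->count++;
}

/* Invariants: capacity is respected, `used` equals the sum of live turns,
 * and live turns are contiguous in the token ring. */
static int ring_consistent(const FuzzRing *r)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < r->count; i++) {
        const FuzzTurn *t = &r->turns[(r->first + i) % FUZZ_MAX_TURNS];
        if (i > 0) {
            const FuzzTurn *prev = &r->turns[(r->first + i - 1) % FUZZ_MAX_TURNS];
            if (t->start != (prev->start + prev->n_tokens) % FUZZ_CAP) return 0;
        }
        sum += t->n_tokens;
    }
    return sum == r->used && r->used <= FUZZ_CAP;
}

/* Drive the ring with pseudo-random turn lengths. Returns 1 if every step
 * kept the invariants. A xorshift generator keeps runs reproducible. */
static int fuzz_ring(uint32_t seed, uint32_t steps)
{
    FuzzRing r = {0};
    uint32_t x = seed ? seed : 1u;
    for (uint32_t i = 0; i < steps; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        ring_push(&r, 1u + x % 400u);
        if (!ring_consistent(&r)) return 0;
    }
    return 1;
}
```

Seeding the generator explicitly means any failing run can be replayed exactly, which matters when the symptom is silent context corruption rather than a crash.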
Another limitation: the persistent stream's immutability assumption breaks down for apps where user state changes frequently. A fitness coaching app that updates daily metrics would thrash the persistent cache with append operations. For these use cases, a hybrid approach works better—store only truly static data (system prompt) in the persistent stream, and treat user state as high-priority ephemeral data with custom eviction logic.
Looking ahead, integrating this architecture with mixture-of-experts models is promising. Each expert could maintain its own ephemeral cache, sharing a common persistent stream. The router would select which expert's cache to attend to based on the current query, enabling even more aggressive memory sharing across specialized sub-models. Early experiments suggest this could support 10B parameter MoE models on mobile devices with the same memory budget as current 4B dense models.