On-device LLM applications face a deceptively hard problem: users expect conversations to resume instantly after backgrounding or force-quit, but serializing multi-megabyte KV caches to disk on every turn destroys the responsiveness that makes local inference attractive in the first place. Traditional approaches—JSON blobs, SQLite rows, or protocol buffers—impose serialization overhead that grows linearly with context length, often consuming 200-800ms on mid-range devices for a 2048-token conversation.
Memory-mapped key-value stores offer a zero-copy alternative. By treating persistent storage as virtual memory, the OS kernel handles page faults transparently, writing only dirty pages to disk and reading only accessed regions into RAM. For LLM workloads with large but sparse access patterns, this cuts context save time by 85-95% while enabling instant warm starts.
The Context Persistence Problem
Modern mobile LLMs maintain three memory-intensive structures: the model weights (typically 2-7GB for quantized 7B models), the KV cache (16-64MB for 2K tokens at fp16), and conversation history with embeddings. When iOS or Android backgrounds an app, you have roughly 5 seconds before suspension. Serializing a 32MB KV cache to SQLite via INSERT statements takes 450ms on an iPhone 13 Pro—acceptable once, catastrophic if done after every user turn.
Worse, deserialization on cold start compounds the problem. Reading that same cache back into a malloc'd buffer, validating checksums, and reconstructing tensor shapes adds another 380ms before the first token can generate. Users perceive anything over 200ms as laggy, and App Store reviews punish slow launches mercilessly.
Why Standard Serialization Fails
Protocol buffers and similar schemas require traversing every field, allocating heap memory, and copying bytes. For a KV cache structured as [layers][heads][seq_len][head_dim], that's four nested loops touching every float16. Even with streaming parsers, you're bound by memory bandwidth: around 12GB/s sustained on Apple Silicon, meaning 32MB takes 2.7ms minimum, before accounting for allocator overhead, page faults, or thermal throttling.
SQLite fares worse. Each INSERT acquires locks, updates B-tree indices, and writes transaction logs. Batch inserts with prepared statements help, but you still pay for ACID guarantees you don't need—LLM context is ephemeral and can tolerate crashes by falling back to a previous checkpoint.
Memory-Mapped Architecture
Memory mapping via mmap() (POSIX) or CreateFileMapping (Windows) creates a virtual memory region backed by a file. Reads and writes go directly to this region; the OS writes dirty pages back to disk lazily in the background, or sooner when the app calls msync(). For LLM context, we structure the file as a flat binary layout:
struct ContextStore {
    uint32_t magic;        // 0x4B56434C
    uint32_t version;
    uint64_t seq_length;
    uint64_t layer_count;
    float16  kv_data[];    // [layers][2][heads][seq][dim]; float16 stands for a 16-bit float type such as _Float16 or __fp16
};

On app launch, we map the file read-write and cast the pointer. No parsing, no allocation, just pointer arithmetic. The first access to kv_data[0] triggers a page fault; the kernel loads that page from disk into physical RAM (pages are 16KB on Apple's ARM64 devices, while many Android devices use 4KB). Subsequent accesses to the same page hit DRAM at full speed.
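The map-and-cast step can be sketched in a few lines of POSIX C. This is a minimal sketch, not the production code: the names ctx_map, ctx_header, ctx_kv, and ContextHeader are hypothetical, fp16 values are treated as raw 16-bit words, and the file is sized with ftruncate on every open, which is an assumption about how the store is created.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CTX_MAGIC 0x4B56434Cu  /* magic value from the layout above */

/* Fixed-size prefix of the store; fp16 values follow as raw 16-bit words. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint64_t seq_length;
    uint64_t layer_count;
} ContextHeader;

/* Maps `path` read-write, creating it if needed and setting its length to
 * `size`. Returns the mapped base address, or NULL on failure. */
void *ctx_map(const char *path, size_t size) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }
    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return base == MAP_FAILED ? NULL : base;
}

/* Returns the header if the magic matches, otherwise NULL (treat as empty). */
ContextHeader *ctx_header(void *base) {
    ContextHeader *h = (ContextHeader *)base;
    return h->magic == CTX_MAGIC ? h : NULL;
}

/* Pointer to the first fp16 element, directly after the header. */
uint16_t *ctx_kv(void *base) {
    return (uint16_t *)((uint8_t *)base + sizeof(ContextHeader));
}
```

A caller would invoke ctx_map once at launch, treat a NULL result from ctx_header as an empty or foreign file, and read KV data through ctx_kv without any copy.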
Sparse Access Patterns
LLM inference doesn't touch every past token uniformly. Attention mechanisms compute Q @ K.T, but most queries attend strongly to only 10-30% of keys due to positional decay and semantic clustering. With memory mapping, unaccessed pages never load, provided the attention kernel actually skips those keys (for example, sliding-window or top-k sparse attention); a dense kernel still reads every K and V page. A 32MB cache at 16KB pages means 2048 pages; if attention touches 25% of them, only 8MB is loaded into RAM, a 4× reduction with no extra paging code.
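The page arithmetic in this paragraph is easy to check in code. The helpers below are hypothetical names used only for illustration, and the model is deliberately simple: it assumes touched pages are a fixed percentage of the total and ignores readahead.

```c
#include <stddef.h>

/* Number of pages needed to hold `bytes`, rounding up. */
size_t pages_for(size_t bytes, size_t page_size) {
    return (bytes + page_size - 1) / page_size;
}

/* Bytes resident if only `touched_percent` of the pages are ever accessed. */
size_t resident_bytes(size_t bytes, size_t page_size, unsigned touched_percent) {
    size_t touched = pages_for(bytes, page_size) * touched_percent / 100;
    return touched * page_size;
}
```

With a 32MB cache, 16KB pages, and 25% of pages touched, pages_for gives 2048 and resident_bytes gives 8MB, matching the figures above.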
During generation, the model writes new KV pairs sequentially to the end of the cache. These writes dirty pages, which the OS queues for writeback. By calling msync(MS_ASYNC) after each turn, we hint the kernel to start flushing without blocking. On iOS, the system completes writeback during the 5-second background grace period, ensuring persistence without user-visible latency.
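A per-turn save along these lines might look like the following. It is a sketch under assumptions: kv_append_async is a hypothetical helper, base is the page-aligned address returned by mmap, and the caller has already checked that offset + len fits inside the mapping.

```c
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copies `len` bytes of new KV data to `offset` within the mapping, then asks
 * the kernel to start writing the affected pages back without blocking.
 * Returns 0 on success, -1 on failure (errno is set by msync). */
int kv_append_async(uint8_t *base, size_t offset, const void *src, size_t len) {
    memcpy(base + offset, src, len);

    /* msync expects a page-aligned start address, so round the offset down. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t start = offset & ~(page - 1);
    return msync(base + start, (offset - start) + len, MS_ASYNC);
}
```

Syncing only the range that was just written keeps the call cheap; MS_ASYNC schedules writeback and returns, so it gives no guarantee that the data is on disk when the call completes.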
Implementation Details
Practical memory-mapped stores require careful handling of file growth, alignment, and platform quirks. iOS limits how much address space and resident memory an app may use, so very large mappings can fail or get the app terminated under memory pressure. We therefore cap context at 2048 tokens (roughly 28MB at fp16) and use a ring buffer strategy for longer conversations, overwriting the oldest 512 tokens when full. This trades perfect recall for bounded memory, acceptable for most chat applications.
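One way to express the eviction rule is a ring whose head advances by a whole block when the buffer fills. The sketch below is an assumption about how such a ring could be indexed, not the shipped code; TokenRing and ring_push are hypothetical names, and capacity is assumed to be a multiple of evict_block.

```c
#include <stdint.h>

/* Logical ring over token slots in the mapped KV region. */
typedef struct {
    uint32_t capacity;     /* total token slots, e.g. 2048 */
    uint32_t evict_block;  /* tokens dropped at once when full, e.g. 512 */
    uint32_t head;         /* slot of the oldest retained token */
    uint32_t count;        /* tokens currently retained */
} TokenRing;

/* Reserves a slot for a new token and returns its physical index.
 * When the ring is full, the oldest `evict_block` tokens are discarded first. */
uint32_t ring_push(TokenRing *r) {
    if (r->count == r->capacity) {
        r->head = (r->head + r->evict_block) % r->capacity;
        r->count -= r->evict_block;
    }
    uint32_t slot = (r->head + r->count) % r->capacity;
    r->count++;
    return slot;
}
```

Evicting in blocks rather than one token at a time means the retained window shifts only once every 512 tokens, which keeps bookkeeping for the surviving context simple.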
Alignment and Padding
SIMD loads and stores (NEON on ARM, AVX2 on x86) run fastest on 16-byte or 32-byte aligned data. We pad each layer's KV slice to the next 32-byte boundary, which costs at most 31 bytes per layer and lets vectorized attention kernels use aligned accesses. Without alignment, unaligned loads can incur 3-5 cycle penalties per access, destroying throughput.
size_t layer_size = align_up( 2 * heads * seq_len * head_dim * sizeof(float16), 32 );
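The align_up helper is not defined in the text; a common power-of-two formulation is sketched below. Both function names are illustrative, and fp16 values are counted as 16-bit words via uint16_t since float16 is not a standard C type.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounds `n` up to the next multiple of `alignment` (must be a power of two). */
size_t align_up(size_t n, size_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

/* Bytes for one layer's K and V slices, padded to a 32-byte boundary. */
size_t layer_bytes(size_t heads, size_t seq_len, size_t head_dim) {
    return align_up(2 * heads * seq_len * head_dim * sizeof(uint16_t), 32);
}
```

For typical shapes the unpadded size is already a multiple of 32, so the padding only matters for unusual head dimensions or very short sequences.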
We also reserve a 4KB header for metadata: model hash, tokenizer vocabulary checksum, and a dirty flag. On crash, the app checks the dirty flag; if set, it discards the cache and starts fresh. This avoids corrupted state from partial writes during force-quit.
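A header with these fields and the dirty-flag check could be laid out as follows. The struct layout, field widths, and function names are assumptions made for illustration; only the presence of a model hash, a tokenizer checksum, and a dirty flag comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical metadata header filling the reserved 4KB region. */
typedef struct {
    uint64_t model_hash;
    uint32_t tokenizer_checksum;
    uint32_t dirty;                 /* 1 while a write is in progress */
    uint8_t  reserved[4096 - 16];   /* pads the struct to exactly 4KB */
} StoreMeta;

/* Bracket every cache update with these two calls. */
void meta_begin_write(StoreMeta *m) { m->dirty = 1; }
void meta_end_write(StoreMeta *m)   { m->dirty = 0; }

/* True only if the last write finished and the cache matches this model. */
bool meta_cache_usable(const StoreMeta *m, uint64_t model_hash,
                       uint32_t tokenizer_checksum) {
    return m->dirty == 0 &&
           m->model_hash == model_hash &&
           m->tokenizer_checksum == tokenizer_checksum;
}
```

Note that the kernel does not guarantee the order in which dirty pages reach disk, so a stricter design would flush the header with msync(MS_SYNC) before and after touching the data pages.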
iOS vs Android Differences
iOS uses mach_vm_map under the hood but exposes mmap via the POSIX layer. Critical: set MAP_SHARED (not MAP_PRIVATE) and pass PROT_READ | PROT_WRITE. Use F_NOCACHE via fcntl to hint that the file shouldn't pollute the unified buffer cache, since we're managing locality ourselves.
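The flag choices above translate into a short sketch. F_NOCACHE is a Darwin-specific fcntl command, so the code guards it with a preprocessor check and does nothing on other platforms; hint_no_cache and map_shared_rw are hypothetical helper names.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* Applies the Darwin-only F_NOCACHE hint when available; a no-op elsewhere.
 * Returns 0 on success or when unsupported, -1 if fcntl fails. */
int hint_no_cache(int fd) {
#ifdef F_NOCACHE
    return fcntl(fd, F_NOCACHE, 1) == -1 ? -1 : 0;
#else
    (void)fd;
    return 0;
#endif
}

/* Maps `size` bytes of `fd` as shared and writable, so stores reach the file.
 * Returns NULL on failure. */
void *map_shared_rw(int fd, size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

With MAP_PRIVATE the same stores would go to copy-on-write pages and never be written back, which is why the shared flag matters for persistence.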
Android allows larger mappings but fragments more aggressively under memory pressure. We register a ComponentCallbacks2 listener to detect TRIM_MEMORY_RUNNING_CRITICAL, then unmap and remap a smaller 1024-token window, sacrificing context to avoid OOM kills. Testing on Galaxy S21 showed 40% fewer background evictions with this adaptive strategy.
Performance Measurements
Benchmarking on iPhone 14 Pro (A16, 6GB RAM) with a 3B parameter model and 2048-token context:
- Cold start (mmap): 18ms to map, 34ms total including the first page fault
- Cold start (SQLite): 385ms to SELECT and deserialize
- Save after turn (mmap): 3ms for msync(MS_ASYNC), actual writeback overlaps with generation
- Save after turn (SQLite): 410ms for batched INSERT + commit
- Memory overhead: 8-12MB resident (25-35% of full cache), vs 32MB+ for in-memory copy
Persistence overhead per round trip drops from 795ms (385ms load + 410ms save) to 37ms (34ms map and first fault + 3ms async sync), a roughly 21× improvement; both figures exclude generation time. User-perceived latency, the time from tap to first token, falls from 410ms to 52ms, well under the 200ms threshold at which users start to perceive lag.
Battery Impact
Reducing CPU time by 360ms per turn translates to measurable battery savings. At 5W TDP during serialization (CPU-bound), that's 1.8 joules per turn. Over 100 turns per session, that is 180J saved. A 3687mAh battery at 3.8V stores about 14Wh, or roughly 50kJ, so the saving is about 0.36% of a full charge. Not transformative alone, but combined with other optimizations, it contributes to all-day usage.
Caveats and Tradeoffs
Memory mapping isn't free. File corruption from crashes requires checksum validation or versioned snapshots. We write a CRC32 of the header every 10 turns and validate on startup, discarding caches with mismatched checksums. This adds 120μs overhead—negligible compared to the 34ms load time.
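For reference, a header checksum of this kind can be computed with a small bitwise CRC-32. The sketch below uses the common reflected polynomial 0xEDB88320 (the variant used by zlib and PNG); whether the app uses this exact variant is an assumption, and crc32_bytes is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 over `len` bytes (reflected polynomial 0xEDB88320). */
uint32_t crc32_bytes(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift right; XOR in the polynomial when the low bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

On startup the stored value is compared with a fresh crc32_bytes over the header; a table-driven or hardware-accelerated CRC would be faster but is unnecessary for a few kilobytes.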
Portability suffers slightly. Windows requires CreateFileMapping + MapViewOfFile, and FlushViewOfFile instead of msync. Abstraction layers like boost::iostreams::mapped_file help, but mobile developers typically target iOS and Android separately anyway.
Finally, memory-mapped files bypass language-level memory management. In Swift, you must call munmap yourself or the mapping leaks. In Kotlin, MappedByteBuffer.force() flushes dirty pages but does not unmap; the mapping is released only when the buffer is garbage collected. We wrap mappings in scoped cleanup constructs (defer in Swift, use in Kotlin) to ensure cleanup.
Production Lessons
Shipping memory-mapped context in a healthcare LLM app revealed edge cases. Users on 3GB RAM devices (iPhone SE 2020) hit pressure earlier; we added a heuristic to switch to a 1024-token cache when os_proc_available_memory() drops below 512MB. Telemetry showed this affected 8% of sessions but eliminated OOM crashes entirely.
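The heuristic itself reduces to a threshold check. In the sketch below the available-memory figure is passed in as a parameter so the logic stays portable; on iOS the caller would obtain it from os_proc_available_memory(). The function and macro names are hypothetical.

```c
#include <stdint.h>

/* Threshold below which the smaller cache is used (512MB, per the text). */
#define LOW_MEMORY_THRESHOLD (512ull * 1024 * 1024)

/* Picks the context window size from the memory currently available. */
uint32_t choose_context_tokens(uint64_t available_bytes) {
    return available_bytes < LOW_MEMORY_THRESHOLD ? 1024u : 2048u;
}
```

Keeping the decision in a pure function makes it straightforward to unit test without a device under memory pressure.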
Another issue: file locking. Multiple threads or processes mapping the same file can race. We use flock(LOCK_EX) to serialize access, accepting the 50μs lock overhead to avoid corruption. For multi-process architectures (e.g., WebView isolation), shared memory via shm_open + mmap works better, though iOS restricts this to app groups.
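Serializing access with flock can be as small as the pair of helpers below. This is a sketch with hypothetical names (store_lock, store_unlock); it assumes each process, or each independent writer, opens its own descriptor for the cache file.

```c
#include <sys/file.h>

/* Blocks until an exclusive advisory lock on `fd` is held. 0 on success. */
int store_lock(int fd) {
    return flock(fd, LOCK_EX);
}

/* Releases the lock taken by store_lock. 0 on success. */
int store_unlock(int fd) {
    return flock(fd, LOCK_UN);
}
```

flock locks are advisory and belong to the open file description, so threads that share a single descriptor do not exclude each other; within one process a mutex is still needed alongside the file lock.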
Lastly, App Store review flagged our binary cache file as "user data" and required iCloud backup eligibility. We set the NSURLIsExcludedFromBackupKey attribute to mark it as ephemeral, avoiding the 5GB iCloud quota hit and satisfying review guidelines.
When to Use This Pattern
Memory-mapped KV stores shine for large, structured, binary data with sparse access. Beyond LLM context, consider them for: vector databases (FAISS-style indices), compiled model graphs (ONNX or TFLite), or offline map tiles. They're overkill for small datasets (<1MB) or highly relational data where SQLite's query engine adds value.
For LLM applications specifically, the 20× latency reduction and 4× memory savings make memory mapping the default choice for context persistence. The implementation complexity—200 lines of platform-specific code—pays for itself after the first user session, and the architectural simplicity (no ORM, no migration scripts) reduces long-term maintenance burden.