Memory-Mapped LLM Inference: iOS mmap() Deep Dive

Shipping large language models on mobile devices forces hard tradeoffs between model size, inference speed, and memory footprint. A 7B parameter quantized model weighs 4–5GB on disk. Loading that into RAM naively—reading the file, deserializing weights, copying to inference buffers—can take 8+ seconds on an iPhone 12 and exhaust available memory, triggering system kills.

Memory mapping via mmap() offers a radically different approach: map the model file directly into the process address space without upfront reads. The kernel loads pages on-demand as the inference engine touches them. This article dissects how memory-mapped LLM inference works on iOS, the architectural constraints that make it viable, and the performance wins we measured shipping OfflineAI with a 7B Llama model.

Why Traditional File I/O Fails for Multi-GB Models

Standard file reading (FileHandle.readDataToEndOfFile() or fread()) copies bytes from disk into userspace buffers. For a 4.2GB GGUF file, this means:

Allocating 4.2GB of contiguous virtual memory
Reading 4.2GB sequentially from flash storage (~3–6s on modern iPhones)
Parsing the GGUF header and tensor metadata
Copying weight tensors into Metal buffers or CPU arrays

Peak memory usage doubles: the file buffer plus inference buffers. On a 4GB device, iOS kills the process before inference starts. Even on 6GB devices, the delay frustrates users—8 seconds staring at a spinner to answer a prompt.

Lazy Loading Is Not Enough

Chunked reads (load tensors as needed) reduce peak memory but don't solve latency. The first token still waits for attention weights and embedding tables—often 60–70% of model size. Prefetching introduces complexity: predict which tensors to load, manage eviction, handle cache misses. For on-device inference where models rarely change, this engineering cost outweighs benefits.

Memory Mapping: Kernel-Managed Paging

POSIX mmap() maps a file into the process address space without reading it. The kernel reserves virtual memory but doesn't load pages until the process accesses them. When inference code reads a weight tensor, a page fault triggers, the kernel loads the 16KB page from disk, and execution resumes. Subsequent accesses hit the page cache—no disk I/O.

iOS Implementation with GGUF Files

GGUF (GPT-Generated Unified Format) stores tensors contiguously after a header. Each tensor's offset is known upfront. We map the entire file read-only:

let fd = open(modelPath, O_RDONLY)
if fd < 0 { /* handle error */ }
let fileSize = Int(lseek(fd, 0, SEEK_END)) // lseek returns off_t; mmap takes an Int length
lseek(fd, 0, SEEK_SET)
let ptr = mmap(nil, fileSize, PROT_READ, MAP_PRIVATE, fd, 0)
if ptr == MAP_FAILED { /* handle error */ }
close(fd) // fd can close; mapping persists

MAP_PRIVATE gives the mapping copy-on-write semantics, though with PROT_READ we never write, so no private copies are made. The file descriptor can close immediately because the mapping holds its own reference to the vnode. Now ptr is a raw pointer to 4.2GB of virtual memory backed by the model file.
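The mapping step can also be wrapped in a small helper with explicit error handling. The following is a minimal sketch, not the OfflineAI implementation: the names MapError, mapReadOnly, and unmap are hypothetical, and it uses fstat() rather than lseek() to get the file size.

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

// Hypothetical error type for this sketch; not part of any library.
enum MapError: Error {
    case openFailed(code: Int32)
    case statFailed(code: Int32)
    case emptyFile
    case mmapFailed(code: Int32)
}

/// Maps the file at `path` read-only and returns a buffer covering all of it.
/// No file data is read here; pages are faulted in when first touched.
func mapReadOnly(path: String) throws -> UnsafeRawBufferPointer {
    let fd = open(path, O_RDONLY)
    guard fd >= 0 else { throw MapError.openFailed(code: errno) }
    defer { close(fd) }  // the mapping stays valid after the descriptor closes

    var info = stat()
    guard fstat(fd, &info) == 0 else { throw MapError.statFailed(code: errno) }
    let size = Int(info.st_size)
    guard size > 0 else { throw MapError.emptyFile }  // zero-length mappings are rejected

    let raw = mmap(nil, size, PROT_READ, MAP_PRIVATE, fd, 0)
    guard let base = raw, raw != MAP_FAILED else {
        throw MapError.mmapFailed(code: errno)
    }
    return UnsafeRawBufferPointer(start: base, count: size)
}

/// Releases a mapping created by `mapReadOnly(path:)`.
func unmap(_ buffer: UnsafeRawBufferPointer) {
    guard let base = buffer.baseAddress else { return }
    munmap(UnsafeMutableRawPointer(mutating: base), buffer.count)
}
```

Returning an UnsafeRawBufferPointer instead of a bare pointer keeps the length next to the base address, so later bounds checks do not need a separate size variable.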

Parsing Headers Without Loading Data

GGUF headers are tiny (tens of KB): magic bytes, version, tensor count, metadata. Reading the header faults in a few pages. We parse tensor names, shapes, offsets, and quantization formats without touching weight data:

struct GGUFTensor {
  let name: String
  let offset: Int
  let shape: [Int]
  let type: GGMLType // Q4_0, Q8_0, F16, etc.
}
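As an illustration of how little of the file header parsing touches, here is a sketch that reads only the fixed-size fields at the start of a mapped GGUF file. It assumes the GGUF v2/v3 layout (4-byte magic "GGUF", 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, all little-endian); the type and function names are hypothetical, and parsing the variable-length metadata and tensor infos is left out.

```swift
/// Fixed-size fields at the start of a GGUF v2/v3 file (assumed layout).
struct GGUFFixedHeader {
    let version: UInt32
    let tensorCount: UInt64
    let metadataKVCount: UInt64
}

// Hypothetical error type for this sketch.
enum GGUFHeaderError: Error { case tooShort, badMagic }

/// Assembles a little-endian unsigned integer byte by byte, so the read
/// works at any offset regardless of alignment.
func readLittleEndian<T: FixedWidthInteger & UnsignedInteger>(
    _ type: T.Type, from file: UnsafeRawBufferPointer, at offset: Int
) -> T {
    var value: T = 0
    for i in 0..<MemoryLayout<T>.size {
        value |= T(file[offset + i]) << (8 * i)
    }
    return value
}

/// Validates the magic bytes and reads the 24-byte fixed header.
/// Only the first page of the mapping is touched.
func readFixedHeader(_ file: UnsafeRawBufferPointer) throws -> GGUFFixedHeader {
    guard file.count >= 24 else { throw GGUFHeaderError.tooShort }
    let magic: [UInt8] = [0x47, 0x47, 0x55, 0x46]  // ASCII "GGUF"
    guard file.prefix(4).elementsEqual(magic) else { throw GGUFHeaderError.badMagic }
    return GGUFFixedHeader(
        version: readLittleEndian(UInt32.self, from: file, at: 4),
        tensorCount: readLittleEndian(UInt64.self, from: file, at: 8),
        metadataKVCount: readLittleEndian(UInt64.self, from: file, at: 16)
    )
}
```

Checking the length and magic before reading anything else means a truncated or corrupted download fails with an error instead of producing nonsense tensor counts.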

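GGUF stores each tensor's offset relative to the start of the tensor data section, so the parser adds that section's position once and keeps offsets relative to the mapping base. When the inference engine needs the model.layers.0.attention.wq tensor, we compute ptr + offset and cast to the appropriate type. The first access faults in pages; subsequent tokens hit cache.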
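The offset arithmetic can be sketched as follows. In the GGUF format the stored tensor offset is relative to the start of the tensor data section, which begins at the first aligned position after the tensor infos (32 bytes by default). The function names and the tensorInfoEnd parameter are hypothetical; the sketch assumes a parser has already found where the tensor infos end and knows each tensor's byte size.

```swift
/// Rounds `offset` up to the next multiple of `alignment`.
func alignUp(_ offset: Int, to alignment: Int) -> Int {
    return (offset + alignment - 1) / alignment * alignment
}

// Hypothetical error type for this sketch.
enum TensorLookupError: Error { case invalidArguments, outOfBounds }

/// Returns a view of one tensor's bytes inside the mapped file.
/// - file: the whole mapped file
/// - tensorInfoEnd: byte position where the header and tensor infos end (assumed known)
/// - relativeOffset: the tensor's offset as stored in the file
/// - byteCount: the tensor's size in bytes
func tensorBytes(
    in file: UnsafeRawBufferPointer,
    tensorInfoEnd: Int,
    relativeOffset: Int,
    byteCount: Int,
    alignment: Int = 32
) throws -> UnsafeRawBufferPointer {
    guard tensorInfoEnd >= 0, relativeOffset >= 0, byteCount >= 0, alignment > 0 else {
        throw TensorLookupError.invalidArguments
    }
    // The data section starts at the first aligned byte after the tensor infos.
    let dataStart = alignUp(tensorInfoEnd, to: alignment)
    let start = dataStart + relativeOffset
    // Reject ranges that run past the end of the mapping.
    guard byteCount <= file.count, start <= file.count - byteCount else {
        throw TensorLookupError.outOfBounds
    }
    // Slicing is pure pointer arithmetic; no page is touched until the bytes are read.
    return UnsafeRawBufferPointer(rebasing: file[start..<(start + byteCount)])
}
```

The bounds check matters with mapped files: reading past the end of the mapping raises a signal rather than a catchable Swift error.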

Performance Wins: Launch Time and Memory

We benchmarked a 7B Llama 2 model (4.2GB Q4_0 quantized) on iPhone 12 (4GB RAM) and iPhone 14 Pro (6GB RAM). Metrics: cold launch (app not in memory) to first token generated.

Cold Launch Latency

Read + deserialize: 8.1s (iPhone 12), 6.9s (iPhone 14 Pro)
mmap + lazy load: 340ms (both devices)

The mmap approach is roughly 20–24× faster (8.1s and 6.9s versus 340ms). The 340ms includes app initialization, GGUF header parsing, and Metal command buffer setup. First token generation triggers page faults for embedding tables and initial attention layers, about 600MB of actual data loaded. Subsequent tokens see sub-100ms latency as pages remain cached.

Memory Footprint

Traditional loading: ~8GB peak (4.2GB file buffer + 3.8GB inference buffers). Memory-mapped: ~4.5GB peak (no file buffer; only inference buffers + faulted pages). Because the mapped pages are clean and file-backed, the kernel can discard unused model pages under pressure and re-read them from the file later, keeping the app alive on 4GB devices.

Warm Launch

If the app was recently active, iOS keeps pages in the unified buffer cache. Warm launch drops to 180ms—just app initialization and header parsing. No disk I/O. This is critical for background inference or Siri integration where the model must respond instantly.

Architectural Constraints and Tradeoffs

File Format Matters

Memory mapping works because GGUF is append-only and alignment-friendly. Tensors are 32-byte aligned; no pointer fixup needed. Formats requiring deserialization (PyTorch .pth, TensorFlow SavedModel) don't benefit—you still parse and copy. GGUF and ONNX (with external data) are designed for this.

Page Fault Latency

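Each unique 16KB page costs ~1–2ms to fault in from flash. A 4.2GB model spans roughly 260,000 pages. If inference faulted in every page one at a time, total fault time would be roughly 260–520 seconds. In practice, autoregressive generation accesses [remainder of this passage lost in extraction; the listing below resumes partway through an initializer]

    // [Start of listing lost in extraction: the class declaration, its ptr, size,
    // and tensors properties, and the initializer line that opens fd.]
    guard fd >= 0 else { throw ModelError.openFailed }
    defer { close(fd) }
    size = Int(lseek(fd, 0, SEEK_END)) // lseek returns off_t; mmap and munmap take Int
    lseek(fd, 0, SEEK_SET)
    ptr = mmap(nil, size, PROT_READ, MAP_PRIVATE, fd, 0)
    guard ptr != MAP_FAILED else { throw ModelError.mmapFailed }
    try parseHeader()
  }

  private func parseHeader() throws {
    // Read magic, version, tensor count from ptr
    // Populate tensors dictionary with offsets
  }

  func tensorData(name: String) -> UnsafeRawPointer? {
    guard let tensor = tensors[name] else { return nil }
    // mmap returns a mutable raw pointer; convert it for the read-only return type
    return UnsafeRawPointer(ptr.advanced(by: tensor.offset))
  }

  deinit { munmap(ptr, size) }
}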

Inference code requests tensors by name; the first access faults pages in. No explicit loading logic.

Lessons from Production

Shipping OfflineAI with mmap'd models taught us:

Validate early: Check GGUF magic bytes before mapping. Corrupted downloads crash hard if you map garbage.
Monitor page faults: Instruments' Virtual Memory trace shows fault patterns. We discovered excessive faults from unaligned Metal buffer reads and fixed alignment.
Test on old devices: iPhone X (3GB RAM) is the floor. Memory mapping kept us viable there; traditional loading failed.
Prefetch strategically: madvise(MADV_WILLNEED) on embedding tables during app launch (while showing UI) hides fault latency for the first prompt.
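
The prefetch hint from the last lesson can be sketched as below. This is an illustration under assumptions, not the shipped code: the function names are hypothetical, and the region passed in is assumed to lie inside an existing mapping (for example, the bytes of an embedding table). madvise() expects a page-aligned start address, so the helper rounds the start down to a page boundary first.

```swift
#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif

/// Expands a byte range so that it starts on a page boundary.
func pageAlignedSpan(address: Int, count: Int, pageSize: Int) -> (start: Int, length: Int) {
    let start = address - (address % pageSize)
    return (start, address - start + count)
}

/// Hints to the kernel that `region` will be read soon.
/// Returns false if the hint was rejected; the hint is advisory either way,
/// and reads stay correct even when it fails.
@discardableResult
func prefetch(_ region: UnsafeRawBufferPointer) -> Bool {
    guard let base = region.baseAddress, region.count > 0 else { return true }
    let pageSize = Int(sysconf(Int32(_SC_PAGESIZE)))
    let span = pageAlignedSpan(
        address: Int(bitPattern: base), count: region.count, pageSize: pageSize)
    guard let start = UnsafeMutableRawPointer(bitPattern: span.start) else { return false }
    return madvise(start, span.length, MADV_WILLNEED) == 0
}
```

Calling prefetch on the embedding table's bytes from a background queue during launch lets the reads overlap with UI setup; how much latency this hides depends on the device and on what is already in the page cache.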

When Not to Use Memory Mapping

If your model updates frequently (fine-tuning on-device), memory mapping complicates writes. You'd need copy-on-write or separate write buffers. For models