Memory-Mapped Model Weights: iOS LLM Loading

The Cold Start Problem in Mobile LLMs

Shipping a 1.2GB quantized language model in a mobile app presents an initialization bottleneck that traditional file I/O cannot solve gracefully. Reading the entire weight tensor binary into heap memory during app launch consumes 6–8 seconds on mid-tier devices, burns through battery during disk reads, and risks memory pressure warnings from iOS when resident memory spikes above 1.5GB.

The standard approach, Data(contentsOf: modelURL) in Swift or fread() in C, forces the kernel to copy every byte from flash storage into the application heap before inference can begin. For a 7B parameter model quantized to 4-bit (roughly 3.5GB on disk, 1.2GB after compression), this synchronous read blocks the calling thread; when that is the main thread, it degrades perceived app responsiveness.

Memory-mapped files offer an elegant alternative: the operating system maps the file directly into the process's virtual address space without performing upfront I/O. Pages are loaded on-demand as the inference engine accesses weight tensors, transforming a monolithic 8-second read into a series of sub-millisecond page faults distributed across the first few inference passes.
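For callers who prefer to stay within Foundation, the contrast with a plain Data(contentsOf:) read can be sketched with the .alwaysMapped reading option, which asks Foundation to map the file instead of copying it into the heap. This is a minimal sketch, not the loader described in this article: loadMappedWeights and the temporary demo file are hypothetical names introduced for illustration.

```swift
import Foundation

/// Hypothetical helper: returns the file's contents as memory-mapped Data.
/// `.alwaysMapped` requests a mapping rather than an eager read into the heap.
func loadMappedWeights(at url: URL) throws -> Data {
    return try Data(contentsOf: url, options: .alwaysMapped)
}

// Demo: a small temporary file stands in for a real weights file.
let demoURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("demo-weights-\(UUID().uuidString).bin")
try Data(repeating: 0xAB, count: 64 * 1024).write(to: demoURL)

let weights = try loadMappedWeights(at: demoURL)
print(weights.count)   // size of the mapped file in bytes

try FileManager.default.removeItem(at: demoURL)
```

The raw mmap() call discussed in the next section gives more control over flags and lifetime, which is why engines written in C or C++ typically use it directly.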

mmap() Mechanics for Model Weights

The POSIX mmap() system call establishes a mapping between a file descriptor and a region of virtual memory. When you request a pointer to a 1.2GB model file, the kernel allocates virtual address space but does not immediately read from disk. Instead, it marks those pages as not present in the page table.

On first access to a weight tensor—say, the embedding layer at offset 0x400000—the CPU triggers a page fault. The kernel's fault handler reads the corresponding 16KB page from disk into physical RAM, updates the page table entry, and resumes execution. Subsequent accesses to that page proceed at DRAM speed with no additional I/O.
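To make the fault granularity concrete, the following sketch computes which page a given file offset falls in. It assumes the 16KB page size quoted above as a constant; pageLocation is a hypothetical helper, and production code would query the page size at runtime rather than hard-code it.

```swift
// Assumed page size for illustration (16KB, as in the text).
let pageSize = 16 * 1024

/// Returns the index of the page containing `offset` and the byte position inside that page.
func pageLocation(of offset: Int, pageSize: Int) -> (page: Int, byteInPage: Int) {
    return (offset / pageSize, offset % pageSize)
}

// The embedding-layer offset used in the example above.
let embedding = pageLocation(of: 0x400000, pageSize: pageSize)
print(embedding.page, embedding.byteInPage)   // 256 0
```

The first read at offset 0x400000 therefore faults in page 256 only; neighboring pages stay untouched until they are accessed.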

In Swift, the pattern looks like this:

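import Foundation

enum ModelError: Error { case fileNotFound, statFailed, mmapFailed }

// Open read-only; the descriptor can be closed once the mapping exists.
let fd = open(modelPath, O_RDONLY)
guard fd >= 0 else { throw ModelError.fileNotFound }
defer { close(fd) }

// Query the file size so the mapping covers the whole file.
var st = stat()
guard fstat(fd, &st) == 0 else { throw ModelError.statFailed }
let fileSize = Int(st.st_size)

// Reserve address space only; no weight data is read from disk yet.
let ptr = mmap(nil, fileSize, PROT_READ, MAP_PRIVATE, fd, 0)
guard ptr != MAP_FAILED else { throw ModelError.mmapFailed }
// Unmap on scope exit, so inference must finish within this scope.
defer { munmap(ptr, fileSize) }

let buffer = UnsafeRawBufferPointer(start: ptr, count: fileSize)
// Pass buffer to ONNX Runtime or llama.cpp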

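The MAP_PRIVATE flag makes the mapping copy-on-write: any writes would remain in process memory without modifying the underlying file. With PROT_READ the mapping is read-only in any case, which is ideal for inference: the pages stay clean and file-backed, so the kernel can share them across processes that map the same model and can discard and later re-read them when memory runs low, reducing system-wide memory pressure.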
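Because the mapping must outlive every inference call that reads from it, one option is to tie munmap() to object lifetime instead of a scope-level defer. The class below is a minimal sketch under that assumption; MappedFile and MappingError are hypothetical names, not part of any library.

```swift
import Foundation

// Hypothetical error type for this sketch.
enum MappingError: Error { case openFailed, statFailed, mmapFailed }

/// Owns a read-only, private file mapping and unmaps it on deinit.
final class MappedFile {
    private let base: UnsafeMutableRawPointer
    let count: Int

    /// Read-only view over the mapped bytes; valid only while this object is alive.
    var bytes: UnsafeRawBufferPointer {
        return UnsafeRawBufferPointer(start: base, count: count)
    }

    init(path: String) throws {
        let fd = open(path, O_RDONLY)
        guard fd >= 0 else { throw MappingError.openFailed }
        defer { close(fd) }   // the mapping remains valid after the descriptor is closed

        var st = stat()
        guard fstat(fd, &st) == 0 else { throw MappingError.statFailed }
        let size = Int(st.st_size)

        // mmap signals failure with MAP_FAILED rather than nil, so check both.
        guard let mapped = mmap(nil, size, PROT_READ, MAP_PRIVATE, fd, 0),
              mapped != MAP_FAILED else { throw MappingError.mmapFailed }
        base = mapped
        count = size
    }

    deinit {
        munmap(base, count)
    }
}
```

A loader would keep the MappedFile instance alive for as long as the inference engine holds pointers into it, for example as a stored property of the session object.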

Lazy Loading and Page Fault Latency

The first inference pass after mmap() incurs a burst of page faults as the model traverses embedding, attention, and FFN layers. On an iPhone 13 Pro with NVMe storage, each 16KB page fault resolves in 50–150 microseconds. A 1.2GB model spans roughly 76,800 pages; faulting in every page one at a time would still cost about 3.8 seconds at 50 microseconds per fault, and up to 11.5 seconds at 150.
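The page count and worst-case fault cost follow from simple arithmetic, sketched below. The sketch takes 1.2GB as 1,200 MiB, which is the reading that reproduces the 76,800-page figure, and assumes every page faults exactly once with no read-ahead; totalFaultSeconds is a hypothetical helper.

```swift
// Figures from the text: 1.2GB file (taken as 1,200 MiB), 16KB pages.
let fileBytes = 1_200 * 1024 * 1024
let pageBytes = 16 * 1024
let pageCount = fileBytes / pageBytes   // 76_800

/// Seconds spent if each of `pages` faults once at the given per-fault latency.
func totalFaultSeconds(pages: Int, microsecondsPerFault: Double) -> Double {
    return Double(pages) * microsecondsPerFault / 1_000_000
}

let bestCase = totalFaultSeconds(pages: pageCount, microsecondsPerFault: 50)    // about 3.84 s
let worstCase = totalFaultSeconds(pages: pageCount, microsecondsPerFault: 150)  // about 11.52 s
print(pageCount, bestCase, worstCase)
```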

However, inference engines like llama.cpp and ONNX Runtime exhibit spatial locality: they process weight matrices in contiguous blocks. The first token generation might fault 8,000 pages (128MB of embeddings and attention heads), but subsequent tokens reuse those pages from the OS page cache. By the third or fourth token, fault rates drop below 1% of memory accesses.

Measured on a 1.1GB Llama-2-7B model quantized to Q4_K_M:

Traditional read: 7,800ms to load, 1.2GB heap allocation, 340ms first-token latency
mmap(): 180ms to map, 0 bytes heap at init, 420ms first-token latency (includes page faults), 290ms second-token latency

The 80ms first-token penalty (420ms versus 340ms) is small next to the 7,620ms saved at load time, and it amortizes across the session. For multi-turn conversations or batch inference, mmap() delivers net wins in both latency and memory efficiency.
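The measurements above can be combined into a single time-to-first-token figure, counted from the moment loading starts. This is a small sketch using only the numbers reported in this section; LoadProfile is a hypothetical type.

```swift
/// Load time and first-token latency for one loading strategy, in milliseconds.
struct LoadProfile {
    let loadMs: Int
    let firstTokenMs: Int

    /// Time from the start of loading until the first token is produced.
    var timeToFirstTokenMs: Int { return loadMs + firstTokenMs }
}

let eagerRead = LoadProfile(loadMs: 7_800, firstTokenMs: 340)
let mappedRead = LoadProfile(loadMs: 180, firstTokenMs: 420)

print(eagerRead.timeToFirstTokenMs, mappedRead.timeToFirstTokenMs)   // 8140 600
```

Seen this way, the mapped path reaches its first token in 600ms against 8,140ms for the eager read, despite the slower first inference pass.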

Alignment and Quantization Formats

Memory-mapped tensors should respect CPU alignment requirements. ARM64 NEON instructions operate on 128-bit vectors; ARM64 generally tolerates unaligned loads from normal memory, but they can be slower, particularly when an access straddles a cache line. GGUF (the format used by llama.cpp) aligns tensor data to 32 bytes by default, and because mmap() returns a page-aligned base address, a 32-byte-aligned file offset yields a 32-byte-aligned pointer.
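A loader can verify alignment before handing a tensor pointer to vectorized kernels. The check below is a minimal sketch; isAligned is a hypothetical helper, and a heap block allocated with 32-byte alignment stands in for a mapped region.

```swift
/// True when the address of `pointer` is a multiple of `alignment` bytes.
func isAligned(_ pointer: UnsafeRawPointer, to alignment: Int) -> Bool {
    return Int(bitPattern: pointer) % alignment == 0
}

// Stand-in for a mapped region: a heap block with a 32-byte-aligned base.
let block = UnsafeMutableRawPointer.allocate(byteCount: 64, alignment: 32)
let baseAligned = isAligned(UnsafeRawPointer(block), to: 32)
// Eight bytes past an aligned base is no longer 32-byte aligned.
let shiftedAligned = isAligned(UnsafeRawPointer(block.advanced(by: 8)), to: 32)
block.deallocate()

print(baseAligned, shiftedAligned)   // true false
```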

When serializing models for mmap(), pad each tensor's offset to the next 32-byte boundary:

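// Rounds offset up to the next multiple of alignment.
// The bit mask is only valid when alignment is a power of two.
func alignedOffset(_ offset: Int, alignment: Int = 32) -> Int {
    return (offset + alignment - 1) & ~(alignment - 1)
}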
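Applied while writing a file, the padding rule produces one start offset per tensor. The sketch below repeats the rounding function so that it runs on its own; layoutOffsets and the sample sizes are hypothetical and do not reflect how any particular GGUF writer is implemented.

```swift
/// Rounds `offset` up to the next multiple of `alignment` (which must be a power of two).
func alignedOffset(_ offset: Int, alignment: Int = 32) -> Int {
    precondition(alignment > 0 && (alignment & (alignment - 1)) == 0,
                 "alignment must be a power of two")
    return (offset + alignment - 1) & ~(alignment - 1)
}

/// Assigns each tensor an aligned start offset, packing tensors in order.
func layoutOffsets(sizes: [Int], startingAt start: Int = 0, alignment: Int = 32) -> [Int] {
    var cursor = start
    var offsets: [Int] = []
    for size in sizes {
        cursor = alignedOffset(cursor, alignment: alignment)   // pad up to the boundary
        offsets.append(cursor)
        cursor += size                                         // advance past the tensor data
    }
    return offsets
}

// Three tensors of 100, 7, and 64 bytes.
let offsets = layoutOffsets(sizes: [100, 7, 64])
print(offsets)   // [0, 128, 160]
```

The 28 bytes after the first tensor and the 25 bytes after the second are padding; they cost a little file size in exchange for aligned loads everywhere.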

GGUF files embed a header with tensor metadata (name, shape, dtype, offset). The loader parses this header (typically