The Mobile ML Memory Wall

Modern mobile ML applications face a brutal constraint: peak memory usage determines whether your app survives in the background or gets killed by the OS. A typical on-device LLM inference pipeline might allocate tensors for embeddings (512×4096 float32 values = 8MB), attention matrices (batch×heads×seq×seq, 64MB for 2048 tokens), and intermediate activations (another 40MB). Add model weights (200MB+, and that only for a small or aggressively quantized model; a 1B-parameter model at 4 bits per weight is already about 500MB) and you're past 300MB before the user sees a single token.
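
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the text gives the attention size as a product, so batch=1, heads=4, and 4-byte floats are assumptions chosen because they reproduce the 64MB figure, and tensor_mib is a hypothetical helper, not part of any framework.

```python
MIB = 1024 * 1024
BYTES_PER_FLOAT32 = 4


def tensor_mib(*shape, bytes_per_elem=BYTES_PER_FLOAT32):
    """Return the size in MiB of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_elem / MIB


# Embeddings: 512 x 4096 float32 values, as quoted in the text.
embeddings = tensor_mib(512, 4096)

# Attention: batch x heads x seq x seq. batch=1 and heads=4 are assumptions.
attention = tensor_mib(1, 4, 2048, 2048)

# Figures taken directly from the text (no shapes are given for these).
activations = 40.0
weights = 200.0

peak = embeddings + attention + activations + weights
print(f"embeddings={embeddings} MiB, attention={attention} MiB, peak={peak} MiB")
```

Under these assumptions the four buffers sum to 312 MiB, and the attention matrix alone is eight times the size of the embeddings, which is why it is the first candidate for deferred allocation.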

The naive approach materializes every tensor up front, at graph construction. The result is a memory spike that triggers OOM kills on mid-range Android devices, especially when the user switches apps mid-inference and the process is backgrounded at its peak footprint. After shipping HearingAid Pro, which runs real-time DSP alongside speech recognition, the lesson was clear: you cannot afford to allocate what you might not use.

Lazy Allocation: Defer Until Access

Lazy tensor materialization inverts the allocation model. Instead of allocating a 64MB attention matrix at graph construction, you allocate a descriptor—a lightweight struct holding shape, dtype, and a factory function. The backing memory is allocated only when the tensor is first read or written.
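
A minimal sketch of such a descriptor follows, assuming a plain Python bytearray stands in for the backing memory. The names LazyTensor, DTYPE_SIZES, data, and is_materialized are hypothetical and chosen for illustration; a production runtime would hand out native buffers and handle thread safety, which this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple

# Bytes per element for the dtypes used in this sketch (illustrative subset).
DTYPE_SIZES = {"float32": 4, "float16": 2, "int8": 1}


@dataclass
class LazyTensor:
    """Descriptor holding shape, dtype, and a factory; no backing memory yet."""
    shape: Tuple[int, ...]
    dtype: str
    factory: Callable[[int], bytearray]  # called with nbytes on first access
    _buffer: Optional[bytearray] = field(default=None, repr=False)

    @property
    def nbytes(self) -> int:
        count = 1
        for dim in self.shape:
            count *= dim
        return count * DTYPE_SIZES[self.dtype]

    @property
    def is_materialized(self) -> bool:
        return self._buffer is not None

    def data(self) -> bytearray:
        """Return the backing buffer, allocating it on the first read or write."""
        if self._buffer is None:
            self._buffer = self.factory(self.nbytes)
        return self._buffer


# Describing the 64MB attention matrix allocates nothing.
attn = LazyTensor((1, 4, 2048, 2048), "float32", factory=bytearray)
print(attn.nbytes, attn.is_materialized)

# A small tensor with a counting factory shows allocation happens exactly once.
calls = []

def counting_factory(nbytes: int) -> bytearray:
    calls.append(nbytes)
    return bytearray(nbytes)  # zero-filled buffer of the requested size

small = LazyTensor((2, 3), "float32", factory=counting_factory)
first = small.data()
first[0] = 255           # a write goes to the now-materialized buffer
second = small.data()    # later accesses reuse the same buffer
```

Constructing attn costs only the descriptor itself; the 64MB buffer would be created on the first call to attn.data(). Routing every read and write through one accessor is the design choice that makes the deferral possible, at the price of a None check on each access.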

In practice, this looks like a two-tier system. The descriptor lives in a registry (a flat array or hash map, typically