Running a 7B-parameter language model on a mobile device with 4GB of total RAM, shared across the OS, UI, and background services, demands ruthless memory discipline. The naive approach of loading all weights into memory at startup needs roughly 14GB at FP16 and 7GB at INT8; even at INT4 it consumes about 3.5GB, leaving no headroom for the KV cache, activation buffers, or the application itself. Tile-based inference, which borrows its name from tiled rendering in graphics pipelines, addresses this by spatially partitioning the model and streaming weights on demand.

The Memory Wall in Mobile Transformers

A 7B transformer typically comprises 32 decoder layers, each containing a multi-head attention block and a feed-forward network. In standard inference, all weight matrices—query, key, value projections, output projection, FFN up-projection, down-projection—reside in memory simultaneously. For a single layer at FP16 precision:

  • Attention weights: 4 × (4096 × 4096) × 2 bytes = 134MB
  • FFN weights: 2 × (4096 × 11008) × 2 bytes = 180MB
  • Total per layer: ~314MB
  • 32 layers: ~10GB at FP16

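INT8 quantization halves this to 5GB, still untenable. INT4 brings it to 2.5GB but degrades quality on reasoning tasks. (The tally covers only the matrices listed above; it leaves out the embeddings and the gate projection that LLaMA-style FFNs add, which is why it falls short of the roughly 14GB that 7B parameters occupy at FP16.) The real constraint is not peak capacity but sustained allocation during inference, when activations, KV cache, and OS overhead compete for the same budget.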
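The per-layer arithmetic above can be reproduced in a few lines of C. This is a minimal sketch: the function names are illustrative, and it counts only the four attention matrices and two FFN matrices listed above.

```c
#include <stdint.h>

/* Bytes for one decoder layer: 4 attention matrices (hidden x hidden)
 * plus 2 FFN matrices (hidden x ffn), at the given bits per weight. */
uint64_t layer_weight_bytes(uint64_t hidden, uint64_t ffn, unsigned bits_per_weight) {
    uint64_t attn_params = 4 * hidden * hidden;
    uint64_t ffn_params  = 2 * hidden * ffn;
    return (attn_params + ffn_params) * bits_per_weight / 8;
}

/* Bytes for a stack of identical layers. */
uint64_t model_weight_bytes(uint64_t layers, uint64_t hidden, uint64_t ffn,
                            unsigned bits_per_weight) {
    return layers * layer_weight_bytes(hidden, ffn, bits_per_weight);
}
```

With hidden = 4096 and ffn = 11008, one layer at 16 bits comes to 314,572,800 bytes, and the 32-layer totals at 16, 8, and 4 bits are about 10.07GB, 5.03GB, and 2.52GB in decimal units.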

Spatial Partitioning: The Tile Abstraction

Tile-based inference divides the model into logical tiles (subsets of layers or attention heads) that load, execute, and evict sequentially. Unlike layer-wise streaming, whose unit is always one fully resident layer, a tile is sized to fit the memory budget: it can span several layers or, when even one layer is too large, drop to sub-layer granularity. For a 32-layer model, we partition into 8 tiles of 4 layers each. During a forward pass:

  1. Load Tile 0 (layers 0-3) into a 512MB working buffer
  2. Process input through these layers, accumulating activations
  3. Evict Tile 0, load Tile 1 (layers 4-7)
  4. Repeat until output layer

Peak memory becomes 512MB (tile) + 256MB (KV cache) + 128MB (activations) + 128MB (overhead) = 1024MB, roughly a 70% reduction versus keeping a 3.5GB model fully resident. The tradeoff: inference latency increases by 2-3× due to weight loading overhead.
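The numbered steps above can be sketched as a loop that keeps at most one tile resident. Everything here is hypothetical scaffolding: TileOps, Trace, and the demo_ hooks are illustrative names, and the hooks only count what a real engine would map, execute, and unmap.

```c
#include <stddef.h>

typedef struct {
    int first_layer;  /* inclusive */
    int last_layer;   /* inclusive */
} TileRange;

/* Layer range of tile `tile` when `num_layers` are split evenly into
 * `num_tiles` tiles (assumes num_layers % num_tiles == 0). */
TileRange tile_range(int num_layers, int num_tiles, int tile) {
    int per_tile = num_layers / num_tiles;
    TileRange r = { tile * per_tile, (tile + 1) * per_tile - 1 };
    return r;
}

/* Hypothetical engine hooks. */
typedef struct {
    int  (*load)(TileRange r, void *ctx);   /* 0 on success */
    int  (*run)(TileRange r, void *ctx);    /* 0 on success */
    void (*evict)(TileRange r, void *ctx);
} TileOps;

/* One forward pass: load, run, evict, then move to the next tile. */
int forward_pass(const TileOps *ops, int num_layers, int num_tiles, void *ctx) {
    for (int t = 0; t < num_tiles; t++) {
        TileRange r = tile_range(num_layers, num_tiles, t);
        if (ops->load(r, ctx) != 0) return -1;
        int rc = ops->run(r, ctx);
        ops->evict(r, ctx);               /* evict even if the run failed */
        if (rc != 0) return -1;
    }
    return 0;
}

/* Demo hooks: track residency instead of touching real weights. */
typedef struct { int resident; int max_resident; int layers_run; } Trace;

int demo_load(TileRange r, void *ctx) {
    Trace *t = ctx; (void)r;
    if (++t->resident > t->max_resident) t->max_resident = t->resident;
    return 0;
}
int demo_run(TileRange r, void *ctx) {
    Trace *t = ctx;
    t->layers_run += r.last_layer - r.first_layer + 1;
    return 0;
}
void demo_evict(TileRange r, void *ctx) {
    Trace *t = ctx; (void)r;
    t->resident--;
}
```

Running forward_pass with the demo hooks over 32 layers and 8 tiles visits all 32 layers while max_resident stays at 1, which is the property the memory budget relies on.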

Implementation: Memory-Mapped Weight Files

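We store quantized weights in a custom binary format with 4KB-aligned tile boundaries. Because mmap requires file offsets to be multiples of the system page size, 4KB alignment suffices only on 4KB-page systems; arm64 iOS uses 16KB pages, so tiles there must be aligned to 16KB or mapped from a rounded-down offset. Each tile header encodes layer indices, tensor shapes, and file offsets. The inference engine uses memory-mapped I/O (mmap on POSIX, CreateFileMapping with MapViewOfFile on Windows) to avoid explicit read() calls: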

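#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

typedef struct TileHeader {
  uint32_t layer_start;   // first layer in the tile (inclusive)
  uint32_t layer_end;     // last layer in the tile (inclusive)
  uint64_t weight_offset; // byte offset into the weight file, page-aligned
  uint64_t weight_size;   // size of the tile's weights in bytes
  uint8_t quant_scheme;   // 0=INT8, 1=INT4
} TileHeader;

// Maps one tile read-only; returns NULL if the mapping fails.
void* map_tile(int fd, const TileHeader* hdr) {
  void* p = mmap(NULL, (size_t)hdr->weight_size,
                 PROT_READ, MAP_PRIVATE,
                 fd, (off_t)hdr->weight_offset);
  return p == MAP_FAILED ? NULL : p;
}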

Memory-mapped regions are demand-paged by the OS; unused pages never hit physical RAM. On iOS, we advise the kernel with madvise(MADV_SEQUENTIAL) to optimize prefetch. Android's ashmem provides similar semantics with explicit pinning control.

Asynchronous Tile Prefetch

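To hide I/O latency, we prefetch Tile N+1 while processing Tile N. A dedicated thread issues a read-ahead hint 200ms before the tile boundary, using posix_fadvise(POSIX_FADV_WILLNEED) where it exists. On iOS, which does not provide posix_fadvise, fcntl(F_RDADVISE) or madvise(MADV_WILLNEED) on the mapping serves the same purpose. On NVMe storage (iPhone 14 Pro), this reduces tile swap time from 85ms to 12ms. On eMMC (mid-range Android), gains are modest, from 45ms to 38ms, but still meaningful for interactive latency.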

Activation Checkpointing Within Tiles

Standard gradient checkpointing recomputes activations during backprop to save memory. In inference-only tile mode, we apply a variant: only the final activation of each tile persists in the inter-tile buffer. Intermediate activations within a tile are stack-allocated and discarded after the tile completes. For a 4-layer tile with hidden size 4096:

  • Without checkpointing: 4 × 4096 × batch_size × 2 bytes retained
  • With checkpointing: 1 × 4096 × batch_size × 2 bytes retained

At batch_size=1 (typical for on-device inference), this saves 24KB per tile for each token position: minor individually, but 192KB across 8 tiles.
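One way to keep only the final activation is to alternate between two scratch buffers inside the tile and copy the last result into the inter-tile buffer. The sketch below assumes a hypothetical LayerFn signature and uses float for readability rather than FP16; demo_layer is a stand-in, not a real transformer layer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layer signature: reads n values from `in`, writes n to `out`. */
typedef void (*LayerFn)(int layer, const float *in, float *out, size_t n);

/* Stand-in layer for demonstration only: adds 1 to every element. */
void demo_layer(int layer, const float *in, float *out, size_t n) {
    (void)layer;
    for (size_t i = 0; i < n; i++) out[i] = in[i] + 1.0f;
}

/* Run layers [first, last]. Intermediate activations live only in the two
 * scratch buffers; the tile's final activation overwrites `inter_tile`. */
void run_tile(LayerFn fn, int first, int last, float *inter_tile,
              float *scratch_a, float *scratch_b, size_t n) {
    float *bufs[2] = { scratch_a, scratch_b };
    const float *in = inter_tile;
    int which = 0;
    for (int l = first; l <= last; l++) {
        fn(l, in, bufs[which], n);  /* write to the buffer not being read */
        in = bufs[which];
        which ^= 1;
    }
    if (in != inter_tile) memcpy(inter_tile, in, n * sizeof(float));
}

/* Bytes retained between tiles for a given number of kept activations. */
size_t retained_bytes(size_t activations_kept, size_t hidden,
                      size_t batch, size_t bytes_per_value) {
    return activations_kept * hidden * batch * bytes_per_value;
}
```

With hidden = 4096, batch = 1, and 2 bytes per value, retained_bytes(4, ...) minus retained_bytes(1, ...) is 24,576 bytes, the 24KB quoted above.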

KV Cache Tiling for Long Contexts

Attention's KV cache grows linearly with context length. At 2048 tokens, 32 layers, 32 heads of dimension 128 (4096 values per token), and FP16 precision, the cache consumes 32 layers × 2048 tokens × 2 (keys and values) × 4096 × 2 bytes = 1GB, or 128MB per 4-layer tile. Tile-based KV caching splits the cache across tiles, keeping in memory only the keys and values for the layers currently resident. When Tile 0 evicts, its KV entries spill to disk via a circular buffer:

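#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

#define NUM_TILES 8
#define KV_TILE_BYTES ((size_t)128 * 1024 * 1024) // 128MB per tile

typedef struct KVTileCache {
  FILE* spillfile;
  uint64_t spill_offset[NUM_TILES];
  void* resident_kv; // KV_TILE_BYTES buffer
} KVTileCache;

// Writes the resident KV partition to the tile's spill slot; 0 on success.
int evict_kv(KVTileCache* cache, int tile_id) {
  if (fseeko(cache->spillfile, (off_t)cache->spill_offset[tile_id], SEEK_SET) != 0)
    return -1;
  return fwrite(cache->resident_kv, KV_TILE_BYTES, 1, cache->spillfile) == 1 ? 0 : -1;
}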

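On re-entry, the tile reloads its KV partition. Because each decoded token passes through every tile, that reload recurs once per tile per token whenever the cache has been spilled. For short single-turn prompts whose cache fits in memory, we skip spilling entirely; only long multi-turn chats need the persistent spill file.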
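The cache-size formula and the spill/reload round trip can be sketched together. This is a simplified illustration under stated assumptions: it uses fixed per-tile slots (tile index times partition size) rather than a circular buffer, and KVSpill, kv_spill, and kv_reload are hypothetical names.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* KV cache size: layers x tokens x 2 (K and V) x hidden x bytes per value. */
uint64_t kv_cache_bytes(uint64_t layers, uint64_t tokens,
                        uint64_t hidden, uint64_t bytes_per_value) {
    return layers * tokens * 2 * hidden * bytes_per_value;
}

typedef struct {
    FILE  *spill;       /* stream opened for reading and writing */
    size_t tile_bytes;  /* size of one tile's KV partition */
} KVSpill;

/* Position the stream at the start of a tile's slot. */
static int kv_seek(KVSpill *s, int tile) {
    return fseeko(s->spill, (off_t)tile * (off_t)s->tile_bytes, SEEK_SET);
}

/* Write a tile's KV partition to its slot; 0 on success. */
int kv_spill(KVSpill *s, int tile, const void *kv) {
    if (kv_seek(s, tile) != 0) return -1;
    return fwrite(kv, 1, s->tile_bytes, s->spill) == s->tile_bytes ? 0 : -1;
}

/* Read a tile's KV partition back into `kv`; 0 on success. */
int kv_reload(KVSpill *s, int tile, void *kv) {
    if (kv_seek(s, tile) != 0) return -1;
    return fread(kv, 1, s->tile_bytes, s->spill) == s->tile_bytes ? 0 : -1;
}
```

For the configuration above, kv_cache_bytes(32, 2048, 4096, 2) is 1,073,741,824 bytes, and the same call with 4 layers gives the 134,217,728-byte partition that one tile keeps resident.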

Quantization-Aware Tile Boundaries

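INT4 quantization packs two weights per byte. A naive tile boundary placed mid-layer can split a byte, forcing unpacking overhead. We align tile cuts to 64-weight boundaries (32 bytes), so every tile starts on a whole byte and on a 32-byte boundary. That is half of the 64-byte cache line on the Cortex-A78; full cache-line alignment would need 128-weight cuts. In our measurements on an ARM Cortex-A78, the alignment eliminated 15% of unpacking stalls.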
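The alignment rule reduces to integer arithmetic on weight indices. The helper names below are illustrative; the 64-weight constant is the one quoted in the text.

```c
#include <stdint.h>

#define TILE_ALIGN_WEIGHTS 64u  /* alignment quoted in the text */

/* Round a weight index up to the next multiple of TILE_ALIGN_WEIGHTS. */
uint64_t align_tile_cut(uint64_t weight_index) {
    return (weight_index + TILE_ALIGN_WEIGHTS - 1)
           / TILE_ALIGN_WEIGHTS * TILE_ALIGN_WEIGHTS;
}

/* Byte offset of an INT4 weight index (two weights per byte). */
uint64_t int4_byte_offset(uint64_t weight_index) {
    return weight_index / 2;
}

/* A cut at an odd weight index lands in the middle of a packed byte. */
int cut_splits_byte(uint64_t weight_index) {
    return (weight_index % 2) != 0;
}
```

A proposed cut at weight 4097 would split a byte; rounding it up with align_tile_cut moves it to weight 4160, which sits at byte offset 2080, a multiple of 32.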

Mixed-Precision Tiles

Not all layers tolerate aggressive quantization equally. Empirically, the first and last 4 layers degrade sharply below INT8, while middle layers handle INT4 well. We encode this in the tile manifest:

enum { QUANT_INT8 = 0, QUANT_INT4 = 1 }; // values match TileHeader.quant_scheme

TileHeader tiles[8] = {
  {0, 3, offset0, size0, QUANT_INT8},  // Input layers
  {4, 7, offset1, size1, QUANT_INT4},
  // ...
  {28, 31, offset7, size7, QUANT_INT8} // Output layers
};

The inference loop selects dequantization kernels per tile. INT8 tiles consume twice the memory of INT4 tiles but preserve quality where it matters, so the working buffer must be sized for the INT8 tiles.
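Per-tile kernel selection can be as simple as a switch on the tile's quantization scheme. The sketch below makes several assumptions that the text does not specify: symmetric quantization with a single scale per call, signed two's-complement values, and a low-nibble-first INT4 layout.

```c
#include <stddef.h>
#include <stdint.h>

enum QuantScheme { QUANT_INT8 = 0, QUANT_INT4 = 1 };

/* INT8: one signed byte per weight. */
static void dequant_int8(const uint8_t *src, size_t n, float scale, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = (float)(int8_t)src[i] * scale;
}

/* INT4: two signed 4-bit weights per byte, low nibble first (assumed). */
static void dequant_int4(const uint8_t *src, size_t n, float scale, float *out) {
    for (size_t i = 0; i < n; i++) {
        uint8_t nib = (i % 2 == 0) ? (uint8_t)(src[i / 2] & 0x0F)
                                   : (uint8_t)(src[i / 2] >> 4);
        int v = (nib ^ 8) - 8;  /* sign-extend 4-bit two's complement */
        out[i] = (float)v * scale;
    }
}

/* Dispatch on the tile's scheme; n is the number of weights.
 * Returns 0 on success, -1 for an unknown scheme. */
int dequant_tile(int scheme, const uint8_t *src, size_t n,
                 float scale, float *out) {
    switch (scheme) {
    case QUANT_INT8: dequant_int8(src, n, scale, out); return 0;
    case QUANT_INT4: dequant_int4(src, n, scale, out); return 0;
    default:         return -1;
    }
}
```

The inference loop would pass each tile's quant_scheme field to dequant_tile, so the INT8 boundary tiles and INT4 middle tiles share one code path.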

Real-World Performance: OfflineAI Chat

In OfflineAI, a privacy-focused on-device chat app, we deployed tile-based inference for a 7B LLaMA-derived model. Test device: iPhone 13 (4GB RAM), iOS 17. Metrics over 50 chat turns:

  • Peak memory: 980MB (vs. 3.2GB baseline)
  • First-token latency: 1.8s (vs. 0.6s baseline)
  • Tokens per second: 4.2 (vs. 11.3 baseline)
  • Background eviction events: 0 (vs. 18 baseline)

The 2.7× throughput penalty is acceptable for use cases prioritizing availability over speed: users tolerate slower responses if the app never crashes. Battery consumption increased by 12% due to sustained I/O, which we mitigate by caching tiles in a 256MB LRU pool for repeated prompts.
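The bookkeeping behind an LRU tile pool might look like the sketch below. It tracks only tile IDs, sizes, and recency; the slot count, names, and return codes are assumptions, and a real pool would also own the cached bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS 8

typedef struct {
    int      tile_id;    /* -1 marks an empty slot; real IDs are >= 0 */
    size_t   bytes;
    uint64_t last_used;  /* logical clock value of the latest access */
} PoolEntry;

typedef struct {
    PoolEntry slots[POOL_SLOTS];
    size_t    capacity;  /* byte budget, e.g. 256MB */
    size_t    used;
    uint64_t  clock;
} TilePool;

void pool_init(TilePool *p, size_t capacity) {
    for (int i = 0; i < POOL_SLOTS; i++) p->slots[i].tile_id = -1;
    p->capacity = capacity;
    p->used = 0;
    p->clock = 0;
}

/* Index of the slot holding tile_id (or of an empty slot for -1), else -1. */
static int pool_find(const TilePool *p, int tile_id) {
    for (int i = 0; i < POOL_SLOTS; i++)
        if (p->slots[i].tile_id == tile_id) return i;
    return -1;
}

/* Drop the least recently used entry; returns 0 if the pool was empty. */
static int pool_evict_lru(TilePool *p) {
    int victim = -1;
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (p->slots[i].tile_id < 0) continue;
        if (victim < 0 || p->slots[i].last_used < p->slots[victim].last_used)
            victim = i;
    }
    if (victim < 0) return 0;
    p->used -= p->slots[victim].bytes;
    p->slots[victim].tile_id = -1;
    return 1;
}

/* Returns 1 on a hit, 0 on a miss (tile inserted), -1 if it cannot fit. */
int pool_access(TilePool *p, int tile_id, size_t bytes) {
    int i = pool_find(p, tile_id);
    if (i >= 0) { p->slots[i].last_used = ++p->clock; return 1; }
    if (bytes > p->capacity) return -1;
    while (p->used + bytes > p->capacity || pool_find(p, -1) < 0)
        if (!pool_evict_lru(p)) return -1;
    i = pool_find(p, -1);
    p->slots[i] = (PoolEntry){ tile_id, bytes, ++p->clock };
    p->used += bytes;
    return 0;
}
```

On a hit the engine can skip the storage read for that tile; on a miss it loads the tile as usual and the pool evicts the least recently used entries until the new one fits.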

Tradeoffs and Future Directions

Tile-based inference sacrifices throughput for memory stability. It's a poor fit for batch processing or high-frequency inference. Ideal scenarios: single-user assistants, offline translation, accessibility tools where responsiveness matters less than reliability. Emerging optimizations include:

  • Hierarchical tiling: Nested tiles (layer → head → weight block) for finer granularity
  • Speculative tile loading: Predict next tile based on prompt patterns
  • Hybrid execution: Run first/last layers in-memory, tile middle layers

As mobile DRAM capacities plateau, spatial partitioning techniques will become standard. The challenge shifts from "can we run this model?" to "how do we architect inference to coexist with the rest of the system?" Tile-based inference is one answer—a deliberate compromise that keeps AI capabilities accessible on constrained hardware.