Running a single large language model on a mobile device is hard. Running three simultaneously—a router, a specialist, and a fallback—sounds impossible. Yet modern AI products demand exactly this: context-aware task routing, domain-specific responses, and graceful degradation when the primary model fails. The naive approach loads all models into RAM, exhausts the 4GB budget in seconds, and triggers the OOM killer. The alternative—sequential loading—adds 800ms of swap latency per model transition, destroying user experience.

Interleaved decode solves this by treating inference as a cooperative multitasking problem. Instead of running one model's decode to completion before starting another, we slice decoding into 50-100ms quanta and rotate between active models. The result: three models appear to run concurrently, total memory stays under 2.8GB, and the first token arrives in about 180ms from a cold start, faster than loading even a single model the traditional way.

The Multi-Model Use Case

Consider a privacy-focused mobile assistant handling: intent classification ("Is this a question, command, or search?"), domain routing ("Send to medical, legal, or general knowledge model"), and generation ("Produce the actual answer"). A traditional pipeline runs these sequentially:

  1. Load 120MB classifier → infer → unload (340ms)
  2. Load 890MB specialist → infer → unload (1,200ms)

Total: 340ms + 1,200ms = 1,540ms to first token.

Users perceive anything over 300ms as laggy. Preloading all three models consumes 3.2GB, leaving no headroom for the OS, UI rendering, or user data. On a typical iPhone 13 with 4GB total RAM, the app is terminated within seconds.

Interleaved Scheduling Architecture

The core insight: decoding is independent across models. Within one model, tokens are still generated sequentially, but a forward pass (matrix multiplications, attention, and feedforward ops) never depends on another model's state, so decode steps from different models can be interleaved freely. We exploit this by:

  1. Memory-mapped weights: Models live in read-only mmap regions backed by disk. The OS pages in 16KB chunks on-demand, evicting cold pages under pressure.
  2. Shared KV cache pool: A single 512MB ring buffer holds key-value tensors for all models. Each model gets a reserved slice; overflow triggers LRU eviction.
  3. Quantum scheduler: A 50ms timer pauses the active model mid-decode, snapshots its state (position indices, attention mask, and sampler state; under 8KB in total), and resumes the next model in the queue.
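The memory-mapped weights idea in item 1 can be sketched with plain POSIX calls. This is a minimal sketch, not the runtime's actual loader: `MappedWeights`, `map_weights`, and `unmap_weights` are hypothetical names introduced for illustration, and the weights file is assumed to be a flat binary blob.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper type: a read-only view of a weights file.
struct MappedWeights {
    const unsigned char* data;
    std::size_t size;
};

// Map a weights file read-only. Pages are faulted in lazily on first touch,
// and clean file-backed pages can be dropped by the OS under memory pressure.
// Returns {nullptr, 0} on failure.
inline MappedWeights map_weights(const char* path) {
    MappedWeights out = {nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return out;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            out.data = static_cast<const unsigned char*>(p);
            out.size = static_cast<std::size_t>(st.st_size);
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return out;
}

// Release the mapping and reset the view.
inline void unmap_weights(MappedWeights& w) {
    if (w.data != nullptr) munmap(const_cast<unsigned char*>(w.data), w.size);
    w.data = nullptr;
    w.size = 0;
}
```

Because the mapping is read-only and file-backed, evicted pages never need to be written back; they are simply re-read from disk on the next access.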

The scheduler maintains a priority queue ordered by deadline—the timestamp by which the next token must emit to maintain 20 tokens/sec perceived throughput. Models with tighter deadlines preempt lower-priority inference.
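A deadline-ordered queue of this kind can be sketched with `std::priority_queue`. The names below (`QueuedModel`, `LaterDeadline`, `run_one_quantum`) are illustrative assumptions, and the "decode" is only simulated by advancing the deadline by 50ms per token, which is what a 20 tokens/sec target implies.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical queue entry: which model, and when its next token is due.
struct QueuedModel {
    int model_id;
    std::uint64_t deadline_us;
};

// Comparator that puts the smallest (earliest) deadline on top.
struct LaterDeadline {
    bool operator()(const QueuedModel& a, const QueuedModel& b) const {
        return a.deadline_us > b.deadline_us;
    }
};

using DeadlineQueue =
    std::priority_queue<QueuedModel, std::vector<QueuedModel>, LaterDeadline>;

// 20 tokens/sec target: one token every 50,000 microseconds.
constexpr std::uint64_t kTokenPeriodUs = 1000000 / 20;

// Pop the most urgent model, pretend it emitted `tokens` tokens, push its
// deadline forward accordingly, and requeue it. Returns the chosen model id.
// Precondition: the queue is not empty.
inline int run_one_quantum(DeadlineQueue& q, int tokens) {
    QueuedModel m = q.top();
    q.pop();
    m.deadline_us += kTokenPeriodUs * static_cast<std::uint64_t>(tokens);
    q.push(m);
    return m.model_id;
}
```

With three models queued at deadlines 30,000µs, 10,000µs, and 20,000µs, successive calls pick the 10,000µs model first, then the 20,000µs one, then the 30,000µs one.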

State Snapshot Design

Pausing a model mid-forward-pass requires capturing minimal state:

  • Decode position: Integer index into the sequence (4 bytes)
  • Attention mask: Bitfield marking which tokens are visible (context_length / 8 bytes, typically 64 bytes for 512-token context)
  • Sampler state: Temperature, top-k, top-p, random seed (16 bytes)
  • Partial activations: If interrupted mid-layer, the incomplete activation tensor (0-4KB depending on layer depth)

Total snapshot: under 8KB per model (4 + 64 + 16 bytes of fixed state plus at most 4KB of partial tensor data). Three models cost at most 24KB, negligible compared to 2.8GB of weights and KV cache.
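The size budget can be checked with a few lines of compile-time arithmetic. The constants come from the list above; the function and constant names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kPositionBytes = 4;           // decode position
constexpr std::size_t kSamplerBytes = 16;           // temperature, top-k, top-p, seed
constexpr std::size_t kMaxPartialBytes = 4 * 1024;  // worst-case partial tensor

// One bit per visible token, rounded up to whole bytes.
constexpr std::size_t mask_bytes(std::size_t context_tokens) {
    return (context_tokens + 7) / 8;
}

// Worst-case snapshot size for a given context length.
constexpr std::size_t max_snapshot_bytes(std::size_t context_tokens) {
    return kPositionBytes + mask_bytes(context_tokens) + kSamplerBytes +
           kMaxPartialBytes;
}

static_assert(mask_bytes(512) == 64, "512-token mask is 64 bytes");
static_assert(max_snapshot_bytes(512) == 4180, "4 + 64 + 16 + 4096");
static_assert(max_snapshot_bytes(512) < 8 * 1024, "fits the 8KB budget");
static_assert(3 * max_snapshot_bytes(512) < 24 * 1024, "three models fit in 24KB");
```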

Cooperative Decode Loop

The runtime maintains a ModelContext struct per loaded model:

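#include <cstdint>

// One entry per loaded model. mmap_region, kv_slice, and DecodeState are
// runtime types assumed to be declared elsewhere.
struct ModelContext {
  int model_id;          // identifier used by the scheduler
  mmap_region weights;   // read-only mapping of the weights file
  kv_slice cache;        // this model's slice of the shared KV pool
  DecodeState state;     // snapshot restored at the start of each quantum
  uint64_t deadline_us;  // time by which the next token must be emitted
  bool is_active;        // whether the model is currently schedulable
};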

Each 50ms quantum executes:

  1. Select: Pop the model with earliest deadline from the priority queue.
  2. Restore: Load its snapshot, set attention mask, seek to decode position.
  3. Decode: Run forward pass for 1-3 tokens (depending on model size). A 120MB model completes 3 tokens; an 890MB model completes 1.
  4. Snapshot: Serialize state, update deadline (add 50ms per token at 20 tok/s target rate).
  5. Yield: Push model back into queue, context-switch to next.

The key optimization: never block on a cold model. If a model's next weight pages are not resident (each page fault costs ~2ms), the scheduler skips it for this quantum and runs another model whose pages are hot.
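The loop can be simulated in a few dozen lines. This is a sketch under stated assumptions, not the production scheduler: `SimModel`, `select_model`, and `run_quantum` are hypothetical names, the forward pass is replaced by a token counter, and page residency is assumed to be known up front as a boolean (a real runtime would have to query or track it).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a loaded model.
struct SimModel {
    int model_id;
    int tokens_per_quantum;     // e.g. 3 for a small model, 1 for a large one
    std::uint64_t deadline_us;  // when the next token is due
    bool pages_hot;             // assumption: residency is known before dispatch
    int tokens_emitted;
};

constexpr std::uint64_t kUsPerToken = 50000;  // 20 tok/s target rate

// Select: earliest deadline among models whose pages are hot. If no model
// is hot, fall back to the earliest deadline overall. Returns an index, or
// -1 when there are no models.
inline int select_model(const std::vector<SimModel>& models) {
    int best_hot = -1;
    int best_any = -1;
    for (int i = 0; i < static_cast<int>(models.size()); ++i) {
        if (best_any < 0 || models[i].deadline_us < models[best_any].deadline_us)
            best_any = i;
        if (models[i].pages_hot &&
            (best_hot < 0 || models[i].deadline_us < models[best_hot].deadline_us))
            best_hot = i;
    }
    return best_hot >= 0 ? best_hot : best_any;
}

// One quantum: select, "decode", advance the deadline, and yield.
// Returns the id of the model that ran, or -1 if none exists.
inline int run_quantum(std::vector<SimModel>& models) {
    int i = select_model(models);
    if (i < 0) return -1;
    SimModel& m = models[i];
    m.tokens_emitted += m.tokens_per_quantum;  // stand-in for the forward pass
    m.deadline_us += kUsPerToken * static_cast<std::uint64_t>(m.tokens_per_quantum);
    m.pages_hot = true;  // running a model leaves its pages resident
    return m.model_id;
}
```

In this simplified form a cold model is skipped until its pages become hot or every model is cold, so a real scheduler would also need to trigger page-in for skipped models to avoid starving them.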

Memory Pressure Handling

When total RSS exceeds 2.5GB (leaving 1.5GB for OS + UI), the eviction policy kicks in:

  1. Flush completed models' KV cache slices (a model that has finished generating no longer needs its cached keys and values).
  2. Evict least-recently-used weight pages (the OS does this automatically via mmap, but we hint with madvise(MADV_DONTNEED)).
  3. If still over budget, pause the lowest-priority model and serialize its full state to disk (50ms penalty, but rare).

In practice, the 512MB KV cache pool is the primary lever. By aggressively discarding old conversation turns (keeping only the last 256 tokens per model), we maintain headroom without touching weight pages.
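The escalation order and the 256-token trim can be expressed as two small functions. This is an illustrative sketch: `PressureAction`, `next_action`, and `trim_kv` are hypothetical names, the 2.5GB threshold is taken as 2.5 GiB, and the KV cache is modeled as a `std::deque` with one entry per token rather than the ring buffer described earlier.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical action set mirroring the three eviction steps.
enum class PressureAction {
    None,
    FlushFinishedKv,
    HintWeightEviction,
    SuspendLowestPriority
};

constexpr std::uint64_t kRssBudgetBytes = (5ull << 30) / 2;  // 2.5 GiB

// Pick the mildest step that has not been exhausted yet.
inline PressureAction next_action(std::uint64_t rss_bytes,
                                  bool has_finished_kv,
                                  bool weights_hinted) {
    if (rss_bytes <= kRssBudgetBytes) return PressureAction::None;
    if (has_finished_kv) return PressureAction::FlushFinishedKv;
    if (!weights_hinted) return PressureAction::HintWeightEviction;
    return PressureAction::SuspendLowestPriority;
}

// Keep only the most recent `keep` tokens' entries, oldest first out.
// Returns how many entries were dropped.
template <typename Entry>
std::size_t trim_kv(std::deque<Entry>& kv, std::size_t keep = 256) {
    std::size_t dropped = 0;
    while (kv.size() > keep) {
        kv.pop_front();
        ++dropped;
    }
    return dropped;
}
```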

Latency Breakdown

Measured on iPhone 13 Pro (A15 Bionic, 6GB RAM) running three models—120MB classifier, 890MB medical specialist, 340MB fallback:

  • Cold start (all models unmapped): 180ms to first classifier token (page faults dominate).
  • Warm pipeline (weights cached): 95ms to first specialist token after classification completes.
  • Concurrent decode: 18 tokens/sec aggregate throughput (6 tok/s classifier, 8 tok/s specialist, 4 tok/s fallback).
  • Context switch overhead: 1.2ms per quantum (snapshot + restore + priority queue ops).

Compare to sequential loading: 340ms (load classifier) + 1,200ms (load specialist) = 1,540ms. Measured against the 180ms cold-start figure above, interleaved decode reaches its first token roughly 8.5× sooner (1,540 / 180).

Tradeoffs and Failure Modes

Pros:

  • Perceived concurrency without memory explosion.
  • Graceful degradation: if one model thrashes, others continue.
  • Works on 4GB devices (tested down to iPhone 11).

Cons:

  • Increased context-switch overhead (5-10% throughput loss vs. single-model batch inference).
  • Complex scheduler: priority inversion bugs are subtle.
  • Page fault storms under extreme memory pressure (mitigated by KV cache eviction).

When it breaks: If all three models simultaneously need fresh weight pages (cold start + memory pressure), latency spikes to 400ms. The mitigation: preload the classifier weights at app launch (costs 120MB always-resident, but classifier runs in 15ms).

Production Patterns

Shipping this in OfflineAI required three guardrails:

  1. Deadline slippage detection: If a model misses its target token rate by >30%, demote its priority and allocate more quanta to faster models. Prevents one slow model from starving others.
  2. Thermal throttling integration: On devices above 42°C junction temperature, reduce quantum length from 50ms to 30ms and skip every third decode step. Lowers power draw by ~20% at cost of 15% throughput.
  3. Telemetry: Log per-model page fault counts, quantum overruns, and KV cache evictions. Metrics revealed that the fallback model (rarely used) caused 60% of page faults; we now lazy-load it only after 500ms of specialist inference.
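The first two guardrails reduce to simple threshold checks. The sketch below uses the numbers from the list (30% slippage, 42°C, 50ms and 30ms quanta, every third step skipped); the function names are hypothetical, and how the junction temperature is obtained is left out.

```cpp
constexpr double kSlipThreshold = 0.30;  // >30% below target counts as a miss
constexpr double kThrottleTempC = 42.0;  // junction temperature threshold

// True when the measured rate falls more than 30% short of the target.
inline bool missed_target(double measured_tok_s, double target_tok_s) {
    return measured_tok_s < target_tok_s * (1.0 - kSlipThreshold);
}

// Quantum length in milliseconds for a given junction temperature.
inline int quantum_ms(double junction_temp_c) {
    return junction_temp_c > kThrottleTempC ? 30 : 50;
}

// While throttled, skip every third decode step (0-indexed steps 2, 5, 8, ...).
inline bool should_decode(double junction_temp_c, unsigned step_index) {
    if (junction_temp_c <= kThrottleTempC) return true;
    return step_index % 3 != 2;
}
```

A model flagged by `missed_target` would then have its priority demoted, as described in item 1.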

Alternative Approaches

Model distillation: Train a single 300MB model to mimic all three specialists. Faster, simpler, but accuracy drops 8-12% on domain-specific tasks. Not acceptable for medical or legal use cases.

Cloud hybrid: Run classification on-device, offload generation to server. Latency: 200ms (network RTT) + 150ms (server cold start). Privacy-sensitive users reject this.

Speculative execution: Preload all models, guess which will be needed, evict losers. Works if prediction accuracy >80%; below that, thrashing dominates. Our router accuracy is 73%, making this unviable.

Benchmarks vs. Alternatives

| Approach | First Token (ms) | Memory (GB) | Throughput (tok/s) |
|---|---|---|---|
| Sequential load | 1,540 | 0.9 (peak) | 22 |
| Preload all | 85 | 3.2 | 24 |
| Interleaved decode | 180 | 2.1 | 18 |

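Interleaved decode gives up roughly 18-25% of throughput (18 tok/s vs. 22-24 tok/s) in exchange for about 34% less memory than preloading (2.1GB vs. 3.2GB) and roughly 8.5× faster time-to-first-token than sequential loading (180ms vs. 1,540ms). For interactive apps, this is the right tradeoff.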
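The ratios implied by the benchmark table can be recomputed directly. The struct and function names below are hypothetical; the numbers are copied from the table.

```cpp
// Hypothetical record for one row of the benchmark table.
struct Approach {
    const char* name;
    double first_token_ms;
    double memory_gb;
    double tok_per_s;
};

constexpr Approach kSequential = {"Sequential load", 1540.0, 0.9, 22.0};
constexpr Approach kPreload = {"Preload all", 85.0, 3.2, 24.0};
constexpr Approach kInterleaved = {"Interleaved decode", 180.0, 2.1, 18.0};

// How many times sooner the candidate produces its first token.
constexpr double first_token_speedup(const Approach& baseline,
                                     const Approach& candidate) {
    return baseline.first_token_ms / candidate.first_token_ms;
}

// Percentage by which `candidate` is lower than `baseline`.
constexpr double reduction_pct(double baseline, double candidate) {
    return 100.0 * (baseline - candidate) / baseline;
}

// first_token_speedup(kSequential, kInterleaved)                 is about 8.6
// reduction_pct(kPreload.memory_gb, kInterleaved.memory_gb)      is about 34.4
// reduction_pct(kPreload.tok_per_s, kInterleaved.tok_per_s)      is 25.0
// reduction_pct(kSequential.tok_per_s, kInterleaved.tok_per_s)   is about 18.2
```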

Future Directions

Three areas for improvement:

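  1. Core-aware scheduling: On A16+ chips with performance/efficiency core clusters, steer small models toward E-cores and large models toward P-cores. (This is heterogeneous-core scheduling rather than NUMA, since these chips use unified memory.) Preliminary tests show 10% power savings.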
  2. Predictive paging: Use model graph structure to prefetch weight pages 20ms before they're needed. Requires static analysis of transformer layer access patterns.
  3. Lower-level cooperative scheduling: Move the quantum scheduler closer to the kernel to reduce context-switch overhead from 1.2ms to ~100µs. iOS does not allow third-party kernel extensions, and DriverKit extensions run in user space, so this would need platform support from Apple.

Conclusion

Interleaved decode transforms multi-model AI from a memory management nightmare into a tractable scheduling problem. By treating inference as cooperative multitasking (slicing decode steps into quanta, memory-mapping weights, and pooling KV cache), we run three LLMs concurrently on a 4GB phone with sub-200ms time-to-first-token. The throughput penalty (18 tok/s vs. 22-24 tok/s for the alternatives) is a reasonable price for the roughly 8.5× speedup over sequential loading and about 34% memory savings over preloading.

For developers building sophisticated on-device AI products (medical assistants, legal research tools, privacy-focused chatbots), this pattern unlocks capabilities previously reserved for server-class hardware. The tradeoff space (latency, memory, throughput, complexity) is nuanced, but for interactive use cases where time-to-first-token dominates user perception, interleaved decode is a strong fit.