The Multi-Model Reality

Modern mobile AI applications rarely rely on a single large language model. A speech therapy app might combine a pronunciation scorer, a grammar checker, and a conversational tutor. A clinical assistant might run symptom extraction, drug interaction lookup, and patient education generation. Each task benefits from a specialized model, but loading three 1.2GB models simultaneously on a device with 4GB of user-accessible RAM is a recipe for OOM crashes and 15-second response times.

The naive approach, loading all models at launch, fails spectacularly. iOS terminates apps that exceed their memory limit, often with little or no warning. Android's low-memory killer is only slightly more forgiving. Even if you survive the initial load, inference with multiple active models triggers thermal throttling within 90 seconds, cutting performance by 60% and draining 8% battery per minute.

Interleaved execution solves this by treating models as time-sliced resources. Only one model occupies the GPU/NPU at any moment. Models are loaded just-in-time, evicted aggressively, and scheduled to minimize context switches. The result: three models running in 1.6GB.

If you can predict the next model with 70% accuracy, preload it during idle GPU time. A speech therapy app knows that after pronunciation scoring, 85% of users request the conversational tutor within 3 seconds. Load the tutor model immediately after scoring completes, while the UI displays results.

Prediction sources include user behavior logs, state machine transitions, and time-of-day patterns. Track model invocation sequences in analytics. If model B follows model A in 80% of sessions, preload B when A starts. For state-driven apps (onboarding flows, multi-step forms), the next state deterministically requires a specific model.
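One way to turn invocation logs into a preload decision is a simple transition count table. The sketch below is illustrative only: the type and method names (TransitionTable, record, predictNext) and the model identifiers are assumptions, and the 0.8 default simply mirrors the 80% figure above.

```swift
// Hypothetical sketch: counts observed "model A then model B" transitions.
struct TransitionTable {
    // counts[a][b] = number of times model b was invoked right after model a
    private var counts: [String: [String: Int]] = [:]

    // Record one observed transition from the analytics log.
    mutating func record(from previous: String, to next: String) {
        counts[previous, default: [:]][next, default: 0] += 1
    }

    // Return the most frequent successor of `current`, but only if its
    // share of observed transitions meets `threshold` (0.8 = 80%).
    func predictNext(after current: String, threshold: Double = 0.8) -> String? {
        guard let row = counts[current] else { return nil }
        let total = row.values.reduce(0, +)
        guard total > 0,
              let best = row.max(by: { $0.value < $1.value }) else { return nil }
        let share = Double(best.value) / Double(total)
        return share >= threshold ? best.key : nil
    }
}

// Usage with made-up session data: 17 of 20 sessions went to the tutor.
var table = TransitionTable()
for _ in 0..<17 { table.record(from: "pronunciation", to: "tutor") }
for _ in 0..<3 { table.record(from: "pronunciation", to: "grammar") }
let next = table.predictNext(after: "pronunciation")  // 17/20 = 0.85, so "tutor"
```

Returning nil below the threshold keeps the caller honest: no prediction means no speculative load. For state-driven flows the table is unnecessary, since the next model follows directly from the state machine.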

Preloading risks wasting memory if the prediction is wrong. Implement a 2-second timeout: if the preloaded model isn't used, evict it. Monitor prediction accuracy weekly and retrain your transition matrix. In a production speech app, predictive preloading reduced P95 latency from 680ms to 210ms, cutting perceived lag by 70%.
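The timeout and the accuracy tracking can live in one small piece of bookkeeping. This is a minimal sketch under stated assumptions: PreloadSlot and its methods are hypothetical names, it tracks a single speculative model, and the caller is responsible for actually unloading whatever expire returns.

```swift
import Foundation

// Hypothetical sketch: tracks one speculatively loaded model.
struct PreloadSlot {
    let timeout: TimeInterval          // 2 seconds in the text
    private(set) var modelID: String?  // currently preloaded model, if any
    private var loadedAt: Date?
    private(set) var hits = 0          // preloads that were actually used
    private(set) var misses = 0        // preloads evicted unused

    init(timeout: TimeInterval = 2.0) { self.timeout = timeout }

    // Note that a model was speculatively loaded at time `now`.
    mutating func preload(_ id: String, at now: Date = Date()) {
        modelID = id
        loadedAt = now
    }

    // Call when a task requests a model. Returns true on a preload hit.
    mutating func claim(_ id: String) -> Bool {
        guard modelID == id else { return false }
        hits += 1
        modelID = nil
        loadedAt = nil
        return true
    }

    // Call periodically. Returns the model to evict once the timeout has passed.
    mutating func expire(at now: Date = Date()) -> String? {
        guard let id = modelID, let start = loadedAt,
              now.timeIntervalSince(start) >= timeout else { return nil }
        misses += 1
        modelID = nil
        loadedAt = nil
        return id
    }

    // Fraction of preloads that were used; nil until there is data.
    var accuracy: Double? {
        let total = hits + misses
        return total == 0 ? nil : Double(hits) / Double(total)
    }
}
```

Passing the clock in as a parameter keeps the logic deterministic in tests. The accuracy value is the number to review when deciding whether the transition matrix needs retraining.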

Memory Management

Model weights dominate memory consumption. A quantized model with a 1.1GB weight file occupies roughly 1.1GB of RAM once loaded: no further compression, no paging. iOS does not swap app memory to disk; if you exceed the per-app memory limit, the system terminates the app.

Explicit Session Lifecycle

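Never rely on deferred cleanup to free model memory. Swift uses ARC rather than a garbage collector, so a model stays resident for as long as any strong reference to it exists. Release the model explicitly as soon as inference finishes. In Swift with Core ML: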

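import CoreML

var model: MLModel? = nil

// modelURL must point to a compiled Core ML model (.mlmodelc).
func runInference(input: MLFeatureProvider, modelURL: URL) async throws -> MLFeatureProvider {
    // Load just in time; the async loader does not block the caller.
    model = try await MLModel.load(contentsOf: modelURL, configuration: MLModelConfiguration())
    defer { model = nil }  // Explicit release, also when prediction throws
    return try model!.prediction(from: input)
}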

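The nil assignment drops the stored property's strong reference, so ARC can deallocate the model as soon as nothing else retains it. Without it, the property keeps 1.1GB allocated until it is next overwritten, and any autoreleased references can delay the release until the next autorelease pool drain. In Kotlin with ONNX Runtime, call session.close() in a finally block.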

Shared Weight Buffers

If multiple models share a common backbone (e.g., three fine-tuned variants of the same base model), load the shared layers once and swap only the task-specific heads. This requires exporting models with split weight files—base.onnx and head_A.onnx—and manually composing the computation graph.

ONNX Runtime supports this via session options that map external weight tensors. Load base.onnx once, then create three sessions sharing the base weights but loading different head files. Memory usage drops from 3.3GB (three full models) to 1.4GB (one base + three heads). Inference latency is identical because the shared weights remain in memory.

The export process is non-trivial. Use ONNX GraphSurgeon to split the model at the final pooling layer. Serialize base layers to base.onnx with external data enabled (save_as_external_data=True when calling onnx.save). Export each head separately, ensuring input shapes match the base output. Test thoroughly: shape mismatches cause silent corruption.

Thermal and Power Constraints

Running inference on ANE (Apple Neural Engine) or Adreno GPU generates 3-4W of heat. Sustained load triggers thermal throttling within 60-120 seconds, depending on ambient temperature and device model. Throttling reduces clock speeds by 40-60%, doubling inference latency.

Interleaved execution naturally mitigates this by inserting idle periods between models. A 200ms inference followed by 300ms of UI rendering and I/O allows the SoC to cool. If models run back-to-back with no gaps, insert explicit 100-150ms sleeps between sessions. This feels counterintuitive—adding delays to reduce latency—but thermal throttling is worse.

Monitor thermal state via ProcessInfo.processInfo.thermalState on iOS or PowerManager.getCurrentThermalStatus() on Android. When the state exceeds .nominal, defer non-critical model loads or switch to smaller quantized variants. A 4-bit quantized model runs 40% faster and generates 35% less heat than the 8-bit version, at the cost of 2-3% accuracy.
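On iOS, the thermal check reduces to a small mapping from ProcessInfo.ThermalState to a loading policy. The LoadPolicy enum and the exact split between states are assumptions made for illustration; the text above only says to back off once the state exceeds .nominal.

```swift
import Foundation

// Hypothetical policy type; the cases mirror the options described above.
enum LoadPolicy: Equatable {
    case normal             // load any model, speculative preloads allowed
    case deferNonCritical   // skip speculative and background loads
    case smallVariantsOnly  // also switch to the smaller quantized variant
}

// Assumed mapping: back off gradually as the device heats up.
func loadPolicy(for state: ProcessInfo.ThermalState) -> LoadPolicy {
    switch state {
    case .nominal:            return .normal
    case .fair:               return .deferNonCritical
    case .serious, .critical: return .smallVariantsOnly
    @unknown default:         return .smallVariantsOnly  // be conservative
    }
}

// Read the current state on demand, for example before each model load.
let policy = loadPolicy(for: ProcessInfo.processInfo.thermalState)
```

Keeping the mapping a pure function makes it easy to unit test and to tune per device class. An app that prefers push over polling can observe ProcessInfo.thermalStateDidChangeNotification and recompute the policy when it fires.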

Battery impact scales with GPU utilization. A 200ms inference at 80% GPU load consumes ~0.15% battery. Over a 10-minute session with 30 inferences, that's 4.5%. Interleaving reduces average GPU utilization to 50-60% by spreading load over time, cutting battery drain to 3%.

Practical Implementation

A production scheduler for a clinical speech app manages three models: a phoneme classifier (340MB), a fluency scorer (780MB), and a conversational LLM (1.1GB). The app runs on iPhone 12 and above (4GB RAM).

The scheduler maintains a task queue with two priority levels. Foreground tasks (user taps 'Analyze') preempt background tasks (pre-generating conversation starters). Each task specifies model_id, input, and a completion callback.

On dequeue, the scheduler checks if the required model is loaded. If not, it evicts the current model (if any) and loads the new one. Loading uses memory-mapped I/O to avoid doubling memory usage during the read. The model file lives in the app bundle, mapped read-only.

After inference, the scheduler checks the queue. If empty and thermal state is nominal, it predictively preloads the conversational LLM (the most commonly requested model). If the queue has pending tasks, it immediately loads the next model. The evicted model's memory is released before the new model loads, ensuring peak usage never exceeds 1.3GB (model + 200MB overhead).
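The dequeue, evict, load, and preload steps above can be sketched as follows. Every name here (ModelBackend, InferenceTask, Scheduler, RecordingBackend) is hypothetical, inputs are plain strings for brevity, and the sketch is single-threaded: a real scheduler would serialize access with an actor or a serial queue. Foreground tasks jump ahead of queued background tasks; an inference that is already running is not interrupted.

```swift
// Hypothetical abstraction over the real model runtime.
protocol ModelBackend {
    func load(_ modelID: String)
    func unload(_ modelID: String)
    func run(_ modelID: String, input: String) -> String
}

enum Priority { case foreground, background }

struct InferenceTask {
    let modelID: String
    let input: String
    let priority: Priority
    let completion: (String) -> Void
}

final class Scheduler {
    private let backend: ModelBackend
    private var foreground: [InferenceTask] = []
    private var background: [InferenceTask] = []
    private(set) var loadedModel: String?

    init(backend: ModelBackend) { self.backend = backend }

    func enqueue(_ task: InferenceTask) {
        switch task.priority {
        case .foreground: foreground.append(task)
        case .background: background.append(task)
        }
    }

    // Evict before loading so two models are never resident at once.
    private func ensureLoaded(_ modelID: String) {
        guard loadedModel != modelID else { return }
        if let current = loadedModel { backend.unload(current) }
        backend.load(modelID)
        loadedModel = modelID
    }

    // Run one task, foreground first. Returns false when the queue is empty.
    @discardableResult
    func runNext() -> Bool {
        let task: InferenceTask
        if !foreground.isEmpty { task = foreground.removeFirst() }
        else if !background.isEmpty { task = background.removeFirst() }
        else { return false }
        ensureLoaded(task.modelID)
        task.completion(backend.run(task.modelID, input: task.input))
        return true
    }

    // Call when the thermal state is nominal; does nothing if work is pending.
    func preloadIfIdle(_ modelID: String) {
        if foreground.isEmpty && background.isEmpty { ensureLoaded(modelID) }
    }
}

// Stand-in backend that only records calls, for demonstration.
final class RecordingBackend: ModelBackend {
    private(set) var log: [String] = []
    func load(_ modelID: String) { log.append("load \(modelID)") }
    func unload(_ modelID: String) { log.append("unload \(modelID)") }
    func run(_ modelID: String, input: String) -> String {
        log.append("run \(modelID)")
        return "ok"
    }
}

let backend = RecordingBackend()
let scheduler = Scheduler(backend: backend)
scheduler.enqueue(InferenceTask(modelID: "llm", input: "starter", priority: .background) { _ in })
scheduler.enqueue(InferenceTask(modelID: "phoneme", input: "audio", priority: .foreground) { _ in })
scheduler.runNext()  // foreground phoneme task runs first
scheduler.runNext()  // evicts phoneme, loads llm, runs the background task
```

Putting the eviction inside ensureLoaded is what enforces the one-resident-model invariant: no code path can load a model without first unloading the previous one.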

Telemetry shows P95 latency of 420ms for foreground tasks and zero thermal throttling events over 8-minute sessions. Battery drain is 6% per session, down from 11% with the naive all-models-loaded approach.

Debugging and Telemetry

Interleaved execution is hard to debug. Model load failures, memory spikes, and race conditions are non-deterministic. Instrument aggressively.

Log every model load and unload with timestamps and memory snapshots. Use os_signpost on iOS to mark model lifecycle events in Instruments. Track peak memory usage per session—if it exceeds your budget, you have a leak or failed eviction.

Monitor task queue depth. If it grows unbounded, your models are slower than your input rate. Add backpressure: drop low-priority tasks when the queue exceeds 5 items. Expose queue depth and model load times in a debug overlay during development.
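A bounded queue is one way to apply that backpressure. The sketch below is an illustration rather than a prescribed design: BoundedQueue is a hypothetical type, and dropping the oldest background item first is an assumption (dropping the newest is equally defensible). Foreground items are never dropped, so the queue can still exceed the limit if every item is foreground.

```swift
// Hypothetical bounded queue: sheds background work once depth exceeds `limit`.
struct BoundedQueue<T> {
    enum Priority { case foreground, background }

    private var items: [(priority: Priority, value: T)] = []
    let limit: Int
    private(set) var dropped = 0  // expose in the debug overlay

    init(limit: Int = 5) { self.limit = limit }

    var depth: Int { items.count }

    mutating func push(_ value: T, priority: Priority) {
        items.append((priority: priority, value: value))
        // Drop the oldest background items until depth is back within the limit.
        while items.count > limit,
              let index = items.firstIndex(where: { $0.priority == .background }) {
            items.remove(at: index)
            dropped += 1
        }
    }

    mutating func pop() -> T? {
        // Foreground first, then FIFO among background items.
        if let index = items.firstIndex(where: { $0.priority == .foreground }) {
            return items.remove(at: index).value
        }
        return items.isEmpty ? nil : items.removeFirst().value
    }
}
```

The dropped counter is worth logging alongside queue depth: a steadily rising value means the input rate is outrunning the models even though the queue itself looks healthy.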

Crash reports from memory pressure are opaque. Add a memory monitor that logs available RAM every 500ms. When a crash occurs, the last log entry reveals whether you hit the limit or something else failed. On iOS, use os_proc_available_memory() for accurate readings.
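A periodic monitor along these lines could look like the sketch below. MemoryMonitor is a hypothetical class; the reading function and the log sink are injected so the same code can be exercised in tests, and the default sink just prints. In production the sink should write somewhere that survives process termination, and shared state would need synchronization because the timer fires on a background queue.

```swift
import Foundation

// Hypothetical sketch: logs available memory on a fixed interval.
final class MemoryMonitor {
    private let read: () -> UInt64     // returns available bytes
    private let sink: (String) -> Void // where log lines go
    private var timer: DispatchSourceTimer?
    private(set) var lastReading: UInt64?

    init(read: @escaping () -> UInt64,
         sink: @escaping (String) -> Void = { print($0) }) {
        self.read = read
        self.sink = sink
    }

    // Take one reading and emit it as a log line in whole megabytes.
    func sample() {
        let bytes = read()
        lastReading = bytes
        sink("available_mb=\(bytes / 1_048_576)")
    }

    // Sample every 500 ms on a dedicated background queue.
    func start() {
        let source = DispatchSource.makeTimerSource(queue: DispatchQueue(label: "memory-monitor"))
        source.schedule(deadline: .now(), repeating: .milliseconds(500))
        source.setEventHandler { [weak self] in self?.sample() }
        source.resume()
        timer = source
    }

    func stop() {
        timer?.cancel()
        timer = nil
    }
}
```

On iOS 13 and later, the read closure would typically wrap the system call, for example { UInt64(os_proc_available_memory()) } with the os module imported. That wiring is left out of the block so the sketch stays platform-neutral.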

When Not to Interleave

Interleaving adds complexity. If your app uses a single model or has a RAM budget of 6GB or more, load everything at launch. The sequential pipeline pattern also breaks down for truly concurrent workloads, such as real-time video processing where three models must run in parallel to hit 30fps. In that case, use three separate GPU contexts and accept the memory cost, or move to a model fusion approach where a single multi-task model replaces three specialists.

For most mobile AI apps, though, interleaved execution is the difference between a product that ships and one that crashes in the field.