Partial Model Swapping: Hot-Reload LLM Layers

Full model replacement in mobile LLMs is expensive: a 1.3GB model reload can stall the UI for 4–8 seconds on mid-tier devices. When users switch between tasks (summarization to code completion, casual chat to medical Q&A), naive implementations discard every weight tensor and reload the full model from storage. This degrades the user experience and wastes thermal budget on redundant I/O.

Partial model swapping exploits the modular structure of transformer architectures. Instead of replacing all 24 layers, we hot-swap only the task-specific subset, typically the final 4–6 decoder blocks and the output projection head. The shared embedding and early layers remain resident in memory. In production, this cuts swap latency from 6.2s to 890ms and reduces peak memory by 65% on a 7B-parameter model quantized to INT4.

Transformer Block Boundaries as Swap Units

Modern LLMs are stacks of near-identical transformer blocks: multi-head self-attention, feed-forward network, layer norms. Each block operates on a fixed-width hidden state and can be independently replaced if we preserve tensor dimensions. The key insight: early layers learn universal representations (syntax, entity recognition, basic reasoning), while later layers specialize for task-specific output formatting.

In a 24-layer model, layers 0–17 handle feature extraction. Layers 18–23 and the unembedding matrix shape the token distribution for the target domain. A code-completion head might favor braces and semicolons; a medical chatbot head boosts clinical terminology. By freezing the trunk and swapping only the head, we achieve 70–80% of full fine-tuning performance at 6× lower memory cost.

Memory Layout for Zero-Copy Swaps

Standard ONNX Runtime and llama.cpp load models as monolithic blobs. Partial swapping requires a segmented layout. We use memory-mapped files with explicit region markers:

// Model file structure
Header (64 bytes): magic, version, layer_count, hidden_dim
Shared trunk (layers 0-17): 980 MB, mmap offset 0x40
Task heads directory: 128 bytes, offset 0x3D100000
  - chat_head: 340 MB at offset 0x3D100080
  - code_head: 340 MB at offset 0x51800000
  - summarize_head: 340 MB at offset 0x65F00000

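On iOS, we use mmap() with MAP_SHARED to map the trunk once at app launch. Task heads are mapped on demand, then passed to madvise(MADV_SEQUENTIAL) to hint the kernel about access patterns. Because mmap() requires page-aligned file offsets, a head that starts mid-page (such as chat_head at 0x3D100080) has to be mapped from the enclosing page boundary and addressed through an adjusted pointer. When swapping from chat to code, we call munmap() on the old head region and mmap() the new one. The trunk pointer never changes, so inference can resume immediately after updating the layer dispatch table.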
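The segmented mapping can be illustrated with a minimal Python sketch. It is not the production iOS code: the header field widths (a 4-byte magic followed by three 32-bit integers) are assumptions, since the text only names the fields, and the helper names read_header, map_region, and unmap_region are hypothetical. The part worth noting is the alignment handling, because a region that starts mid-page cannot be passed to mmap directly.

```python
import mmap
import struct

# Assumed header layout: the text lists the fields in the 64-byte header
# but not their widths, so a 4-byte magic, three little-endian uint32
# fields, and 48 bytes of padding are a guess made for this sketch.
HEADER_FORMAT = "<4sIII48x"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 64 bytes


def read_header(path):
    """Parse the (assumed) fixed-size header at the start of the model file."""
    with open(path, "rb") as f:
        magic, version, layer_count, hidden_dim = struct.unpack(
            HEADER_FORMAT, f.read(HEADER_SIZE)
        )
    return {"magic": magic, "version": version,
            "layer_count": layer_count, "hidden_dim": hidden_dim}


def map_region(path, offset, length):
    """Map file bytes [offset, offset + length) read-only.

    Returns (mapping, view). The mapping starts at the nearest aligned
    offset at or below `offset`; `view` skips the leading slack so that
    view[0] is the first byte of the requested region.
    """
    aligned = offset - offset % mmap.ALLOCATIONGRANULARITY
    slack = offset - aligned
    with open(path, "rb") as f:
        mapping = mmap.mmap(f.fileno(), slack + length,
                            access=mmap.ACCESS_READ, offset=aligned)
    view = memoryview(mapping)[slack:slack + length]
    return mapping, view


def unmap_region(mapping, view):
    """Release the view first; closing a mapping with live views raises BufferError."""
    view.release()
    mapping.close()
```

Swapping a head then amounts to calling unmap_region on the old head and map_region on the new one, while the trunk mapping stays untouched.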

Execution Graph Surgery

ONNX Runtime and Metal Performance Shaders maintain execution graphs with fixed operator sequences. Swapping layers requires rebuilding the graph tail without tearing down the entire session. We fork the graph at layer 17's output:

// Pseudocode for graph fork
let trunk_output = run_layers(input_tokens, layers: 0...17)
let head_graph = load_head_graph(task_id)
let logits = head_graph.run(trunk_output)

In practice, we call ONNX Runtime's InferenceSession.run_with_iobinding() with trunk_output bound as a pre-allocated input through an IOBinding. The head graph sees it as an external input, avoiding a copy. On Metal, we use MPSGraphExecutable with MPSGraphTensorData backed by a shared MTLBuffer. The trunk writes to offset 0 and the head reads from offset 0, giving a zero-copy handoff.

Quantization Alignment

INT4 quantization complicates swaps because scale factors and zero-points differ per layer. If the trunk uses per-tensor quantization and the head uses per-channel, dequantization at the boundary introduces a 12–18ms stall. We enforce uniform quantization schemes across swap boundaries: all layers use symmetric INT4 with group size 128. This adds 2–3% perplexity cost but enables sub-millisecond graph stitching.
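To make the uniform scheme concrete, here is a small pure-Python sketch of symmetric group-wise INT4 quantization plus a boundary compatibility check. The QuantScheme type and the can_stitch, quantize, and dequantize helpers are hypothetical names, and clamping codes to [-7, 7] is an assumption (some schemes also use -8). A real kernel would pack two codes per byte and run on the GPU.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantScheme:
    bits: int
    symmetric: bool
    group_size: int


# The scheme the text enforces on both sides of a swap boundary.
UNIFORM = QuantScheme(bits=4, symmetric=True, group_size=128)


def can_stitch(trunk_scheme, head_scheme):
    """A head can be attached without requantization only if schemes match."""
    return trunk_scheme == head_scheme


def quantize(weights, scheme=UNIFORM):
    """Return a list of (scale, codes) pairs, one per group of weights."""
    if len(weights) % scheme.group_size:
        raise ValueError("weight count must be a multiple of the group size")
    qmax = 2 ** (scheme.bits - 1) - 1  # 7 for INT4 (assumed symmetric range)
    groups = []
    for start in range(0, len(weights), scheme.group_size):
        group = weights[start:start + scheme.group_size]
        peak = max(abs(w) for w in group)
        scale = peak / qmax if peak > 0 else 1.0
        codes = [max(-qmax, min(qmax, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize(groups):
    """Reconstruct approximate weights from (scale, codes) pairs."""
    return [scale * code for scale, codes in groups for code in codes]
```

Because symmetric quantization has no zero-point, each group carries only one scale, which keeps the per-group metadata identical in shape on both sides of the boundary.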

For models with mixed precision (INT4 attention, INT8 FFN), we insert a lightweight requantization op at layer 17. It costs 4ms on Apple A15 but is far cheaper than full model reload. The requantization kernel fuses dequant, clip, and quant into a single Metal compute pass using SIMD group operations.

Thermal and Power Budgets

Swapping triggers I/O and Metal shader compilation. On sustained use—users switching tasks every 30 seconds—naive swaps can push SoC temperature from 38°C to 46°C, triggering CPU throttling. We mitigate this with predictive preloading: when the user opens the task picker UI, we speculatively mmap() the two most likely heads (based on usage history) in the background. If they switch to a preloaded head, swap latency drops to 140ms (just graph rebuild, no I/O).
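The head selection for predictive preloading could be as simple as a frequency count over recent usage. The sketch below is one possible policy, not necessarily the one used in production: the function name heads_to_preload is hypothetical, and excluding the currently active head from the candidates is an assumption.

```python
from collections import Counter


def heads_to_preload(usage_history, current_head, k=2):
    """Pick the k most frequently used heads, skipping the active one.

    usage_history is a list of head names, oldest first. Ties keep the
    order in which heads first appear in the history.
    """
    counts = Counter(h for h in usage_history if h != current_head)
    return [head for head, _ in counts.most_common(k)]
```

For example, heads_to_preload(["chat", "code", "chat", "summarize", "code", "code"], "chat") selects "code" and then "summarize". A recency-weighted count would be a natural refinement if usage patterns drift over time.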

Power measurements on iPhone 13 Pro show:

Full model reload: 2.8W for 6.2s = 4.8mWh
Cold head swap (I/O + compile): 1.9W for 890ms = 0.47mWh
Warm head swap (preloaded): 0.6W for 140ms = 0.023mWh

Preloading costs 0.3W for 200ms per head (about 0.017mWh), so even with two speculative loads, a warm swap saves roughly 4.7mWh compared to a full reload.
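The energy figures follow from power multiplied by duration. A minimal sketch of that arithmetic, expressed in milliwatt-hours (1 mWh = 3.6 J), is below; the function name energy_mwh is illustrative. Converting to mAh would additionally require the battery voltage, which the measurements above do not include.

```python
def energy_mwh(power_watts, duration_s):
    """Energy in milliwatt-hours: watts * seconds = joules, and 1 mWh = 3.6 J."""
    return power_watts * duration_s / 3.6


full_reload = energy_mwh(2.8, 6.2)     # about 4.82 mWh
cold_swap = energy_mwh(1.9, 0.89)      # about 0.47 mWh
warm_swap = energy_mwh(0.6, 0.14)      # about 0.023 mWh
preload_one = energy_mwh(0.3, 0.2)     # about 0.017 mWh per head

# A warm swap preceded by two speculative preloads, versus a full reload.
saving = full_reload - (warm_swap + 2 * preload_one)  # about 4.77 mWh
```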

Productionizing Swap Policies

In HearingAid Pro, we use partial swapping to toggle between noise suppression and voice enhancement modes without restarting the audio pipeline. The base model (12 layers) runs continuously; the output head swaps based on ambient noise level detected by a separate classifier. Swap latency is masked by a 200ms crossfade in the audio buffer.

For OfflineAI, a multi-task LLM app, we expose three heads: chat, summarization, and code completion. Initial user testing showed 40% of sessions involved at least two task switches. By implementing partial swapping, we reduced task-switch latency from 7.1s to 1.2s (83% improvement) and cut peak memory from 3.2GB to 1.8GB, allowing the app to run on 4GB devices without jank.

Failure Modes and Rollback

Swap failures occur when:

mmap() fails due to memory pressure (OOM imminent)
The filesystem is corrupted (rare, but seen on 0.02% of devices)
Graph rebuild exceeds 500ms timeout (shader compile stall)

We maintain a fallback trunk-only mode: if a swap fails, we run inference using only layers 0–17 and a tiny 8MB universal head trained on all tasks. Output quality drops 15–20 BLEU points, but the app remains functional. We log the failure to analytics and retry the swap after 30 seconds with exponential backoff.
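The fallback and retry behavior can be sketched as a small state machine. Everything here is illustrative: HeadSwapper, UNIVERSAL_HEAD, and the load_head callable are hypothetical names, the doubling factor and the 480-second cap are assumptions (the text only specifies a 30-second first retry with exponential backoff), and the clock is passed in explicitly so the logic can be exercised without sleeping.

```python
UNIVERSAL_HEAD = "universal"  # stands in for the 8MB universal head


class HeadSwapper:
    def __init__(self, load_head, base_delay_s=30.0, max_delay_s=480.0):
        # load_head: hypothetical callable, task_id -> head object.
        # It is expected to raise OSError (mmap failure, corrupt file)
        # or TimeoutError (graph rebuild too slow) on failure.
        self.load_head = load_head
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s  # assumed cap, not from the text
        self.active = UNIVERSAL_HEAD
        self.failures = 0
        self.next_retry_at = 0.0

    def swap(self, task_id, now):
        """Try to activate the head for task_id; return True on success."""
        if now < self.next_retry_at:
            return False  # still backing off, keep the current head
        try:
            head = self.load_head(task_id)
        except (OSError, MemoryError):  # TimeoutError is a subclass of OSError
            self.active = UNIVERSAL_HEAD
            delay = min(self.base_delay_s * 2 ** self.failures, self.max_delay_s)
            self.next_retry_at = now + delay
            self.failures += 1
            return False
        self.active = head
        self.failures = 0
        self.next_retry_at = 0.0
        return True
```

With these defaults, consecutive failures schedule retries 30, 60, 120 seconds apart and so on, up to the cap, while inference continues on the universal head.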

Benchmark: Real-World Latency

Testing on iPhone 12 (A14, 4GB RAM) with a 7B-parameter model (INT4, 1.3GB):

Operation              Latency (ms)    Memory Delta (MB)
Full model reload      6,200           +1,300
Cold head swap         890             +340
Warm head swap         140             +340
Trunk-only fallback    50              +8

Token generation latency remains unchanged at 42ms/token (24 tokens/sec) because the trunk and head execute identically to a monolithic model—only the load path differs.

Implementation Checklist

To add partial swapping to an existing mobile LLM:

Profile layer specialization: Fine-tune heads on task-specific data, freeze trunk weights, measure perplexity delta. If