Running a single LLM inference session on a mobile device is now routine—llama.cpp, ONNX Runtime Mobile, and CoreML all ship production-ready engines. But what happens when your app needs to handle multiple concurrent users or parallel tasks on the same model? A voice assistant answering a query while summarizing background text. A collaborative note-taking app with three users typing simultaneously. A clinical decision-support tool processing patient records in parallel.
Naive approaches fail fast: spawn three separate inference sessions and iOS kills your app at 2.1GB. Queue requests serially and latency becomes unacceptable—18 seconds to answer three 200-token prompts sequentially on an A15. The solution lies in split-batch inference: a technique borrowed from server-side LLM deployments, adapted for mobile's memory and thermal constraints.
The Memory Wall: Why Separate Sessions Don't Scale
Each LLM inference session maintains a key-value cache—a matrix storing computed attention keys and values for every token processed. For a 7B parameter model with 32 layers and 4096-dimensional embeddings, each token consumes roughly 1MB of KV cache (FP16). A 512-token conversation requires 512MB per session. Three concurrent sessions: 1.5GB before you've allocated the model weights themselves.
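The sizing arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes standard multi-head attention (one key and one value vector of width d_model per layer, no grouped-query sharing); the function names are illustrative, not part of any library. With the dimensions quoted above it gives 0.5 MiB per token at 2 bytes per element and 1 MiB at 4 bytes, so the 1MB figure is best read as a conservative budget.

```python
MIB = 1024 * 1024


def kv_bytes_per_token(n_layers: int, d_model: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache for one token: a key and a value vector per layer."""
    return 2 * n_layers * d_model * bytes_per_elem


def kv_bytes_per_session(n_tokens: int, n_layers: int = 32, d_model: int = 4096,
                         bytes_per_elem: int = 2) -> int:
    """Total KV cache for one session holding n_tokens of context."""
    return n_tokens * kv_bytes_per_token(n_layers, d_model, bytes_per_elem)


if __name__ == "__main__":
    # Element widths: 2 bytes (FP16) and 4 bytes (FP32).
    for label, width in (("2-byte", 2), ("4-byte", 4)):
        per_token = kv_bytes_per_token(32, 4096, width)
        session = kv_bytes_per_session(512, bytes_per_elem=width)
        print(f"{label}: {per_token / MIB:.2f} MiB/token, "
              f"{session / MIB:.0f} MiB per 512-token session")
```

Multiplying the per-session figure by the number of concurrent sessions gives the KV budget that has to fit alongside the weights.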
On an iPhone 14 Pro with 6GB of total RAM, the OS reserves ~2GB, leaving about 4GB for apps. After loading a quantized 7B model (3.8GB for Q4_K_M), roughly 200MB of headroom remains. A single short conversation fits. Two concurrent sessions trigger memory warnings. Three cause jetsam termination.
The core insight: model weights are read-only and shareable. The KV cache is not. But we can interleave decode steps across multiple requests, sharing the weight memory while keeping separate KV state.
Interleaved Decode: Round-Robin Token Generation
Standard autoregressive decoding processes one request at a time: load prompt, generate token 1, append to KV cache, generate token 2, repeat until EOS. Split-batch decoding runs a fixed-size batch of requests in lockstep, generating one token per request per iteration.
Pseudocode for a batch size of 3:
requests = [req_A, req_B, req_C]
kv_caches = [cache_A, cache_B, cache_C]

while any_active(requests):
    for i, req in enumerate(requests):
        if req.finished:
            continue
        # One forward pass: shared weights, this request's own KV cache
        token = model.decode_one(
            input=req.last_token,
            kv_cache=kv_caches[i]
        )
        req.append(token)
        if token == EOS:
            req.finished = True

Each decode_one call performs a forward pass through all 32 transformer layers, reading shared weights but writing only to that request's KV cache. Memory footprint: 3.8GB weights + (3 × 512MB) KV = 5.3GB at FP16, manageable only if we keep prompts under 512 tokens and use 8-bit KV quantization, which halves the cache to about 768MB across the three requests.
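To make the round-robin loop concrete, here is a runnable toy version. Everything in it is a stand-in: ToyModel, Request, the EOS id, and the max_new_tokens cap are hypothetical names for illustration, and the "model" simply counts down to EOS. The point is only to show one shared model object serving several requests while each request writes to its own cache.

```python
from dataclasses import dataclass, field

EOS = 0  # hypothetical end-of-sequence token id


@dataclass
class Request:
    prompt: list[int]
    max_new_tokens: int
    output: list[int] = field(default_factory=list)
    finished: bool = False

    @property
    def last_token(self) -> int:
        return self.output[-1] if self.output else self.prompt[-1]


class ToyModel:
    """Stand-in for a real engine: one shared object, per-request KV state."""

    def decode_one(self, token: int, kv_cache: list[int]) -> int:
        kv_cache.append(token)       # only this request's cache is written
        return max(token - 1, EOS)   # toy rule: count down until EOS


def run_interleaved(model: ToyModel, requests: list[Request]) -> list[list[int]]:
    # Pretend prefill: everything but the last prompt token is already cached.
    kv_caches = [list(r.prompt[:-1]) for r in requests]
    while any(not r.finished for r in requests):
        for i, req in enumerate(requests):
            if req.finished:
                continue
            token = model.decode_one(req.last_token, kv_caches[i])
            req.output.append(token)
            if token == EOS or len(req.output) >= req.max_new_tokens:
                req.finished = True
    return kv_caches


if __name__ == "__main__":
    reqs = [Request([3], 10), Request([5], 10), Request([2], 10)]
    caches = run_interleaved(ToyModel(), reqs)
    for r, c in zip(reqs, caches):
        print(r.output, c)
```

Like the pseudocode, this loop issues one forward pass per request per round. Engines that support batched decoding can instead fuse a round into a single pass over the weights, which is where most of the throughput gain would be expected to come from.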
Latency changes character. Instead of 6 seconds per request sequentially (18s total), all three requests complete in ~8 seconds. That is about 33% longer than a single request, due to cache thrashing, but 2.25× faster than serial execution. The key tradeoff: individual request latency increases, but throughput improves and memory stays bounded.
Dynamic Batch Sizing: Adapting to Device State
A fixed batch size of 3 works on an iPhone 14 Pro. On an iPhone 12 with 4GB of RAM, it runs out of memory. On an iPad Pro M2 with 8GB, we're leaving performance on the table. The batch size must adapt to available memory and thermal state.
We measure at runtime:
available_ram = os_proc_available_memory()
model_size = 3.8GB
per_request_budget = 400MB  // KV + activations
max_batch = floor(
    (available_ram - model_size) / per_request_budget
)
On a cold device with 1.2GB free, max_batch = 2. After 90 seconds of inference when thermals kick in and the system reclaims memory, we drop to max_batch = 1. The scheduler must gracefully handle batch size changes mid-flight—requests don't fail, they just queue until a slot opens.
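The sizing rule and the "queue until a slot opens" behaviour can be sketched together in Python. This is an illustration under stated assumptions, not the scheduler used in any particular engine: SlotScheduler and the weights_resident flag are hypothetical. The flag covers the case where available memory is sampled after the weights are already loaded, in which case subtracting the model size a second time would undercount. Shrinking the limit never evicts an in-flight request here; that policy is also an assumption.

```python
from collections import deque

MB = 1000 ** 2


def max_batch(available_ram: int, model_size: int = 3800 * MB,
              per_request_budget: int = 400 * MB,
              weights_resident: bool = False) -> int:
    """Batch-size formula from the text, clamped so it never goes negative."""
    usable = available_ram if weights_resident else available_ram - model_size
    return max(0, usable // per_request_budget)


class SlotScheduler:
    """Keeps at most `limit` requests active; the rest wait in FIFO order."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.active: list[str] = []
        self.waiting: deque[str] = deque()

    def submit(self, request_id: str) -> None:
        self.waiting.append(request_id)
        self._fill()

    def finish(self, request_id: str) -> None:
        self.active.remove(request_id)
        self._fill()

    def set_limit(self, limit: int) -> None:
        # Shrinking only pauses admissions; in-flight requests run to completion.
        self.limit = limit
        self._fill()

    def _fill(self) -> None:
        while self.waiting and len(self.active) < self.limit:
            self.active.append(self.waiting.popleft())


if __name__ == "__main__":
    sched = SlotScheduler(max_batch(5000 * MB))
    for rid in ("a", "b", "c", "d"):
        sched.submit(rid)
    print(sched.active, list(sched.waiting))
```

A caller would re-sample available memory periodically and pass the result to set_limit, so a shrinking budget drains the batch gradually instead of failing requests.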
Thermal throttling complicates this further. On A15, sustained inference at batch size 3 triggers CPU frequency scaling after ~60 seconds, raising per-token latency from 85ms to 140ms. We monitor ProcessInfo.thermalState and preemptively reduce batch size at .serious (before hitting .critical and forced throttling). This keeps latency predictable: better to process two requests at 90ms/token than three at 150ms/token.
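One way to express this policy is a small function that caps the memory-derived batch size by thermal state. The sketch below is in Python for consistency with the pseudocode. The ThermalState enum only mirrors the four cases of ProcessInfo.ThermalState, and in a real app the value would come from the platform API. Shedding exactly one slot at .serious and falling back to a single request at .critical are assumptions of this sketch.

```python
from enum import IntEnum


class ThermalState(IntEnum):
    """Mirrors the four cases of ProcessInfo.ThermalState on Apple platforms."""
    NOMINAL = 0
    FAIR = 1
    SERIOUS = 2
    CRITICAL = 3


def thermal_batch_limit(memory_limit: int, state: ThermalState) -> int:
    """Cap the memory-derived batch size according to thermal pressure."""
    if memory_limit <= 0:
        return 0
    if state >= ThermalState.CRITICAL:
        return 1                          # assumption: fall back to serial decoding
    if state == ThermalState.SERIOUS:
        return max(1, memory_limit - 1)   # shed one slot before forced throttling
    return memory_limit


if __name__ == "__main__":
    for state in ThermalState:
        print(state.name, thermal_batch_limit(3, state))
```

By the text's own figures, two requests at 90ms/token yield about 22 tokens/s in aggregate, against 20 tokens/s for three at 150ms/token, so the smaller batch comes out ahead on both latency and throughput.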
KV Cache Eviction: Handling Long Contexts
Split-batch assumes bounded context length. What if request A hits 512 tokens while B and C are still at 200? We can't grow cache_A indefinitely. Three strategies:
1. Hard truncation: Drop oldest tokens from KV cache once we hit the limit. Simplest, but the model loses context—fine for stateless queries, breaks conversational apps.
2. Sliding window: Keep only the most recent N tokens. Works well for summarization tasks where early context becomes irrelevant. In a clinical app processing a 2000-token patient record, we used a 512-token window that slid forward, re-encoding the window boundary every 256 tokens. Accuracy drop: