Gradient Checkpointing for Mobile LLMs: 75% Less RAM

The Memory Wall in Mobile LLM Fine-Tuning

Fine-tuning a 1.3B parameter language model on-device sounds ambitious, but user privacy and data sovereignty make it compelling for medical, legal, and enterprise apps. The primary blocker is not compute—modern mobile GPUs can handle 20-30 GFLOPS—but memory. A naive backward pass stores every activation from the forward pass, ballooning RAM to 4-6GB for even modest models. On a device with 6GB total and the OS reserving 2GB, you crash before the first gradient update completes.

Gradient checkpointing (also called activation checkpointing or rematerialization) trades compute for memory. Instead of caching all intermediate activations, you save only a subset—typically at transformer block boundaries—and recompute the rest during the backward pass. This technique, popularized by Chen et al. in 2016 for training deep CNNs, reduces peak memory from O(n) to O(√n) for an n-layer network. For a 24-layer transformer, that's the difference between 6GB and 1.2GB.

Where the Memory Goes

In a forward pass through a transformer block, you compute:

Multi-head attention: query, key, value projections, attention scores, context vectors
Feed-forward network: two linear layers with a GELU activation
Layer normalization: mean and variance statistics
Residual connections: element-wise additions

Each intermediate tensor persists in memory until backpropagation needs its gradient. For a 1024-hidden-dim model with batch size 1 and sequence length 512, a single attention score matrix alone occupies 512×512×2 bytes = 512KB. Multiply by 12 heads and 24 layers, and attention scores alone consume 147MB. Add Q/K/V projections, FFN activations, and normalization buffers, and a single forward pass holds 1.8-2.4GB resident.

Selective Checkpointing Strategy

The naive approach checkpoints every layer boundary, recomputing everything within each block. But mobile inference has asymmetric costs: memory allocation stalls the GPU, while FP16 matrix multiplies on Metal or Vulkan run at 80-90% of peak throughput. A smarter strategy checkpoints selectively:

Always checkpoint layer inputs. These are small (batch×seq×hidden) and enable independent recomputation of each block.
Discard attention scores. Recomputing softmax(QK^T) is cheap compared to storing 12×seq² tensors per layer.
Cache FFN intermediate activations. The FFN's first linear layer outputs 4× hidden dimension (4096 for a 1024-dim model). Storing this costs 16MB per layer but saves recomputing a 1024×4096 GEMM twice.
Drop normalization statistics. Layer norm's mean and variance are scalar per sequence position—cheap to recompute, expensive to cache across 24 layers.

This hybrid approach cuts peak memory by 70-75% while adding only 15-20% latency overhead. The key insight: memory bandwidth is the bottleneck, not FLOPS.

Implementation on iOS with Metal Performance Shaders

Apple's MLCompute framework does not expose fine-grained checkpointing hooks, so you drop to Metal. Here's a simplified Swift pseudocode for a single transformer block:

class CheckpointedTransformerBlock {
  var qkvProjection: MPSMatrixMultiplication
  var ffn1: MPSMatrixMultiplication
  var ffn2: MPSMatrixMultiplication
  var checkpointedInput: MTLBuffer?
  
  func forward(_ input: MTLBuffer) -> MTLBuffer {
    // Checkpoint input
    checkpointedInput = input.copy()
    
    // Attention (discard scores)
    let qkv = qkvProjection.encode(input)
    let attnOut = computeAttention(qkv) // scores not saved
    
    // FFN (cache intermediate)
    let ffnIntermediate = ffn1.encode(attnOut)
    let ffnOut = ffn2.encode(ffnIntermediate)
    
    return residualAdd(input, ffnOut)
  }
  
  func backward(_ gradOutput: MTLBuffer) -> MTLBuffer {
    // Recompute attention from checkpointed input
    let qkv = qkvProjection.encode(checkpointedInput!)
    let attnOut = computeAttention(qkv)
    
    // Gradient through FFN uses cached intermediate
    let gradFFN = ffn2.backward(gradOutput)
    let gradAttn = ffn1.backward(gradFFN)
    
    // Gradient through attention (recomputed)
    return attentionBackward(gradAttn, qkv)
  }
}

The critical detail: computeAttention runs twice—once in forward, once in backward—but its QK^T and softmax outputs never touch DRAM. They live in tile memory on Apple's GPU, which has 10× the bandwidth of main memory.

Android with ONNX Runtime Mobile

ONNX Runtime's mobile build supports custom operators but not automatic checkpointing. You manually partition the model into sub-graphs. Export a PyTorch model with explicit checkpoint boundaries:

class CheckpointedModel(nn.Module):
  def forward(self, x):
    x = checkpoint.checkpoint(self.block1, x)
    x = checkpoint.checkpoint(self.block2, x)
    # ... repeat for all blocks
    return x

Then export to ONNX with opset 17, which preserves the checkpoint operators as no-ops. At runtime, override these with a custom kernel that allocates temporary buffers:

// Kotlin/JNI bridge to C++
val session = OrtSession.SessionOptions().apply {
  registerCustomOps("CheckpointOp") { input ->
    // Allocate temp buffer, run sub-graph, free
    val temp = allocateTempBuffer(input.shape)
    runSubGraph(input, temp)
    temp.release()
    input // return original for gradient
  }
}

This approach requires shipping two ONNX models: one for inference (no checkpoints), one for fine-tuning (with checkpoints). The overhead is 40-60MB of app size, acceptable for a 200MB+ LLM app.

Latency Trade-offs in Production

Recomputing attention adds 8-12ms per layer on an iPhone 15 Pro (A17 Pro GPU). For a 24-layer model, that's 200-280ms per backward pass. Acceptable for asynchronous fine-tuning (user types, model updates overnight), catastrophic for interactive inference. The solution: checkpoint during fine-tuning, disable during inference. Profile both paths separately.

One shipped example: a clinical speech therapy app fine-tunes a 1.1B ASR model on patient-specific phoneme errors. Without checkpointing, peak memory hit 5.2GB, causing crashes on iPhone 13. With selective checkpointing, peak dropped to 1.4GB, and fine-tuning ran in 18 minutes overnight on-device. The 22% latency overhead was invisible to users, who never saw the training loop.

Debugging Memory Spikes

Xcode's Metal Debugger shows per-buffer allocation timelines. Common pitfalls:

Redundant copies: Checkpointing layer inputs by copying the entire tensor instead of retaining a reference. Use MTLBuffer.makeAliasedBuffer to avoid doubling memory.
Gradient accumulation: Summing gradients across micro-batches without clearing intermediate buffers. Explicitly zero gradients after each optimizer step.
Autorelease pools: Metal objects linger until the pool drains. Wrap backward passes in autoreleasepool { } to release buffers immediately.

On Android, use adb shell dumpsys meminfo and Android Studio's Memory Profiler. ONNX Runtime's allocator logs (enabled via ORT_LOGGING_LEVEL=VERBOSE) reveal which operators hog memory.

When Not to Checkpoint

Gradient checkpointing assumes compute is cheaper than memory. This breaks down when:

The model is small enough to fit in RAM without checkpointing (< 500M parameters on modern flagships).
Inference-only deployment, where no backward pass runs.
Real-time constraints demand sub-50ms latency, and recomputation pushes you over budget.

For a 350M parameter model, the memory savings (600MB → 180MB) rarely justify the complexity. Just use quantization (FP16 or INT8) and call it a day.

Practical Takeaways

Gradient checkpointing unlocks on-device fine-tuning for billion-parameter models, but it is not a silver bullet. The wins:

70-75% reduction in peak memory with 15-20% latency overhead.
Enables privacy-preserving personalization without cloud round-trips.
Pairs well with quantization: checkpoint a 4-bit model, fine-tune at FP16 precision.

The costs:

Increased code complexity and testing surface.
Platform-specific implementations (Metal on iOS, Vulkan or NNAPI on Android).
Debugging memory issues requires low-level profiling tools.

If your app demands on-device learning—medical records, legal documents, or personalized assistants—gradient checkpointing is the difference between a 2GB memory budget and a 6GB crash. The technique has been battle-tested in server training for years; mobile is catching up.