Differential Privacy in On-Device LLMs

The Privacy Problem in Personalized AI

On-device LLMs promise user privacy by keeping inference local. But personalization—fine-tuning on user data to improve suggestions, autocomplete, or domain-specific tasks—reintroduces the core privacy challenge: how do you update model weights without leaking sensitive information? Federated learning mitigates this at the server aggregation layer, but what about the local update step itself?

When building OfflineAI, a privacy-first mobile LLM runtime, we faced a concrete dilemma: users wanted the model to learn writing style and domain vocabulary from their documents, but logging raw gradients or attention weights to disk created a forensic trail. A subpoena or device compromise could reconstruct training samples. Local differential privacy (LDP) offers a mathematical guarantee: even with full access to the updated model, an adversary learns bounded information about any individual training example.

Differential Privacy Fundamentals

Differential privacy quantifies privacy loss with epsilon (ε). An algorithm is ε-differentially private if changing one record in the dataset changes the probability of any output by at most a multiplicative factor of e^ε. Lower ε means stronger privacy. For LLMs, this translates to: can an attacker distinguish whether a specific sentence was in the fine-tuning set?
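
For reference, the guarantee can be stated formally. This is the standard textbook formulation rather than anything specific to OfflineAI; the symbols M (the randomized training procedure), D and D′ (datasets differing in one record), and S (any set of possible outputs) are notation introduced here for illustration.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
\qquad \text{for all neighboring } D, D' \text{ and all output sets } S
```

Setting δ = 0 gives the pure ε-DP guarantee described above. The δ that appears in later sections is, loosely, a small slack probability with which the e^ε bound is allowed to fail.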

Two mechanisms dominate: Gaussian noise addition (for real-valued outputs like gradients) and randomized response (for discrete choices). The challenge in mobile LLMs is balancing noise magnitude with model utility. Too much noise and the fine-tuning is worthless; too little and privacy degrades.

Privacy Budget Allocation

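In a multi-epoch fine-tuning scenario, privacy loss compounds across every noisy update. If you run 10 epochs with ε=0.5 each, total privacy loss is ε_total = 5.0 under basic composition. Advanced composition scales as roughly √k · ε instead of k · ε, but the bound also carries a factor of √(2 ln(1/δ′)) and a second-order term, so the idealized √10 · 0.5 ≈ 1.58 describes the scaling rather than the actual bound; the savings only materialize with many steps at a small per-step ε. For mobile apps, we typically target ε_total < 2.0 to maintain strong guarantees, which constrains iteration count and noise scale.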
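
The composition arithmetic is easy to check numerically. The sketch below is illustrative only: the function names and the `delta_slack` parameter are assumptions made for this example, and the advanced bound is the textbook advanced composition theorem, not necessarily the accountant a production runtime would use.

```python
import math


def basic_composition(eps_step: float, k: int) -> float:
    """Total epsilon after k steps under basic (sequential) composition."""
    return k * eps_step


def advanced_composition(eps_step: float, k: int, delta_slack: float) -> float:
    """Total epsilon after k steps under the advanced composition theorem.

    The composed mechanism is (eps_total, delta_slack)-DP, where
    eps_total = eps * sqrt(2 k ln(1/delta_slack)) + k * eps * (e^eps - 1).
    """
    first_order = eps_step * math.sqrt(2 * k * math.log(1 / delta_slack))
    second_order = k * eps_step * (math.exp(eps_step) - 1)
    return first_order + second_order


def best_bound(eps_step: float, k: int, delta_slack: float) -> float:
    """Both bounds are valid, so the smaller one can be reported."""
    return min(basic_composition(eps_step, k),
               advanced_composition(eps_step, k, delta_slack))


if __name__ == "__main__":
    # Few large steps: the basic bound wins.
    print(basic_composition(0.5, 10), advanced_composition(0.5, 10, 1e-5))
    # Many small steps: the advanced bound wins.
    print(basic_composition(0.01, 1000), advanced_composition(0.01, 1000, 1e-5))
```

With 10 steps at ε=0.5 and a slack of 10^-5, the advanced bound (about 10.8) is looser than the basic bound of 5.0. With 1000 steps at ε=0.01 it drops to about 1.62 against a basic bound of 10, which is where the √k scaling pays off.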

Gradient Perturbation for LoRA Updates

Most mobile LLM fine-tuning uses Low-Rank Adaptation (LoRA): freeze the base model and train small adapter matrices. This reduces memory and compute, but gradients still leak information. We apply DP at the gradient level before weight updates.

The algorithm: for each mini-batch, compute per-example gradients, clip each one to bound sensitivity (typically L2 norm ≤ C = 1.0), then add Gaussian noise with standard deviation σ = C · √(2 ln(1.25/δ)) / ε. This is the classical Gaussian-mechanism calibration, which is proven for ε < 1 and gives the per-step (ε, δ) guarantee. On a Snapdragon 8 Gen 2, clipping 4096-dimensional LoRA gradients adds ~2ms per batch; noise generation (using ARM NEON-accelerated Box-Muller) adds another 1ms.
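
As a quick sanity check on the noise scale, the formula above can be evaluated directly. This is a minimal sketch; `gaussian_sigma` is a hypothetical helper name, and the calibration is the classical one, which is only proven for ε below 1.

```python
import math


def gaussian_sigma(clip_norm: float, epsilon: float, delta: float) -> float:
    """Noise standard deviation for the classical Gaussian mechanism.

    sigma = C * sqrt(2 * ln(1.25 / delta)) / epsilon, where C is the
    L2 clipping norm (the sensitivity of one clipped gradient).
    """
    if epsilon <= 0 or not 0 < delta < 1:
        raise ValueError("epsilon must be positive and delta must lie in (0, 1)")
    return clip_norm * math.sqrt(2 * math.log(1.25 / delta)) / epsilon


if __name__ == "__main__":
    # C = 1.0, per-step epsilon = 0.5, delta = 1e-5
    print(round(gaussian_sigma(1.0, 0.5, 1e-5), 2))
```

For C = 1.0, ε = 0.5 and δ = 10^-5 this gives σ ≈ 9.69, which is large relative to a clipped gradient of norm 1 and illustrates why noise scale dominates the utility trade-off.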

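// Pseudocode: DP-SGD step
clipped = []
for each example in batch:
  grad = compute_gradient(example)         // per-example gradient
  grad = clip_l2_norm(grad, max_norm=1.0)  // bound each example's contribution to C = 1.0
  clipped.append(grad)

grad_sum = sum(clipped)
noise = gaussian(mean=0, std=sigma, shape=grad_sum.shape)  // sigma is calibrated to C
noisy_grad = (grad_sum + noise) / len(batch)  // add noise to the sum, then average
update_weights(noisy_grad)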
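
A runnable version of this step might look like the following. It is a sketch under stated assumptions: gradients are plain Python lists, `dp_sgd_step` and its parameters are hypothetical names, and noise is added to the sum of clipped gradients before averaging (the usual DP-SGD formulation, so that `sigma` matches a sensitivity of `clip_norm`). A real runtime would use vectorized native code instead.

```python
import math
import random


def clip_l2_norm(grad, max_norm):
    """Scale grad down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, sigma, rng):
    """One DP-SGD update: clip per example, sum, add noise, average."""
    clipped = [clip_l2_norm(g, clip_norm) for g in per_example_grads]
    batch_size = len(clipped)
    new_weights = []
    for i, w in enumerate(weights):
        grad_sum = sum(g[i] for g in clipped)
        # Independent Gaussian noise per coordinate, added to the sum.
        noisy_avg = (grad_sum + rng.gauss(0.0, sigma)) / batch_size
        new_weights.append(w - lr * noisy_avg)
    return new_weights


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # first gradient has norm 5 and gets clipped
    print(dp_sgd_step([0.0, 0.0], grads, lr=0.1, clip_norm=1.0, sigma=9.69, rng=rng))
```

Passing `sigma=0.0` disables the noise, which is convenient for checking that clipping and averaging behave as expected before enabling privacy.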

Key insight: per-example gradient computation is expensive because it defeats batched backpropagation. We accumulate gradients across the batch but still clip each example individually, then add noise once to the aggregated result. This trades some privacy (the guarantee becomes batch-level DP) for a roughly 10× speedup, which is acceptable when the batch size is small (≤16).

Vocabulary Extension with Randomized Response

A second use case: expanding the tokenizer vocabulary with user-specific terms (medical jargon, product names, slang). Naively logging new tokens leaks information. Randomized response provides plausible deniability.

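For each candidate token, flip a biased coin: with probability p = e^ε / (e^ε + 1), report the true token; otherwise, report a random token from a fixed dictionary. For ε=1.0, p≈0.73. This p is the binary randomized-response probability; when the alternative is drawn from a dictionary of k tokens, the standard k-ary variant reports the true token with probability e^ε / (e^ε + k − 1) to keep the ε guarantee. At p≈0.73, ~27% of reported tokens are noise, but statistical aggregation over many users (or many sessions) recovers the true distribution.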
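
The report-and-aggregate loop can be sketched as follows. This example uses k-ary randomized response, the standard generalization of the binary coin flip (with k = 2 the keep probability reduces to e^ε / (e^ε + 1)). The function names are hypothetical, and the sketch assumes every true token is already a member of the fixed dictionary.

```python
import math
import random
from collections import Counter


def keep_probability(epsilon, k):
    """Probability of reporting the true token among k dictionary entries."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomize(token, dictionary, epsilon, rng):
    """Report the true token with probability p, else a different one uniformly."""
    if rng.random() < keep_probability(epsilon, len(dictionary)):
        return token
    others = [t for t in dictionary if t != token]
    return rng.choice(others)


def estimate_counts(reports, dictionary, epsilon):
    """Unbiased estimate of the true token counts from noisy reports."""
    n, k = len(reports), len(dictionary)
    p = keep_probability(epsilon, k)
    q = (1 - p) / (k - 1)  # probability of reporting one specific wrong token
    observed = Counter(reports)
    return {t: (observed[t] - n * q) / (p - q) for t in dictionary}


if __name__ == "__main__":
    rng = random.Random(42)
    dictionary = ["stat", "prn", "bid", "qhs"]
    truth = ["stat"] * 12000 + ["prn"] * 8000
    reports = [randomize(t, dictionary, 1.0, rng) for t in truth]
    print(estimate_counts(reports, dictionary, 1.0))
```

Individual reports stay deniable, while the debiased counts converge on the true distribution as the number of reports grows. The estimates can be negative for rare tokens, so downstream code would typically clamp them at zero.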

In practice, we maintain a local Bloom filter of candidate tokens, apply randomized response at insertion time, then periodically sync the noisy filter to a server (if federated learning is enabled). The Bloom filter's false positive rate adds another layer of obfuscation.

Inference-Time Privacy: KV Cache Scrubbing

Fine-tuning isn't the only risk. The key-value cache in autoregressive generation stores activations from previous tokens, which can leak prompt content if persisted to disk (for multi-turn conversations). We implement ephemeral KV caching: allocate the cache in locked, non-swappable memory (using mlock, which is available on both iOS and Android, subject to the per-process locked-memory limit), and zero it explicitly on app backgrounding.
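
The lifecycle of such a cache can be illustrated with a small sketch. Python stands in here for native runtime code, and everything in it is an assumption made for illustration: the `EphemeralKVCache` class, its method names, and the best-effort `mlock` call through `ctypes`, which silently falls back to an unlocked buffer when locking is unavailable or denied.

```python
import ctypes


class EphemeralKVCache:
    """Fixed-size byte buffer that is locked in RAM if possible and zeroed on demand."""

    def __init__(self, num_bytes):
        self.size = num_bytes
        self._buf = ctypes.create_string_buffer(num_bytes)
        self.locked = self._try_mlock()

    def _try_mlock(self):
        # Best effort: mlock can fail under a low locked-memory limit.
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            rc = libc.mlock(ctypes.c_void_p(ctypes.addressof(self._buf)),
                            ctypes.c_size_t(self.size))
            return rc == 0
        except (OSError, AttributeError, TypeError):
            return False

    def write(self, offset, data):
        self._buf[offset:offset + len(data)] = data

    def read(self, offset, length):
        return self._buf.raw[offset:offset + length]

    def scrub(self):
        """Overwrite the whole buffer with zeros, e.g. when the app is backgrounded."""
        ctypes.memset(ctypes.addressof(self._buf), 0, self.size)


if __name__ == "__main__":
    cache = EphemeralKVCache(4096)
    cache.write(0, b"prompt activations")
    cache.scrub()
    print(cache.locked, cache.read(0, 18) == bytes(18))
```

In a native implementation the scrub call would be wired to the platform's backgrounding callback, and the zeroing routine would need to be one the compiler cannot optimize away.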

For long-context scenarios (8K+ tokens), this conflicts with warm-start optimization. Our compromise: persist only the first N tokens (system prompt + few-shot examples) in encrypted storage, and regenerate user-specific KV entries on each session. Adds ~300ms cold-start latency but eliminates the largest privacy surface.

Measuring Utility Degradation

Privacy isn't free. We benchmarked LoRA fine-tuning on a 1.3B parameter model (quantized to 4-bit) with 500 synthetic medical notes. Baseline (no DP): perplexity 12.4, F1 on entity extraction 0.87. With ε=2.0, δ=10^-5: perplexity 13.1, F1 0.84. With ε=0.5: perplexity 15.8, F1 0.78. The sweet spot for healthcare apps is ε≈1.0, where utility loss is 10 occurrences) was 94%, but rare terms (