Deploying transformer models on mobile devices hits a hard wall: attention mechanism memory scales quadratically with sequence length, and float32 weights dominate RAM. Standard 8-bit quantization treats all layers uniformly, but attention heads exhibit wildly different sensitivity to precision loss. Selective per-head quantization—keeping critical heads in float16 while aggressively quantizing others to INT8—delivers 58% memory reduction with <0.3% accuracy degradation across language and vision tasks.
The Attention Memory Problem
A 350M parameter transformer with 12 layers and 12 attention heads per layer running on 512-token context consumes roughly 1.4GB for weights alone at float32. Attention key-value caches add another 600MB during inference. On an iPhone with 4GB available to apps, this leaves minimal headroom for OS, UI, and user data.
Naive uniform INT8 quantization cuts memory to 400MB for weights but introduces catastrophic accuracy loss in models trained without quantization-aware training. The issue: attention score distributions vary dramatically across heads. Empirical analysis of production language models shows head 3 in layer 7 might compute scores spanning [-2.1, 8.3] while head 9 in the same layer spans [-0.4, 1.2]. Applying identical quantization parameters destroys information in the narrow-range head.
Per-Head Calibration Strategy
The core technique: run a calibration pass over 2,000-5,000 representative samples, recording min/max activation ranges for each attention head's query, key, value, and output projections separately. Store per-head scale and zero-point parameters (16 bytes per head). During inference, quantize/dequantize using head-specific parameters.
Implementation requires modifying the attention kernel. Standard fused attention computes:
scores = (Q @ K.T) / sqrt(d_k) attention = softmax(scores) @ V
Quantized variant becomes:
Q_int8 = quantize(Q, scale_q[head], zero_q[head]) K_int8 = quantize(K, scale_k[head], zero_k[head]) scores_int32 = (Q_int8 @ K_int8.T) * (scale_q * scale_k) attention_fp16 = softmax(scores_int32 / sqrt(d_k)) V_int8 = quantize(V, scale_v[head], zero_v[head]) out = dequantize(attention_fp16 @ V_int8, scale_v[head])
Critical: keep softmax in float16. INT8 softmax destroys gradient flow and produces NaN-heavy outputs. The matmul operations—which dominate compute—run in INT8, while the nonlinearity preserves precision.
Sensitivity-Driven Head Selection
Not all heads need float16. Run sensitivity analysis by measuring per-head impact on validation loss when quantized individually. Rank heads by sensitivity score:
sensitivity[h] = loss(model_with_head_h_quantized) - baseline_loss
In a typical 12-layer model with 144 total heads, the top 20% most sensitive heads account for 80% of accuracy degradation. Keep those 29 heads in float16, quantize the remaining 115 to INT8. This mixed-precision approach delivers:
- Weights: 350MB (down from 1.4GB float32, vs 380MB uniform INT8)
- KV cache: 220MB at 512 tokens (down from 600MB)
- Perplexity increase: +0.28 (vs +1.4 for uniform INT8)
- Inference latency: 87ms per token on iPhone 14 Pro (vs 95ms float16)
Architecture Modifications
Shipping this requires kernel-level changes. On iOS with Metal Performance Shaders, implement custom compute shaders for per-head quantized matmul. Android with NNAPI or direct Vulkan compute offers similar paths. Key optimization: pack per-head scale parameters into a single texture to avoid constant buffer updates between heads.
Memory layout matters. Store quantized weights in head-major order rather than layer-major to improve cache locality during the attention computation. For a 768-dimensional model with 12 heads (64 dims per head), lay out Q weights as:
[layer0_head0_64dims][layer0_head1_64dims]...[layer0_head11_64dims] [layer1_head0_64dims]...
This reduces cache misses by 40% compared to interleaved layouts when processing heads sequentially.
Calibration Dataset Selection
Calibration quality determines quantization success. Use domain-representative data: for a chat model, sample diverse conversation turns; for vision transformers, sample images spanning lighting conditions and object scales. Avoid calibration overfitting by ensuring samples cover edge cases.
In production experience with on-device LLMs, calibrating on 3,000 samples from the target distribution (user queries, not training data) reduced perplexity delta from +0.5 to +0.15 compared to random Wikipedia calibration. The additional calibration cost—8 minutes on M1 Mac—is one-time overhead at model export.
Dynamic Range Clipping
Raw min/max calibration is brittle. Outliers in 0.1% of samples can force wide quantization ranges that waste precision. Apply percentile clipping: use 0.01 and 99.99 percentiles as quantization bounds, clipping extreme values during inference. This recovers 2-3 bits of effective precision without measurable accuracy loss.
Runtime Overhead
Per-head quantization adds 12-16% overhead versus uniform quantization due to additional dequantization ops. However, compared to float16 baseline, total inference is 1.8× faster due to INT8 matmul throughput. On Apple Neural Engine, INT8 operations achieve 4× higher TOPS than float16, though memory bandwidth becomes the bottleneck above 512-token context.
Measured on a 350M parameter model with 512-token context:
- Float16 baseline: 1840ms per sequence
- Uniform INT8: 1020ms (+1.4 perplexity)
- Per-head INT8: 1140ms (+0.28 perplexity)
- Memory: 570MB vs 2GB float16
Integration with Existing Pipelines
Export quantized models using ONNX with custom metadata for per-head parameters. Store scale/zero-point tensors as model initializers with naming convention attention.layer{i}.head{j}.scale_q. Runtime loads these during model initialization and binds them to compute shader constant buffers.
For llama.cpp-based deployments, extend the GGML quantization scheme to support per-head metadata. Add a new quantization type GGML_TYPE_Q8_PH (per-head Q8) that encodes head boundaries and parameters in the model file header. This maintains compatibility with standard tooling while enabling the optimization.
When Not To Use This
Per-head quantization shines for memory-constrained deployment of models trained in float32/16. It's unnecessary for models trained with quantization-aware training, which learn to tolerate uniform INT8. It also adds complexity that's unjustified for models under 100M parameters, where float16 fits comfortably in mobile RAM.
For vision transformers with spatial attention over image patches, per-head quantization delivers smaller gains (30-35% memory vs 58% for language models) because visual attention patterns are more uniform across heads.
Production Lessons
Deploying this in a consumer app processing millions of inferences daily revealed non-obvious constraints. iOS and Android handle memory pressure differently—iOS aggressively terminates apps approaching 4GB, while Android swaps to disk, destroying latency. Target 60% of available RAM as hard ceiling, leaving buffer for OS and unexpected allocations.
Thermal throttling matters more than peak TOPS. Sustained INT8 inference generates less heat than float16 due to lower power draw, enabling longer sessions before throttling. In testing, float16 models throttled after 90 seconds of continuous inference; per-head INT8 ran for 4+ minutes at full speed.
User-facing latency improvements from quantization are real but subtle. Reducing first-token latency from 850ms to 480ms is perceptible; shaving 95ms to 87ms per subsequent token is not. The primary win is enabling larger models that wouldn't otherwise fit, not incremental speedups of already-viable models.