Cascaded Quantization: 8→4→2-bit LLM Inference

Most mobile LLM deployments choose a single quantization level—INT8, INT4, or occasionally INT2—and run the entire model at that precision. This works, but leaves performance on the table. Cascaded quantization applies different bit-depths to different stages of inference: high precision where it matters, aggressive compression where it doesn't. The result: 3× token throughput on mid-range mobile GPUs with negligible quality loss compared to uniform INT4.

The Uniform Quantization Ceiling

When you quantize a 7B-parameter model to INT4, you're making a blanket trade: 4× smaller weights, roughly 3× faster matmuls on mobile Metal or Vulkan backends, but ~0.5–1.5 perplexity increase depending on calibration. Push to INT2 and you hit a quality cliff—most tasks become unusable. The assumption: every layer needs the same precision.

Reality is messier. Early transformer layers (closer to embeddings) tolerate lower precision better than late layers (near the output head). Attention weights are more sensitive than feed-forward weights. And within a single forward pass, the KV cache read dominates memory bandwidth while matmul compute dominates ALU time—two bottlenecks that respond to quantization differently.

Three-Stage Cascade Architecture

Cascaded quantization splits inference into stages with decreasing bit-depth:

Stage 1: Embedding + Early Layers (INT8)

The first 25% of transformer blocks run at INT8. Embedding lookups are cheap, but the representations here set the foundation for downstream layers. An iPhone 15 Pro can execute ~12 TOPS INT8 on the Neural Engine; this stage consumes ~40ms for a 256-token context in a 3B model.

Stage 2: Middle Layers (INT4)

Layers 6–18 (in a 24-layer model) drop to INT4. These blocks contribute most of the compute—roughly 60% of total FLOPs—but tolerate aggressive quantization because errors average out across many layers. On Metal, INT4 matmuls hit ~8 TOPS on A17 Pro's GPU, 2.5× faster than INT8 for the same layer width.

Stage 3: Output Layers + Head (INT2)

The final 15% of layers—and critically, the KV cache—use INT2. By this point, the model's representation is nearly converged; the last layers perform refinement, not heavy lifting. The output projection (vocab_size × hidden_dim) is memory-bound, so halving cache bandwidth from INT4 to INT2 yields a 1.8× speedup even though compute precision drops.

Crucially, the attention scores stay in FP16. Quantizing Q·KT below 8 bits collapses softmax distributions; this is the one operation where precision is non-negotiable.

Implementation: Dynamic Weight Loading

You can't just compile three model variants. Instead, store weights in a hybrid format:

struct CascadedWeights {
  int8_t* early_layers;   // 0–5
  int4_t* mid_layers;     // 6–18 (packed)
  int2_t* late_layers;    // 19–23 (packed)
  float16_t* attn_scores; // all layers
}

At runtime, the inference loop switches kernels per stage. On iOS, this means three Metal compute pipelines with different precision annotations. On Android, three separate NNAPI subgraphs or Vulkan descriptor sets. The switching overhead is ~0.3ms per stage boundary—negligible compared to 40–80ms per full forward pass.

KV cache quantization happens after attention computation. Write keys/values at INT2 precision, dequantize on read. For a 2048-token context, this cuts cache size from 16MB (FP16) to 2MB (INT2), enabling 8× longer contexts before swapping to DRAM.

Calibration: Per-Stage Ranges

Uniform quantization uses a single scale factor per tensor. Cascaded quantization needs per-stage ranges. During calibration (offline, on representative prompts):

Run the model in FP16, record activation ranges for each layer.
Cluster layers by sensitivity: measure perplexity delta when quantizing each layer independently.
Assign bit-depths: low-sensitivity clusters → INT2, high-sensitivity → INT8.
Compute asymmetric min/max scales per stage, store in model metadata.

For a 3B Llama-style model, calibration takes ~20 minutes on a single GPU with 5,000 prompts from WikiText-103. The resulting scale tensors add