Why Uniform Quantization Breaks Down at 4-bit

When you quantize a 7B-parameter LLM uniformly to 4-bit integers, you compress the model from 14GB (FP16) to 3.5GB, a tempting 4× reduction. But model quality degrades catastrophically: GPT-style models lose 8-12 points on common benchmarks, and instruction-following collapses. The problem is activation outliers. While weight distributions are roughly Gaussian and amenable to symmetric quantization, activations exhibit heavy tails with rare but critical outliers that dominate the dynamic range.

Clipping these outliers destroys information; preserving them wastes quantization bins on values that occur in fewer than 0.01% of tokens. Uniform 4-bit quantization forces a Faustian bargain: either lose critical signal or waste representational capacity.
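To make the outlier problem concrete, here is a minimal sketch in plain Python. The activation values are invented for illustration (999 small values plus one large outlier), and the helper names are hypothetical. It shows how a single outlier stretches the 4-bit grid so far that every ordinary value rounds to zero.

```python
def quantize_uniform(values, bits):
    """Symmetric uniform quantization; returns the dequantized values."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Hypothetical activations: small values in [-1, 1] plus one outlier.
small = [((i % 21) - 10) / 10.0 for i in range(999)]
with_outlier = small + [60.0]

err_clean_4bit = mean_abs_error(small, quantize_uniform(small, 4))
err_outlier_4bit = mean_abs_error(with_outlier, quantize_uniform(with_outlier, 4))
err_outlier_8bit = mean_abs_error(with_outlier, quantize_uniform(with_outlier, 8))

print(f"4-bit, no outlier:   {err_clean_4bit:.4f}")
print(f"4-bit, with outlier: {err_outlier_4bit:.4f}")
print(f"8-bit, with outlier: {err_outlier_8bit:.4f}")
```

With the outlier present, the 4-bit step size grows to 60/7, so the small values lose all resolution; the 8-bit grid keeps some of it.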

Hybrid Quantization: Asymmetric Precision by Tensor Role

The solution is role-based precision assignment. Weights, which are static and known at quantization time, tolerate aggressive 4-bit per-channel quantization with minimal accuracy loss. Activations, which vary wildly across inputs and exhibit outliers, require 8-bit dynamic quantization to preserve fidelity. This asymmetry yields a Pareto improvement: memory footprint drops 3.2× (not 4×, due to activation overhead), but perplexity degradation stays under 2 points—acceptable for production.

In a shipping on-device LLM product, this translates to fitting a 7B model in 4.3GB of RAM (weights + KV cache + activation buffers) instead of 14GB, enabling deployment on devices with 6GB total memory after OS and app overhead. The tradeoff is compute: 4-bit×8-bit matrix multiplication requires custom kernels, since standard BLAS libraries assume uniform precision.

Per-Channel Weight Quantization

For weights, we apply per-output-channel symmetric quantization. Each row of a weight matrix W (corresponding to one output neuron) gets its own scale factor s = max(abs(W_row)) / 7. The 4-bit range is restricted to [-7, 7], leaving the int4 code -8 unused, so that the grid is symmetric around zero and an exactly zero weight stays exactly zero, which keeps sparsity exploitable. Quantized weights are W_q = round(W / s), stored as int4. At inference, we dequantize on-the-fly: W_reconstructed = W_q × s.
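The formulas above can be sketched in a few lines of plain Python. This is an illustration, not a production implementation: the function names and the example matrix are made up, and the guard for all-zero rows is an added assumption.

```python
def quantize_per_channel(weights):
    """weights: list of rows, one row per output channel.
    Returns (q_rows, scales) with every quantized value in [-7, 7]."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # Assumption: fall back to scale 1.0 for an all-zero row.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        q_rows.append([max(-7, min(7, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_per_channel(q_rows, scales):
    """W_reconstructed = W_q * s, applied row by row."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

# Two hypothetical output channels whose magnitudes differ by ~35x.
W = [[0.70, -0.35, 0.10, 0.00],
     [0.02, -0.01, 0.007, 0.014]]
W_q, s = quantize_per_channel(W)
W_hat = dequantize_per_channel(W_q, s)
```

Because each row has its own scale, the largest weight in every row maps to 7 and the reconstruction error per element is bounded by half that row's scale.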

Per-channel granularity is critical. Per-tensor quantization (one scale for the entire matrix) loses 3-4 perplexity points because it cannot adapt to the 10-100× variance in weight magnitudes across output channels, common in attention projection layers. Per-channel adds negligible overhead: 4096 float32 scales for a 4096×4096 matrix is 16KB, amortized across 8MB of quantized weights.
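Both claims in this paragraph can be checked with a short sketch. The overhead arithmetic uses the 4096×4096 figures from the text; the two example channels are invented to show a 100× magnitude gap, and the helper names are hypothetical.

```python
def scale_overhead(rows, cols, scale_bytes=4, weight_bits=4):
    """Bytes for packed weights, bytes for per-row scales, and their ratio."""
    weight_bytes = rows * cols * weight_bits // 8
    scales_bytes = rows * scale_bytes
    return weight_bytes, scales_bytes, scales_bytes / weight_bytes

def quant_error(row, scale):
    """Mean absolute error after quantizing `row` to [-7, 7] with `scale`."""
    restored = [max(-7, min(7, round(w / scale))) * scale for w in row]
    return sum(abs(w - r) for w, r in zip(row, restored)) / len(row)

weight_bytes, scales_bytes, ratio = scale_overhead(4096, 4096)

# Hypothetical channels whose magnitudes differ by 100x.
big = [1.0, -0.5, 0.25, 0.75]
small = [0.01, -0.005, 0.0025, 0.0075]

per_tensor_scale = max(abs(w) for w in big + small) / 7
per_channel_scale = max(abs(w) for w in small) / 7

err_tensor = quant_error(small, per_tensor_scale)    # small channel collapses to 0
err_channel = quant_error(small, per_channel_scale)  # small channel keeps its shape
```

With a single per-tensor scale, every weight in the small channel rounds to zero; with its own scale, the same channel uses the full [-7, 7] grid.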

Dynamic 8-bit Activation Quantization

Activations are quantized dynamically per-token using asymmetric min-max scaling. For each token's activation vector A, we compute min_A and max_A, then map the range [min_A, max_A] to [0, 255]. The quantization formula is A_q = round((A - min_A) / ((max_A - min_A) / 255)). This preserves outliers without clipping, at the cost of computing and storing two extra scalars (the minimum and the scale) per token.
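A minimal sketch of this min-max mapping for a single token, in plain Python. The function names and the example vector are illustrative, and the guard for a constant vector is an added assumption.

```python
def quantize_activations(row):
    """Asymmetric min-max quantization of one token's activations to [0, 255].
    Returns (q, scale, lo) so the caller can dequantize later."""
    lo, hi = min(row), max(row)
    # Assumption: use scale 1.0 when all values are equal to avoid dividing by zero.
    scale = (hi - lo) / 255 if hi > lo else 1.0
    q = [min(255, max(0, round((a - lo) / scale))) for a in row]
    return q, scale, lo

def dequantize_activations(q, scale, lo):
    return [v * scale + lo for v in q]

# Hypothetical activation vector with one outlier (12.5).
token = [-0.8, 0.1, 0.3, 12.5, -0.2]
q, scale, lo = quantize_activations(token)
restored = dequantize_activations(q, scale, lo)
```

The outlier survives exactly (it maps to 255), while the remaining values share the rest of the 256-level grid.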

We quantize immediately after each activation function (GELU, SiLU) and keep the tensor in int8 until the next matmul consumes it; dequantization is folded into that matmul's output stage, as described in the kernel section below. This keeps activation tensors in int8 during transit between layers, cutting memory bandwidth by 4× relative to float32 and enabling faster cache line fills on ARM CPUs. The dequantization overhead, two FMAs per element, is negligible compared to matmul cost, adding less than 3% latency in practice.

Kernel Engineering: 4-bit×8-bit Matmul

Standard GEMM libraries such as Accelerate (iOS) and Eigen (widely used on Android) assume uniform precision. A 4-bit×8-bit multiplication requires custom kernels that unpack 4-bit weights, multiply with 8-bit activations, accumulate in int32, then dequantize to float32. The critical optimization is SIMD-parallel unpacking.

On ARM NEON, we load 16 bytes of packed 4-bit weights (32 values) into a 128-bit register and unpack them into two 128-bit registers of 8-bit values (16 values each) using mask, shift, and vzip instructions. Each unpacked register is then split into 64-bit halves, because vmull_s8 takes eight 8-bit lanes per operand and produces eight 16-bit products. Accumulation happens in int32 to avoid overflow, then we apply the combined weight-activation scale factor and convert to float32. This achieves 85% of the throughput of uniform 8-bit×8-bit matmul, which is acceptable given the 3.2× memory savings.
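The arithmetic such a kernel has to reproduce can be written as a scalar reference in plain Python, which is useful for validating a SIMD implementation. This is a sketch under stated assumptions: nibbles are stored in two's complement with the low nibble first, activations use the asymmetric form a ≈ act_scale × q + act_min, and all function names are hypothetical.

```python
def pack_int4(values):
    """Pack an even-length list of ints in [-8, 7] into bytes, low nibble first."""
    assert len(values) % 2 == 0
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))

def unpack_int4(packed):
    """Inverse of pack_int4: sign-extend each nibble back to a Python int."""
    out = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            out.append(nibble - 16 if nibble >= 8 else nibble)
    return out

def matvec_w4a8(packed_rows, w_scales, act_q, act_scale, act_min):
    """y[j] = sum_i W[j][i] * a[i], using integer accumulation.
    W[j][i] ~= w_scales[j] * w_q and a[i] ~= act_scale * act_q[i] + act_min."""
    result = []
    for packed, w_scale in zip(packed_rows, w_scales):
        w_q = unpack_int4(packed)
        acc = sum(w * a for w, a in zip(w_q, act_q))  # int32 in a real kernel
        w_sum = sum(w_q)                              # can be precomputed offline
        # Dequantize once per output: the act_min term corrects for the offset.
        result.append(w_scale * (act_scale * acc + act_min * w_sum))
    return result

# Hypothetical 2x4 weight matrix and one quantized activation vector.
rows_q = [[7, -3, 1, 0], [-7, 2, 5, -1]]
packed_rows = [pack_int4(r) for r in rows_q]
w_scales = [0.1, 0.01]
act_q = [0, 255, 128, 64]
act_scale, act_min = 0.05, -1.0
y = matvec_w4a8(packed_rows, w_scales, act_q, act_scale, act_min)
```

Note that the expensive inner loop touches only integers; the floating-point work is a fixed handful of operations per output element.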

A practical gotcha: iOS disallows runtime code generation, so we must template-specialize kernels for common matrix shapes (powers of 2, multiples of 64) at compile time. This bloats binary size by 400KB but avoids the 2-3× slowdown of a fully dynamic unpacking loop.

Memory Layout and Cache Efficiency

We store each output channel's weights contiguously, with 4-bit values packed into bytes (two weights per byte). For W as defined above (one row per output channel), this is row-major storage of W, or equivalently column-major storage of the transposed matrix that the matmul reads. The layout matches the kernel's access pattern and ensures each 64-byte cache line (the size on many ARM cores) contains 128 contiguous weight values for one output channel, minimizing cache misses. Activations are row-major (batch-first) to match typical tensor layouts in ML frameworks.
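The addressing implied by this layout is simple enough to write down. The sketch below assumes each output channel's weights are contiguous, two per byte, with a 64-byte cache line as stated in the text; the function names are hypothetical.

```python
CACHE_LINE_BYTES = 64          # cache line size taken from the text
WEIGHTS_PER_BYTE = 2           # two 4-bit weights per byte

def packed_offset(channel, index, in_features):
    """Return (byte_offset, nibble) of weight `index` in output `channel`.
    nibble is 0 for the low half of the byte and 1 for the high half."""
    flat = channel * in_features + index
    return flat // WEIGHTS_PER_BYTE, flat % WEIGHTS_PER_BYTE

def cache_line_of(channel, index, in_features):
    byte_offset, _ = packed_offset(channel, index, in_features)
    return byte_offset // CACHE_LINE_BYTES

weights_per_line = CACHE_LINE_BYTES * WEIGHTS_PER_BYTE          # 128
lines_per_channel = 4096 // weights_per_line                     # for 4096 inputs
```

For a channel with 4096 inputs, the kernel streams through 32 consecutive cache lines, which is the sequential pattern hardware prefetchers handle well.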

For KV cache in attention layers, we keep keys and values in 8-bit to avoid accuracy loss in dot-product scoring. A 7B model with 32 layers and 2048-token context uses 2048×32×2×4096×1 byte = 512MB for KV cache, far less than the roughly 3.5GB taken by the 4-bit weights. Quantizing KV to 4-bit saves another 256MB but degrades attention by 1.5 perplexity points, which is not worth it.
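The KV cache arithmetic generalizes to other context lengths and precisions. A small helper, with a hypothetical name and the hidden size of 4096 taken from the calculation in the text:

```python
def kv_cache_bytes(context_tokens, layers, hidden_size, bits_per_value):
    """Keys and values (factor 2) for every layer and token position."""
    return context_tokens * layers * 2 * hidden_size * bits_per_value // 8

MIB = 1024 * 1024
kv_8bit = kv_cache_bytes(2048, 32, 4096, 8)
kv_4bit = kv_cache_bytes(2048, 32, 4096, 4)

print(kv_8bit // MIB, "MiB at 8-bit")
print(kv_4bit // MIB, "MiB at 4-bit")
print((kv_8bit - kv_4bit) // MIB, "MiB saved by going to 4-bit")
```

The cache grows linearly with context length, so doubling the context to 4096 tokens doubles the 8-bit figure to 1GB.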

Calibration: Choosing Quantization Parameters

Hybrid quantization requires a calibration pass over representative data to compute weight scales and measure activation ranges. We run 512 samples from the training distribution through the unquantized model, collecting min/max statistics for each activation tensor. Weight scales are deterministic (per-channel max), but activation ranges vary by input.

A key decision: cap each activation range at the 99th percentile of the calibration statistics instead of the absolute max, so that rare outliers do not skew the scale. This clips 1% of values, a deliberate departure from pure min-max scaling, but prevents quantization bins from being wasted on a single anomalous token. In practice, this recovers 0.5 perplexity points compared to absolute max scaling.
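One way to compute such a bound is a nearest-rank percentile over the magnitudes seen during calibration. The sketch below is an assumption-laden simplification: it uses a single symmetric bound on absolute values rather than separate min and max bounds, the calibration data is synthetic, and the function names are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def calibrate_clip_bound(samples, pct=99.0):
    """Bound on |activation| taken over all calibration samples."""
    magnitudes = [abs(v) for sample in samples for v in sample]
    return percentile(magnitudes, pct)

def clip(row, bound):
    return [max(-bound, min(bound, v)) for v in row]

# Synthetic calibration set: 990 ordinary values and 10 large outliers.
ordinary = [i / 1000 for i in range(1, 991)]
outliers = [50.0] * 10
bound = calibrate_clip_bound([ordinary, outliers])
clipped = clip([0.2, -0.5, 50.0], bound)
```

Here the bound lands just under 1.0 instead of 50.0, so the quantization grid is spent on the values that actually occur.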

Calibration data must match the deployment distribution. For a medical chatbot, we use clinical notes and patient queries; for a coding assistant, GitHub snippets. Mismatch causes silent accuracy loss—a model calibrated on Wikipedia but deployed on legal contracts will exhibit 3-4 point perplexity degradation because legal text has different activation statistics (longer sentences, rare terminology).

Production Tradeoffs and When to Avoid Hybrid Quantization

Hybrid quantization is not a silver bullet. The 3.2× memory savings come with costs: 12-15% slower inference due to custom kernels, 400KB larger binary, and fragile calibration requirements. For models under 3B parameters, uniform 8-bit quantization is simpler and fast enough. For extremely memory-constrained devices (sub-4GB RAM), consider 4-bit weights with 16-bit activations instead: this loses 1 perplexity point and avoids the kernel complexity, though the 16-bit activation buffers are larger than their 8-bit counterparts and need to be budgeted for.

The sweet spot is 6-8B models on mid-range devices (iPhone 13, Pixel 7) where 8-bit uniform quantization exceeds memory budgets but 4-bit uniform loses too much accuracy. In a recent healthcare app deployment, hybrid quantization enabled a 7B clinical LLM to run on-device with 4.8GB total RAM usage, maintaining 92% accuracy on diagnosis suggestion tasks compared to the full-precision baseline.

Integration with Speculative Decoding

Hybrid quantization composes well with speculative decoding, where a small draft model proposes tokens and a larger target model verifies. We quantize the draft model to uniform 4-bit (acceptable because the target checks every proposed token, so draft errors cost speed rather than output quality) and the target to hybrid 4-bit/8-bit. This keeps both models in memory simultaneously, with the draft at 800MB and the target at 4.3GB, while maintaining verification accuracy above 85%.
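The propose-and-verify loop can be illustrated with a toy greedy variant in plain Python. This is a simplification, not the scheme used in any particular deployment: real implementations verify all proposals in one batched target pass and typically use probabilistic acceptance. The two "models" below are stand-in functions invented for the example.

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One greedy speculative step.
    draft_next / target_next: callables mapping a token list to the next token.
    Returns the tokens accepted this step (always what the target would emit)."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target checks each proposal; stop at the first disagreement.
    accepted, ctx = [], list(prefix)
    for token in proposed:
        expected = target_next(ctx)
        if token != expected:
            accepted.append(expected)   # keep the target's correction
            break
        accepted.append(token)
        ctx.append(token)
    return accepted

# Toy models: the target counts upward mod 10; the draft errs after a 3.
def target(ctx):
    return (ctx[-1] + 1) % 10

def draft(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

tokens = speculative_step(draft, target, prefix=[1], k=4)
```

Two draft tokens are accepted and the third is replaced by the target's own choice, so the output is identical to what the target alone would produce; a weaker draft only lowers the number of tokens accepted per step.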

Implementation Checklist

When implementing hybrid quantization for mobile LLMs, prioritize these steps:

  • Profile activation distributions on calibration data; use 99th percentile for scale computation to avoid outlier skew.
  • Implement per-channel weight quantization with int4 storage and float32 scales; validate perplexity loss stays under 1 point.
  • Write SIMD-optimized 4-bit×8-bit matmul kernels for target architecture (NEON for ARM, SSE4 for x86); template-specialize for common shapes.
  • Quantize activations dynamically per-token using asymmetric min-max; keep KV cache at 8-bit to preserve attention accuracy.
  • Test on distribution-matched data; recalibrate if deployment domain shifts significantly from training data.
  • Measure end-to-end latency and memory; hybrid quantization should deliver 3× memory reduction with under 20% latency increase.
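The final checklist item can be turned into an automated gate. A minimal sketch, using the 14GB and 4.3GB figures from this article; the latency numbers and the function name are hypothetical.

```python
def meets_targets(baseline_gb, quantized_gb, baseline_ms, quantized_ms,
                  min_reduction=3.0, max_latency_increase=0.20):
    """Check the memory-reduction and latency targets from the checklist.
    Returns (ok, reduction_factor, fractional_latency_increase)."""
    reduction = baseline_gb / quantized_gb
    latency_increase = quantized_ms / baseline_ms - 1.0
    ok = reduction >= min_reduction and latency_increase < max_latency_increase
    return ok, reduction, latency_increase

# 14GB -> 4.3GB from the text; 100ms -> 115ms is a hypothetical measurement.
ok, reduction, latency_increase = meets_targets(14.0, 4.3, 100.0, 115.0)
print(f"pass={ok} reduction={reduction:.2f}x latency=+{latency_increase:.0%}")
```

Running a check like this in CI makes regressions visible when kernels or calibration data change.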

Hybrid quantization represents the current Pareto frontier for mobile LLM deployment: aggressive enough to fit large models on consumer hardware, conservative enough to preserve task accuracy. As hardware accelerators add native 4-bit×8-bit instructions (rumored in next-gen mobile SoCs), the compute overhead will vanish, leaving only the memory savings.