Embedding tables are the silent giant in mobile NLP models. A typical multilingual model with a 50,000-entry vocabulary and 768-dimensional embeddings spends about 150MB on the embedding table in float32, often 60-80% of total model size. For on-device deployment, where every megabyte counts against app store limits and user patience, this overhead is unacceptable. Yet embeddings are peculiarly amenable to aggressive quantization: a lookup copies a row rather than accumulating products, so rounding errors do not compound within the layer, and the learned values often cluster tightly around zero.
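The size figures follow directly from vocabulary size, embedding width, and bits per weight. A minimal sketch of that arithmetic, where the helper name table_size_mb is hypothetical and 1MB is taken as 10^6 bytes:

```python
def table_size_mb(vocab_size: int, dim: int, bits_per_weight: int) -> float:
    """Size of an embedding table in megabytes (1 MB = 10**6 bytes)."""
    return vocab_size * dim * bits_per_weight / 8 / 1e6


# Numbers from the text: 50,000 tokens, 768 dimensions.
fp32_mb = table_size_mb(50_000, 768, 32)
int8_mb = table_size_mb(50_000, 768, 8)
print(f"float32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB")
```

This ignores per-row scale and zero-point storage, which adds only a few bytes per row.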
The Embedding Bottleneck in Production
Consider a cross-platform sentiment analysis model shipping in a Flutter app. The transformer backbone—six encoder layers with multi-head attention—compiles to 40MB in ONNX with 8-bit weights. The embedding layer alone: 150MB. Users on 64GB devices don't notice. Users on budget Android phones with 32GB storage and 3GB RAM notice immediately. Cold start climbs from 800ms to 2.1 seconds. Memory pressure triggers background evictions.
The problem compounds in multi-task models. A speech therapy app recognizing phonemes, words, and intent needs three embedding tables. A healthcare chatbot supporting English, Arabic, and transliterated Arabic needs separate vocabularies. Suddenly you're shipping 400MB of lookup tables.
Asymmetric Quantization: The Right Tool
Symmetric quantization maps the float range [-α, α] to int8 [-127, 127] and pins the zero-point to integer zero. That works when a row's values are balanced around zero, but it wastes part of the integer range when they are not: α must cover the larger of |min| and |max|, so the codes on the shorter side go unused. Embedding weights learned via AdamW are typically close to zero-mean, roughly normal with a standard deviation around 0.08 (95% of values between -0.16 and 0.16), yet individual rows are often lopsided.
Asymmetric quantization adds a zero-point z alongside the scale s, computed per row (or per table), and stores each weight w as:
q = round(w / s) + z
Dequantization becomes:
w' = (q - z) * s
This shifts the integer range to match the actual distribution. With the convention above, a row whose values reach further below zero than above gets a positive zero-point (z might be 15), and a row that reaches further above zero gets a negative one, so the full int8 range [-128, 127] covers [min, max] rather than a symmetric interval. Per-row scaling adapts to heterogeneous token frequencies: rare tokens (low gradient updates) often have smaller magnitudes and benefit from finer quantization.
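To make the difference concrete, here is a small sketch that quantizes one lopsided row both ways and compares the worst-case reconstruction error. The row is synthetic, the function names are hypothetical, and the zero-point is chosen so that the row minimum maps to code -128 (one common convention, assumed here).

```python
def quantize_asymmetric(row):
    """Per-row asymmetric int8: q = round(w / s) + z."""
    lo, hi = min(row), max(row)
    s = (hi - lo) / 255 or 1.0           # guard: constant row
    z = -128 - round(lo / s)             # assumption: lo maps to code -128
    q = [max(-128, min(127, round(w / s) + z)) for w in row]
    return q, s, z


def quantize_symmetric(row):
    """Per-row symmetric int8: zero-point fixed at 0, codes in [-127, 127]."""
    alpha = max(abs(w) for w in row) or 1.0
    s = alpha / 127
    q = [max(-127, min(127, round(w / s))) for w in row]
    return q, s, 0


def dequantize(q, s, z):
    """w' = (q - z) * s"""
    return [(qi - z) * s for qi in q]


def max_abs_error(row, quantizer):
    q, s, z = quantizer(row)
    return max(abs(w - w2) for w, w2 in zip(row, dequantize(q, s, z)))


# Hypothetical lopsided row: 768 values spread evenly over [-0.05, 0.20].
row = [-0.05 + 0.25 * i / 767 for i in range(768)]
asym_err = max_abs_error(row, quantize_asymmetric)
sym_err = max_abs_error(row, quantize_symmetric)
print(f"asymmetric: {asym_err:.6f}  symmetric: {sym_err:.6f}")
```

The asymmetric step size is (max - min) / 255, while the symmetric one is max(|min|, |max|) / 127, so the gap grows as the row becomes more lopsided.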
Implementation: ONNX Runtime Custom Ops
ONNX Runtime supports quantized Gather operations, but the built-in op assumes symmetric quantization. For production deployment, we implemented a custom QuantizedEmbedding op in C++ with NEON intrinsics for ARM and AVX2 for x86. The kernel:
- Loads int8 embedding row based on token ID
- Broadcasts scale and zero-point (stored as fp16 to save 50% overhead)
- Performs vectorized dequantization: (vsubq_s8(q_vec, z_vec)) * s
- Writes fp32 output for downstream layers
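The same four steps can be sketched as a scalar reference in Python, which is handy for checking a vectorized kernel against known outputs. This is not the C++ op itself; lookup_dequantized and the tiny table are illustrative assumptions.

```python
from array import array


def lookup_dequantized(token_ids, q_table, scales, zero_points):
    """Reference embedding lookup: int8 rows in, float rows out."""
    out = []
    for tid in token_ids:
        row = q_table[tid]                      # 1. load int8 row by token ID
        s, z = scales[tid], zero_points[tid]    # 2. per-row scale and zero-point
        # 3. dequantize. Python ints are unbounded, so q - z is safe here.
        #    In fixed-width code, q - z spans [-255, 255], which does not fit
        #    in int8, so a kernel would need to widen before subtracting.
        out.append([(q - z) * s for q in row])  # 4. float output
    return out


# Hypothetical 2-token table with 3-dimensional embeddings.
q_table = [array("b", [-128, 0, 127]), array("b", [-10, 5, 20])]
scales = [0.5, 0.25]
zero_points = [0, 5]

vectors = lookup_dequantized([1, 0, 1], q_table, scales, zero_points)
```

A reference like this can run in a unit test next to the native op, comparing outputs on random token IDs.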
On an iPhone 13 Pro (A15), this adds 0.3ms per 512-token batch versus float32 lookup—negligible compared to attention's 12ms. On a Snapdragon 8 Gen 1, the gap is 0.5ms. Memory bandwidth savings (reading 150MB → 40MB) dominate cache behavior, improving total inference latency by 8-15% despite dequantization overhead.
Calibration: Post-Training Quantization
We quantize after training using a calibration dataset of 10,000 representative inputs. For each embedding row:
- Collect all accessed weights during calibration (many rows are never hit)
- Compute min and max per row
- Calculate scale: s = (max - min) / 255
- Calculate zero-point: z = -128 - round(min / s), so that min maps to -128 and max to 127
- Quantize: q = clamp(round(w / s) + z, -128, 127)
Unused rows (e.g., rare Unicode tokens in a Latin-heavy corpus) get default quantization parameters. This matters: a 50K vocabulary typically sees 15K active tokens in production, so we avoid wasting calibration compute on the long tail.
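The calibration loop can be sketched roughly as follows. Several details are assumptions rather than the production pipeline: the zero-point places the row minimum at code -128, the range is widened to include 0.0 so the zero-point stays within int8, and rows never hit fall back to table-wide parameters as the "default".

```python
def range_params(lo, hi):
    """Scale and zero-point mapping [lo, hi] onto int8 codes [-128, 127]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # assumption: keep 0.0 representable
    s = (hi - lo) / 255 or 1.0            # all-zero rows: any positive scale
    z = -128 - round(lo / s)              # lo maps to -128
    return s, z


def calibrate(table, calibration_batches):
    """Return {token_id: (scale, zero_point)} for every row of `table`."""
    hit = set()
    for batch in calibration_batches:     # each batch is a list of token IDs
        hit.update(batch)

    # Assumed default for rows never hit: parameters from the table-wide range.
    default = range_params(min(map(min, table)), max(map(max, table)))

    params = {}
    for tid, row in enumerate(table):
        params[tid] = range_params(min(row), max(row)) if tid in hit else default
    return params


def quantize_row(row, s, z):
    """q = clamp(round(w / s) + z, -128, 127)"""
    return [max(-128, min(127, round(w / s) + z)) for w in row]


# Hypothetical 3-token table; token 2 never appears in the calibration data.
table = [[-0.10, 0.00, 0.30], [-0.20, 0.10, 0.20], [-0.02, 0.01, 0.02]]
params = calibrate(table, [[0, 1], [1, 0, 0]])
codes = quantize_row(table[0], *params[0])
```

Because the weights are fixed after training, the calibration inputs here only decide which rows get their own parameters. The per-row minimum and maximum come from the table itself.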
Accuracy Impact: Negligible in Practice
Across three production models—sentiment (GLUE SST-2), named entity recognition (CoNLL-2003), and intent classification (ATIS)—quantized embeddings degraded F1 by 0.3-0.7 points. Original: 91.2% F1. Quantized: 90.8% F1. Perplexity in a causal language model increased from 18.4 to 18.9. These deltas are invisible to end users and far smaller than variance from hyperparameter tuning.
Why so robust? Embeddings are input representations, not decision boundaries. Downstream layers learn to be invariant to small perturbations. Additionally, the quantization noise is consistent: the same token always gets the same quantized vector, so the model doesn't face distribution shift at inference.
Outlier Handling
In 1-2% of vocabularies, a handful of tokens (often punctuation or special tokens like [CLS]) have embedding norms 3-5× larger than the mean. Quantizing these with the same scale crushes their magnitude. We apply a simple heuristic: if a row's L2 norm exceeds 2.5× the median row norm, store it in fp16 instead of int8. For 200 outlier rows at 768 dimensions, this hybrid approach costs about 0.15MB more than int8 while preserving the critical 70% overall savings.
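The norm heuristic is simple enough to show in a few lines. In this sketch the function name and the toy table are hypothetical, and the 2.5× threshold is the one from the text.

```python
import math
from statistics import median


def find_outlier_rows(table, threshold=2.5):
    """Indices of rows whose L2 norm exceeds threshold x the median row norm."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in table]
    cutoff = threshold * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


# Hypothetical table: four rows with norm about 1, one row with norm 4.
table = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [2.4, 3.2]]
outliers = find_outlier_rows(table)
print(outliers)
```

Rows returned here would be kept in fp16, and the rest quantized to int8. Rerunning the check after every model update catches newly added tokens with large norms.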
Deployment: Cross-Platform Considerations
On iOS, ONNX Runtime with CoreML execution provider can delegate quantized ops to the Neural Engine—but only if you use Apple's symmetric quantization format. We ship two model variants: asymmetric quantized for ONNX Runtime CPU (Android, server), symmetric quantized for CoreML (iOS). A build-time script generates both from the same trained weights. The accuracy gap between the two quantization schemes is under 0.2 F1 points.
On Android, the NNAPI backend has spotty support for custom quantization ops across vendors. We fall back to CPU execution, which is fine: Snapdragon Hexagon DSP and Mali GPU acceleration matter more for convolution-heavy models. For embeddings, memory bandwidth is the bottleneck, and quantization wins regardless of accelerator.
Beyond 8-bit: Mixed-Precision Embeddings
After shipping int8 embeddings in a glucose monitoring app's on-device NLP (analyzing user food logs), we experimented with 4-bit quantization for the 30K least-frequent tokens. Calibration showed these tokens, mostly rare food names and typos, had minimal impact on downstream accuracy. We stored the top 20K tokens in int8 (about 15MB at 768 dimensions) and the tail in int4 (about 11.5MB), plus a 20KB lookup table mapping token IDs to quantization tier. Total size: about 27MB versus 150MB float32 or 40MB uniform int8. F1 dropped 0.4 points versus full int8, but app size decreased by roughly another 12MB, which is critical for markets where users have 16GB devices.
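Two pieces of this scheme are easy to illustrate: assigning tokens to a tier by frequency, and packing two 4-bit codes into each byte. The sketch below assumes signed 4-bit codes in [-8, 7] stored low nibble first. Both the layout and the function names are illustrative choices, not the shipped format.

```python
def assign_tiers(frequencies, n_int8):
    """Bits per weight for each token: 8 for the n_int8 most frequent, else 4."""
    ranked = sorted(range(len(frequencies)), key=lambda t: -frequencies[t])
    top = set(ranked[:n_int8])
    return [8 if t in top else 4 for t in range(len(frequencies))]


def pack_int4(values):
    """Pack signed 4-bit ints in [-8, 7], two per byte, low nibble first."""
    values = list(values)
    if len(values) % 2:
        values.append(0)                  # pad odd-length rows with a zero code
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        out.append((lo & 0x0F) | ((hi & 0x0F) << 4))
    return bytes(out)


def unpack_int4(packed, count):
    """Inverse of pack_int4; `count` drops any padding nibble."""
    vals = []
    for b in packed:
        for nib in (b & 0x0F, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)   # sign-extend 4 bits
    return vals[:count]


tiers = assign_tiers([5, 100, 1, 50], n_int8=2)
packed = pack_int4([-8, 7, 0, -1, 3])
restored = unpack_int4(packed, 5)
```

At lookup time, the tier table tells the kernel whether to read a row as one byte per weight or as packed nibbles.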
Lessons from Production
After deploying quantized embeddings in six apps over 18 months, three patterns emerged:
- Profile before optimizing: One model's bottleneck was actually the positional encoding table (sinusoidal, 4096 positions, about 12MB in float32), not the token embeddings. Quantizing it to int8 saved roughly 9MB with zero accuracy loss.
- Calibration data matters: Using synthetic data for calibration caused a 2.1 F1-point drop. Real user inputs—even just 5,000 samples—recovered 1.8 points.
- Monitor outliers: A model update introduced new special tokens with large norms, causing a 3-point F1 regression until we added them to the fp16 outlier set.
When Not to Quantize Embeddings
If your model is under 20MB total, the engineering overhead isn't worth it. If embeddings are less than 30% of model size, quantize the transformer layers first—attention weight matrices compress better (often 4:1 versus embeddings' 2.5:1). If you're training with quantization-aware training (QAT), symmetric quantization integrates more easily with most frameworks.
But for the common case—large vocabulary, mobile deployment, post-training quantization—asymmetric quantized embeddings deliver the best size-accuracy tradeoff available today. The technique is mature, tooling exists, and the wins are immediate.