Quantized Embedding Layers: 4-bit Mobile Search

The Embedding Size Problem

Semantic search on mobile devices faces a fundamental constraint: embedding models trained for 768-dimensional vectors at FP32 precision consume roughly 3KB per document vector. A catalog of 100,000 products requires 300MB just for the embedding index before accounting for metadata, making offline-first search impractical for resource-constrained devices.

Traditional approaches—dimensionality reduction via PCA or model distillation—sacrifice retrieval quality. When building search infrastructure for e-commerce applications with tens of thousands of SKUs, the challenge becomes: how do we compress embeddings by 8× while maintaining top-10 recall above 90%?

Asymmetric Quantization Architecture

The key insight is that query and document embeddings can use different quantization schemes. Documents are indexed once and queried thousands of times, while queries are ephemeral. This asymmetry allows aggressive compression of document vectors while keeping query vectors at higher precision.

4-bit Scalar Quantization

For each 768-dimensional document embedding vector v, we compute per-vector scale and zero-point parameters:

scale = (max(v) - min(v)) / 15
zero_point = round(-min(v) / scale)
quantized[i] = clip(round(v[i] / scale) + zero_point, 0, 15)

Each dimension now occupies 4 bits instead of 32, reducing storage by 8×. The scale and zero-point require 8 bytes total per vector (FP32 each), adding negligible overhead for typical embedding dimensions.

Lookup Table Optimization

Computing dot products between FP32 query vectors and INT4 document vectors naively requires dequantization. Instead, we precompute a 16×768 lookup table mapping each possible 4-bit value to its dequantized FP32 equivalent for every dimension:

LUT[dim][quant_val] = (quant_val - zero_point[doc]) * scale[doc]

Query-time dot product becomes a series of table lookups and accumulations, eliminating per-operation floating-point math. On ARM NEON, we pack four 4-bit values into uint16 and use SIMD shuffles to gather four lookup results per instruction.

Calibration and Range Selection

Naive min-max quantization suffers from outliers. A single extreme value can compress the majority of the dynamic range into a few quantization bins, losing precision where it matters most.

Percentile Clipping

We compute the 0.1th and 99.9th percentiles of each embedding dimension across the training corpus and use these as quantization bounds instead of true min/max. Values outside this range are clipped. Empirically, this recovers 3-5% recall on tail queries while sacrificing 1024 dimensions, switching to FP64 accumulators becomes necessary.

Comparison to Alternatives

Product quantization (PQ) achieves similar compression ratios by learning vector quantization codebooks. However, PQ requires offline training on the document corpus and struggles with incremental updates—adding a single document necessitates recomputing cluster assignments. Scalar quantization applies per-vector without global coordination, enabling streaming index construction.

Binary embeddings (1-bit quantization) push compression further but lose too much accuracy for most applications. Hamming distance correlates poorly with cosine similarity at