Huffman Coding for LLM Vocab: 35% Smaller Models

Large language models ship with vocabularies of 32K–50K tokens, each mapped to a dense embedding vector. For a 7B parameter model with 32K vocab and 4096-dimensional embeddings at FP16, the embedding table alone consumes 256MB—roughly 12% of total model size. On mobile devices where every megabyte matters for download time, storage, and memory pressure, this is a significant target for optimization.

Traditional quantization compresses embedding weights from FP16 to INT8 or INT4, but the vocabulary index itself—the mapping from token ID to embedding row—remains fixed-width. A 32K vocabulary requires 15 bits per token ID, typically stored as 16-bit integers. Huffman coding, a lossless variable-length encoding scheme, can compress this index structure by exploiting the non-uniform frequency distribution of tokens in natural language.

Frequency Distribution in LLM Vocabularies

Real-world text exhibits a power-law distribution: common tokens like "the", "and", "to" appear orders of magnitude more frequently than rare tokens like "xylophone" or "antidisestablishmentarianism". In a 10GB English corpus used to train a typical 7B model, the top 1000 tokens account for roughly 70% of all token occurrences, while the bottom 20K tokens collectively represent less than 5%.

Huffman coding assigns shorter bit sequences to frequent tokens and longer sequences to rare ones. Instead of a fixed 15-bit code for every token in a 32K vocabulary, common tokens might use 8–10 bits while rare tokens expand to 18–20 bits. The weighted average across the corpus shrinks to approximately 10.2 bits per token—a 32% reduction in index size.

Corpus-Specific vs Universal Codes

The choice of frequency table matters. A universal Huffman tree built from a large representative corpus (e.g., Wikipedia + Common Crawl) works well for general-purpose models. Domain-specific models—medical, legal, code generation—benefit from corpus-matched trees. In practice, building separate trees for each deployment adds complexity; most implementations use a single tree trained on the model's original training data distribution.

Implementation Architecture

Huffman compression targets two model components: the vocabulary index file (tokenizer) and the embedding weight table lookup. The tokenizer maps input text to token IDs; the embedding table maps token IDs to vectors. Both benefit from variable-length encoding, but the mechanics differ.

Tokenizer Index Compression

Standard tokenizers store a trie or hash map: string → token ID. With Huffman coding, token IDs become variable-length bitstrings. The lookup structure must support prefix-free decoding—no valid code is a prefix of another. A canonical Huffman tree stored as a length table (256 entries for 8-bit symbols, or 32K entries for full vocabulary) enables O(1) decoding with a lookup table approach.

// Pseudocode: Huffman decode with length table
struct HuffmanDecoder {
  uint16_t lengths[32768];  // Bit length for each token
  uint16_t codes[32768];    // Canonical code for each token
  uint16_t symbols[32768];  // Original token ID
};

uint16_t decode(BitStream& stream, HuffmanDecoder& dec) {
  uint32_t code = 0;
  for (int len = 1; len