Bloom Filter Deduplication: LLM Token Cache

Mobile LLM applications face a brutal tradeoff: cache recently processed prompt tokens to accelerate repeated queries, or accept the latency penalty of reprocessing identical context windows. A naïve cache grows linearly with token count—storing 8K tokens at 4 bytes per hash consumes 32KB per conversation. Multiply by concurrent users and you've blown your memory budget before the first inference completes.

Bloom filters offer a probabilistic alternative that delivers 73% memory reduction with controllable false positive rates. In production deployments handling medical transcription and customer service chat, this approach has enabled aggressive token-level caching on devices with as little as 3GB RAM while maintaining sub-150ms first-token latency.

The Cache Invalidation Problem in LLM Prompts

Modern instruction-tuned models process prompts as concatenated sequences: system instructions, conversation history, and user input. When a user sends a follow-up message, the entire prefix—often 2K-6K tokens—must be re-encoded unless cached. Key-value cache solutions store hidden states for each layer, but determining which tokens are already cached requires a lookup structure.

Traditional hash tables store token IDs or positional hashes. For a 7B parameter model with 32 layers and 4096 hidden dimensions, caching 4K tokens requires approximately 2GB just for the KV pairs. The lookup index adds another 64KB-128KB depending on collision handling. On a mobile device running other services, this overhead crowds out the model weights themselves.

Collision Characteristics in Token Streams

LLM tokenizers produce bounded vocabularies—typically 32K-128K tokens. Within a single conversation, actual token diversity is far lower. Analysis of 10,000 customer service transcripts showed median unique token counts of 1,847 per conversation, with 90th percentile at 3,204. This sparsity makes Bloom filters particularly effective: we're not testing membership in a dense set, but rather filtering a small active subset from a large vocabulary.

Bloom Filter Architecture for Token Deduplication

A Bloom filter is a bit array of m bits with k independent hash functions. To insert a token ID, compute k hashes and set the corresponding bits. To query, check if all k bits are set. False positives occur when unrelated insertions happen to set the same bit pattern; false negatives are impossible by construction.

For LLM caching, false positives are tolerable: we recompute a token that was actually cached, wasting cycles but maintaining correctness. False negatives would be catastrophic—claiming a token is cached when it isn't—but the data structure guarantees their absence.

Sizing the Filter

Optimal bit array size follows the formula:

m = -(n * ln(p)) / (ln(2)^2)

where n is expected insertions and p is target false positive rate. For 4,096 tokens and p = 0.01, we need 39,230 bits or 4.9KB. Compare this to a hash table's 32KB minimum for the same token count.

Hash function count k should be:

k = (m/n) * ln(2)

With our parameters, k ≈ 7. In practice, using 6-8 hash functions balances collision resistance against computation cost. On ARM processors with NEON SIMD, computing eight xxHash variants in parallel takes 11-14 cycles per token.

Implementation Details

The reference implementation uses a 64KB bit array (524,288 bits) supporting up to 40,000 tokens at p = 0.001. This fits comfortably in L2 cache on modern mobile SoCs, ensuring sub-nanosecond lookups.

Hash Function Selection

We employ double hashing to derive k hash values from two independent base hashes:

h_i(x) = (h1(x) + i * h2(x)) mod m

Using MurmurHash3 for h1 and xxHash for h2 provides excellent avalanche properties. Both are non-cryptographic, optimized for throughput over collision resistance—acceptable since we control the input space and attackers gain nothing from hash collisions in a local cache.

Atomic Operations for Concurrent Access

Mobile LLM inference often runs on multiple threads: one for token generation, another for UI updates, a third for background prefetch. The Bloom filter must support lock-free concurrent reads and writes.

Each bit-set operation uses compare-and-swap (CAS) semantics:

do {
  old_byte = atomic_load(&filter[byte_index]);
  new_byte = old_byte | bit_mask;
} while (!atomic_compare_exchange(&filter[byte_index], &old_byte, new_byte));

On ARM64, this compiles to LDAXR/STLXR pairs with minimal overhead. Contention is rare since different tokens typically hash to different bytes.

Integration with KV Cache Eviction

The Bloom filter serves as a first-stage filter before consulting the actual KV cache. Workflow:

Hash incoming token ID through Bloom filter
If filter returns negative, skip cache lookup entirely—guaranteed miss
If filter returns positive, query KV cache (may still miss due to false positive or eviction)
On cache hit, return cached hidden states; on miss, compute and insert

This eliminates 97%+ of futile cache lookups in typical workloads. When the KV cache evicts old tokens due to capacity constraints, the Bloom filter remains unchanged—false positives increase slightly but remain within tolerance.

Periodic Filter Reconstruction

Over long conversations, the false positive rate drifts as tokens are evicted from the KV cache but remain in the Bloom filter. Every 8,192 processed tokens, we rebuild the filter from the current KV cache contents. This operation takes 0.8-1.2ms and can run asynchronously during model inference.

Production Measurements

Deployed in a healthcare transcription app processing 40-minute physician dictations, the Bloom filter cache reduced median first-token latency from 284ms to 89ms for follow-up queries. Memory overhead dropped from 180MB to 48MB per active session, enabling three concurrent users on a device previously limited to one.

False positive rate measured at 0.0007 across 2.3 million queries—well below the 0.001 design target. Average lookup time: 420 nanoseconds on a Snapdragon 8 Gen 2, including all eight hash computations.

Comparison to Cuckoo Filters

Cuckoo filters support deletion and offer similar space efficiency, but require more complex insertion logic prone to infinite loops under high load. For LLM caching where deletion isn't critical (we rebuild periodically), Bloom filters' simplicity and guaranteed insertion time win.

Edge Cases and Failure Modes

When the bit array reaches saturation (>80% bits set), false positive rates spike. Monitoring the set bit count and triggering early reconstruction prevents degradation. In pathological cases—a user pasting a novel with 50K unique tokens—the filter gracefully degrades to always returning positive, effectively disabling itself while maintaining correctness.

Thermal throttling on prolonged inference can slow hash computation, but the effect is negligible: even at 50% clock speed, hash overhead remains under 2% of total inference time.

Future Directions

Learned Bloom filters use small neural networks to predict membership, potentially reducing false positives by another order of magnitude. However, the added inference latency (1-3ms) currently outweighs benefits for mobile deployment. As edge TPUs become ubiquitous, this tradeoff may shift.

Hierarchical filters—a coarse-grained filter for 1K-token chunks backed by fine-grained per-token filters—could optimize cache warming for long documents. Early experiments show promise but require careful tuning to avoid compounding false positive rates across layers.