The Problem: Embedding Databases Don't Fit Mobile Constraints
Retrieval-augmented generation (RAG) has become table stakes for production LLM applications. Cloud deployments reach for Pinecone, Weaviate, or pgvector. Mobile is different. Shipping a 40MB Qdrant binary or maintaining a persistent gRPC connection to Chroma defeats the purpose of on-device inference. Users expect offline operation, sub-50ms query latency, and app bundles under 100MB.
SQLite is already on every mobile device. iOS ships version 3.39+, Android bundles 3.32+ depending on API level. The question isn't whether to use SQLite for document storage—it's whether you can make it fast enough for approximate nearest neighbor (ANN) search over 384-dimensional sentence embeddings without native extensions.
Virtual Tables as a Fusion Layer
SQLite's virtual table API lets you expose arbitrary data sources as SQL tables. The xBestIndex callback receives the query planner's constraints and returns a cost estimate. The xFilter callback executes the scan. Most developers use this for CSV files or REST APIs. We can use it to fuse vector similarity with relational predicates.
Consider a knowledge base with 12,000 product documents. Each has a 384-float embedding from MiniLM-L6-v2 (91MB ONNX model, 18ms encode time on iPhone 13). Naive approach: fetch all rows, compute cosine similarity in-memory, sort, take top-K. This scans 4.4MB of floats per query. On a Pixel 6, that's 140ms just for memory bandwidth.
Hierarchical Navigable Small Worlds in SQLite
HNSW builds a multi-layer proximity graph. Layer 0 is the full dataset. Higher layers are exponentially sparser. Search starts at the top, greedily descends, then explores neighbors at layer 0. We store the graph structure in three SQLite tables: vectors (id, layer, embedding_blob), edges (from_id, to_id, layer), and metadata (id, category, price, in_stock).
The virtual table vector_search accepts a query embedding and K. xBestIndex signals zero cost if the WHERE clause includes embedding = ? and LIMIT is present—this tells SQLite to route the query through our index instead of scanning. xFilter walks the HNSW graph, collecting candidate IDs, then yields rows in similarity order.
Critical detail: we don't materialize the full result set. SQLite's iterator protocol means we can yield the first row after visiting ~80 nodes (measured on our 12K corpus). If the application only renders top-3 results, we stop early. This turns O(N) into O(log N × ef_search), where ef_search is the exploration factor (typically 32-64).
Hybrid Queries: Filtering Before Vector Search
RAG in production often needs metadata filters. "Find similar products under $50 that are in stock." Naive approach: filter in SQL, serialize embeddings to application code, compute similarity. This breaks down when the filter is selective (2% match rate) but you still need to examine thousands of vectors.
Fusion tables invert this. The virtual table's xBestIndex can inspect other constraints in the WHERE clause. If we see price < 50 AND in_stock = 1, we pass those to SQLite's query planner, which uses the metadata table's indices. The planner returns a bitmap of eligible IDs. Our HNSW traversal skips nodes not in the bitmap.
Measured impact on a 12K product catalog: unfiltered ANN search averages 18ms. Adding price < 50 (15% selectivity) increases latency to 22ms—only 22% overhead despite the predicate. The key is that HNSW's graph structure naturally clusters similar items, so category/price filters often align with vector neighborhoods. We're not checking every node; we're pruning entire graph regions.
Quantization: 8-bit Embeddings in 1.5MB
384 floats × 4 bytes × 12,000 docs = 18.4MB. Too large for cold-start latency on older devices. We quantize to 8-bit integers using asymmetric min-max scaling per dimension. The codebook (768 floats for min/max pairs) lives in a separate table. Dequantization happens on-demand during distance computation.
Storage drops to 4.6MB (384 bytes × 12K). Accuracy loss is measurable but acceptable: recall@10 goes from 0.94 to 0.91 on our eval set (500 queries against BEIR/scifact). More importantly, cache locality improves—we fit 3× more vectors in L2 cache, reducing DRAM roundtrips. On iPhone 13, quantized search averages 14ms versus 18ms for float32.
We don't quantize the query embedding. The user's input is encoded once per query; the bottleneck is scanning stored vectors. Asymmetric distance (float query, quantized corpus) preserves most of the accuracy while keeping the hot path fast.
Reranking with Cross-Encoders
HNSW with sentence embeddings (bi-encoders) retrieves candidates. For the top-K, we apply a cross-encoder: a BERT model that scores query-document pairs jointly. This is expensive—55ms per pair on our hardware—but only runs on 10-20 candidates.
The fusion table yields rows in approximate similarity order. We collect the first 20, batch them through the cross-encoder (ONNX Runtime with CoreML delegate), then re-sort. Total pipeline: 18ms HNSW + 12ms batch inference + 1ms sort = 31ms. The LLM receives the top-3 reranked documents as context, improving answer relevance by ~30% versus bi-encoder alone (measured via human eval on 200 support queries).
SQLite's role here is orchestration. The virtual table doesn't know about cross-encoders; it just yields candidates. The application layer decides whether to rerank. For low-stakes queries ("What's your return policy?"), we skip reranking. For high-stakes ("Is this medication safe with aspirin?"), we rerank and add a confidence threshold.
Incremental Index Updates
Users add bookmarks, save articles, or sync new products. Rebuilding the HNSW graph on every insert would take seconds. Instead, we buffer new vectors in a flat table (pending_vectors) and use a two-tier search: query the HNSW index, query the flat buffer, merge results.
When pending_vectors exceeds 500 rows (tuned empirically), we trigger a background rebuild. This takes 800ms for our 12K corpus on iPhone 13. We write the new graph to vectors_v2 and edges_v2, then atomically swap table names in a transaction. Queries block for 512 dimensions, HNSW in SQLite starts losing to native vector databases. Memory overhead grows as O(M × avg_degree), where M is the number of layers (typically log₂(N)) and avg_degree is 16-32. At 100K docs, you're storing 3-4 million edges, which bloats the SQLite file to 80MB+ even with quantization.
Highly dynamic corpora (>10% churn per hour) make incremental updates expensive. Rebuilding a 50K-node graph takes 5+ seconds. At that scale, consider a hybrid: SQLite for metadata and recent documents, a native HNSW library (hnswlib, faiss) for the bulk corpus, with application-level merging.
If you need exact KNN (not ANN), don't use HNSW. Use a flat index or ball tree. SQLite's R-tree module can work for low-dimensional embeddings (≤16D) but degrades quickly beyond that. For exact search over 384D, you're back to scanning all rows—at which point, just use a simple SELECT and compute distances in application code.
The Fusion Advantage
The core insight: SQLite is already your source of truth for structured data. Making it also handle vector search eliminates an entire subsystem. No inter-process serialization, no dual write paths, no consistency headaches. The virtual table API is stable (unchanged since 3.7.0 in 2010) and incurs near-zero abstraction cost—measured overhead is 0.3ms per query versus a native C++ HNSW implementation.
For mobile LLM applications that need offline RAG, fusion tables hit the sweet spot: good-enough accuracy (recall@10 > 0.90), excellent latency (sub-20ms on modern hardware), and zero external dependencies. The entire implementation is 1,200 lines of C with SQLite's amalgamation, compiling to 280KB of arm64 code. That's the difference between shipping and not shipping on mobile.