Shipping multiple specialized LLMs in a single mobile app (a 1.5B summarizer, a 3B chat model, a 400M classifier) raises an immediate question: which model handles which user request? A naive approach routes everything to the largest model or forces users to pick. The first wastes compute; the second degrades UX. The solution is multi-model routing: a lightweight decision layer that dispatches each task to the right model in under 100ms, preserving the illusion of a single intelligent system.

This pattern emerged in production when building OfflineAI, where users expected natural language queries to trigger summarization, Q&A, or sentiment analysis without manual mode switching. The routing layer had to run entirely on-device, consume minimal battery, and fail gracefully when confidence was low. Here's the architecture that shipped.

Why Multiple Models Beat One Large Model

A single 7B-parameter model can theoretically handle every task, but on mobile it's a poor tradeoff. Memory pressure forces aggressive quantization (often 4-bit), which degrades output quality. Inference latency climbs to 800ms–2s per token on midrange devices. Battery drain becomes user-visible within 20 minutes of sustained use.

Specialized models exploit task structure. A 400M BERT-style classifier fine-tuned on intent detection runs in about 40ms and draws roughly 12% of the power a 7B generative model does. A 1.5B summarizer with a 512-token context window outperforms a quantized 7B model on document condensation because, within the same memory budget, its smaller weight matrices can be kept at higher precision. The engineering challenge is orchestrating these models without adding latency or complexity.
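To make the memory argument concrete, here is a rough back-of-the-envelope sketch. It counts weight storage only (parameters × bits ÷ 8) and ignores the KV cache, activations, and runtime overhead. The bit widths chosen for the summarizer and the 7B model are assumptions for illustration; the article only states 8-bit for the classifier and "often 4-bit" for a 7B model.

```python
def weight_footprint_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1e9 bytes): params * bits / 8."""
    return params * bits_per_weight / 8 / 1e9


# Bit widths below are illustrative assumptions, not measured configurations.
models = [
    ("7B generalist, 4-bit", 7e9, 4),
    ("1.5B summarizer, 8-bit", 1.5e9, 8),
    ("400M classifier, 8-bit", 400e6, 8),
]

for name, params, bits in models:
    print(f"{name}: {weight_footprint_gb(params, bits):.2f} GB")
```

Under these assumptions the 1.5B summarizer at 8-bit needs less than half the weight memory of the 7B model at 4-bit, which is the sense in which a smaller model can afford higher precision.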

Three-Tier Routing Architecture

The routing stack has three layers: a fast intent classifier, an embedding-based similarity check, and a confidence-gated fallback chain. The layers run in sequence, and a request exits at the first layer that produces a confident match, so later layers only run when earlier ones are inconclusive.
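The sequential, early-exit structure can be sketched as a chain of functions that each return either a target model or nothing. This is a minimal illustration, not the shipped code: the `Tier` type, the `route` function, the keyword-matching stubs, and the `"chat"` default are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Sequence

# A tier inspects the query and returns a target model name, or None to pass.
Tier = Callable[[str], Optional[str]]


def route(query: str, tiers: Sequence[Tier], default: str) -> str:
    """Run tiers in order and stop at the first one that returns a target."""
    for tier in tiers:
        target = tier(query)
        if target is not None:
            return target  # early exit: later tiers are never invoked
    return default  # no tier produced a confident match


# Hypothetical stubs standing in for the real tiers; they record which ran.
calls: List[str] = []


def intent_tier(query: str) -> Optional[str]:
    calls.append("intent")
    return "summarizer" if "summarize" in query.lower() else None


def similarity_tier(query: str) -> Optional[str]:
    calls.append("similarity")
    return "summarizer" if "tl;dr" in query.lower() else None


print(route("Summarize this article", [intent_tier, similarity_tier], "chat"))
print(calls)  # only the first tier ran for this query
```

Keeping each tier behind the same small interface means the cost of a request is bounded by the first tier that is confident about it.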

Tier 1: Intent Classification

A 400M-parameter DistilBERT-style model fine-tuned on 8 intent classes (summarize, question, sentiment, translate, rewrite, define, compare, other) processes the first 128 tokens of user input. The model outputs a softmax distribution over intents in 35–50ms on an iPhone 12. If the top class's probability exceeds 0.85, routing stops here and the request is dispatched to the corresponding specialist model.
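The confidence gate itself is only a few lines. The sketch below assumes the classifier exposes raw logits in the class order listed above; `classify_intent` and the example logits are hypothetical, and mapping an intent label to a concrete model is left out.

```python
import math
from typing import List, Optional, Sequence

INTENTS = ("summarize", "question", "sentiment", "translate",
           "rewrite", "define", "compare", "other")
CONFIDENCE_THRESHOLD = 0.85  # threshold stated in the text


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_intent(logits: Sequence[float]) -> Optional[str]:
    """Return the top intent if its probability exceeds the threshold, else None."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] > CONFIDENCE_THRESHOLD:
        return INTENTS[best]
    return None  # not confident enough: defer to the next tier


# Made-up logits: one clear winner, then two near-tied classes.
print(classify_intent([6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
print(classify_intent([1.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```

Returning `None` rather than a low-confidence label keeps the decision to escalate explicit in the caller.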

The classifier uses llama.cpp's GGML format with 8-bit quantization. The training set was 40,000 labeled examples drawn from product support tickets, forum posts, and synthetic queries generated by GPT-4. The key insight: intent classification is a much easier problem than generation, so a small model can reach 94% accuracy and leave the larger models to focus on output quality.

Tier 2: Embedding Similarity

When the top intent's confidence does not exceed 0.85, the router computes a 384-dimensional sentence embedding using a MiniLM model (22MB, 15ms latency). This embedding is compared via cosine similarity against 200 cached exemplar embeddings representing canonical queries for each model. For example, the summarizer's exemplars include "give me the key points," "what's the tl;dr," and "condense this document."

If the similarity to any exemplar exceeds 0.78, the query routes to the model that exemplar represents. This tier catches paraphrases and informal language that the classifier misses. Exemplars are stored in a memory-mapped file and updated via over-the-air config, which allows post-launch tuning without an app update.
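The exemplar lookup can be sketched as a nearest-neighbor search with a cutoff. This version uses toy 3-dimensional vectors in place of the 384-dimensional embeddings, and it picks the single best-scoring exemplar when several clear the threshold; that tie-breaking rule, the `match_exemplar` name, and the example model names are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.78  # threshold stated in the text


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_exemplar(
    query_vec: Sequence[float],
    exemplars: Sequence[Tuple[str, Sequence[float]]],
) -> Optional[str]:
    """Return the model owning the most similar exemplar, if above threshold."""
    best_model: Optional[str] = None
    best_sim = -1.0
    for model, vec in exemplars:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_model, best_sim = model, sim
    return best_model if best_sim > SIMILARITY_THRESHOLD else None


# Toy exemplars; real ones would be embeddings of canonical queries.
EXEMPLARS: List[Tuple[str, Sequence[float]]] = [
    ("summarizer", [1.0, 0.0, 0.0]),
    ("chat", [0.0, 1.0, 0.0]),
]

print(match_exemplar([0.9, 0.1, 0.0], EXEMPLARS))  # close to the summarizer exemplar
print(match_exemplar([1.0, 1.0, 0.0], EXEMPLARS))  # equidistant, below threshold
```

With only a few hundred exemplars a linear scan like this is cheap; if the exemplar vectors are normalized ahead of time, the cosine reduces to a dot product.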

Tier 3: Fallback Chain

If both tiers fail (confidence