Speech Recognition Latency: 60ms End-to-End

When you're building a speech therapy app for children with articulation disorders, latency isn't just a performance metric: it's the difference between effective feedback and confusion. At 200ms round trip, a child sees the waveform visualization lag behind their voice. At 60ms, the experience feels instantaneous. This article dissects the full pipeline from ADC to screen refresh, covering the architectural decisions that matter when every buffer size and thread priority affects clinical outcomes.

The Complete Latency Budget

Most speech recognition latency discussions focus on model inference time. That's misleading. In a production mobile app, inference is one slice of a seven-stage pipeline:

Audio capture buffering: 10-20ms (hardware + OS scheduler)
Preprocessing: 5-15ms (resampling, VAD, normalization)
Feature extraction: 8-12ms (mel-spectrograms, MFCCs)
Model inference: 15-35ms (the part everyone measures)
Post-processing: 3-8ms (beam search, language model fusion)
UI thread dispatch: 2-5ms (main thread scheduling)
Render pipeline: 8-16ms (layout, draw, vsync)

Total: 51-111ms. To consistently hit sub-60ms, you need to optimize every stage, not just throw a faster model at the problem. When building KidzCare, the speech therapy platform, we discovered that audio capture settings alone could swing total latency by 40ms.
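The arithmetic behind that total can be checked with a short sketch. The stage names and ranges are copied from the list above; the dictionary layout and function names are illustrative only.

```python
# Per-stage latency ranges in milliseconds, copied from the budget list.
STAGES_MS = {
    "audio capture buffering": (10, 20),
    "preprocessing": (5, 15),
    "feature extraction": (8, 12),
    "model inference": (15, 35),
    "post-processing": (3, 8),
    "UI thread dispatch": (2, 5),
    "render pipeline": (8, 16),
}


def total_budget(stages):
    """Return (best_case_ms, worst_case_ms) summed over all stages."""
    best = sum(lo for lo, _ in stages.values())
    worst = sum(hi for _, hi in stages.values())
    return best, worst


def inference_share(stages):
    """Fraction of the worst-case total that is spent in model inference."""
    _, worst = total_budget(stages)
    return stages["model inference"][1] / worst


best_ms, worst_ms = total_budget(STAGES_MS)
print(f"total: {best_ms}-{worst_ms} ms")
print(f"inference share of worst case: {inference_share(STAGES_MS):.0%}")
```

Even in the worst case, inference accounts for roughly a third of the total, which is why a faster model alone cannot close the gap.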

Audio Capture: The Hidden Bottleneck

iOS and Android expose different knobs. On iOS, AVAudioSession lets you request a preferred buffer duration (setPreferredIOBufferDuration), but the buffer you are actually granted depends on hardware, sample rate, and system load, so read back ioBufferDuration rather than assuming the request was honored. Request 5ms at 16kHz (80 samples) and you might get 10ms (160 samples). At 48kHz, that same 10ms buffer is 480 samples, three times as much data to process before you can even start feature extraction.

The tradeoff: smaller buffers mean lower latency but higher CPU wake frequency and potential audio glitches under load. Larger buffers are robust but add 15-20ms to the capture stage alone. We settled on 512 samples at 16kHz (32ms buffer) after profiling on iPhone SE 2020 under thermal throttling—the device most likely to drop frames in a school setting.
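The buffer figures quoted in this section all follow from one conversion between sample counts and milliseconds. A minimal sketch, with hypothetical helper names:

```python
def buffer_ms(samples: int, sample_rate_hz: int) -> float:
    """Duration in milliseconds of a buffer holding `samples` frames."""
    return 1000.0 * samples / sample_rate_hz


def buffer_samples(duration_ms: float, sample_rate_hz: int) -> int:
    """Number of frames in a buffer of the given duration."""
    return round(duration_ms * sample_rate_hz / 1000.0)


def callbacks_per_second(samples: int, sample_rate_hz: int) -> float:
    """How often the capture callback fires for a given buffer size."""
    return sample_rate_hz / samples


print(buffer_samples(5, 16_000))         # requested: 5ms at 16kHz
print(buffer_samples(10, 16_000))        # granted: 10ms at 16kHz
print(buffer_samples(10, 48_000))        # same duration at 48kHz
print(buffer_ms(512, 16_000))            # the 512-sample buffer
print(callbacks_per_second(512, 16_000))
```

The last line shows the other side of the tradeoff: halving the buffer doubles how often the CPU has to wake up to service the callback.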

Android's AudioRecord requires explicit buffer sizing. Too small and the recorder fails to initialize or overruns and drops samples under load; too large and latency balloons. The minimum safe buffer is whatever AudioRecord.getMinBufferSize() returns, but that's often 2-3x what you need for low-latency capture. We use a custom ring buffer with double-buffering: one buffer fills while the other processes, reducing effective latency to ~1.5x the hardware buffer size instead of 2x.

Real-Time Thread Priority

On both platforms, audio capture callbacks run on dedicated threads. iOS defaults to a high-priority thread; on Android you have to raise the priority yourself by calling Process.setThreadPriority(Process.THREAD_PRIORITY_URGENT_AUDIO) from the capture thread. Miss this and your preprocessing can be preempted by UI animations, adding 10-30ms of jitter. We also pin the inference thread to performance cores on Android (via sched_setaffinity) to avoid migration overhead during phoneme detection windows.

Feature Extraction: Batching vs. Streaming

Most speech models expect 25ms windows with 10ms stride—overlapping frames for temporal context. Naive implementations wait for 25ms of audio, compute features, then pass to the model. That's already 25ms of unnecessary latency. Instead, maintain a sliding window and compute features incrementally as each 10ms chunk arrives:

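// Pseudocode for streaming feature extraction
ringBuffer.append(newSamples)
// Loop rather than test once: a callback may deliver more than one stride of audio
while ringBuffer.size >= windowSize {
  // The oldest windowSize samples form the next frame
  features = computeMelSpec(ringBuffer.firstN(windowSize))
  model.feedFrame(features)
  // Drop the oldest strideSize samples; the remainder overlaps the next frame
  ringBuffer.slideBy(strideSize)
}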

This approach adds only the stride duration (10ms) to latency, not the full window. The catch: you need careful buffer management to avoid allocations in the hot path. Pre-allocate all FFT buffers and reuse them; even a single 4KB malloc can add 2-3ms on a busy system.
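The buffer discipline described here can be sketched in runnable form. This is a minimal illustration, not the production implementation: the class name and the on_frame callback (standing in for feature computation plus the model call) are hypothetical, and Python is used to show the logic rather than the speed. The window is allocated once and shifted in place, so pushing samples never resizes it.

```python
from array import array


class StreamingFramer:
    """Emit overlapping fixed-size windows from arbitrarily sized audio chunks."""

    def __init__(self, window_size: int, stride: int):
        if not 0 < stride <= window_size:
            raise ValueError("stride must be in 1..window_size")
        self.window_size = window_size
        self.stride = stride
        # Allocated once up front and reused for every frame.
        self._window = array("f", [0.0] * window_size)
        self._filled = 0  # number of valid samples currently in _window

    def push(self, chunk, on_frame):
        """Feed samples; call on_frame(window) each time a full window is ready.

        The window passed to on_frame is reused, so the callback must copy it
        if it needs the data after returning.
        """
        keep = self.window_size - self.stride
        for sample in chunk:
            self._window[self._filled] = sample
            self._filled += 1
            if self._filled == self.window_size:
                on_frame(self._window)
                # Shift the newest `keep` samples to the front, in place.
                for i in range(keep):
                    self._window[i] = self._window[i + self.stride]
                self._filled = keep


# Usage: 25ms windows with a 10ms stride at 16kHz are 400 and 160 samples.
framer = StreamingFramer(window_size=400, stride=160)
frame_count = 0


def count_frame(window):
    global frame_count
    frame_count += 1


for _ in range(10):                 # ten 10ms chunks of silence
    framer.push([0.0] * 160, count_frame)
print(frame_count)
```

After the first window fills, a new frame becomes available every stride, regardless of how the incoming chunks are sized.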

Model Inference: Quantization and Delegation

We tested five on-device ASR architectures: wav2vec 2.0, Whisper Tiny, Conformer-CTC, QuartzNet, and a custom streaming RNN-T. Whisper Tiny (39M params) gave the best accuracy but took 45ms per inference on iPhone 12. QuartzNet (19M params) ran in 18ms but struggled with non-native accents. We landed on a quantized Conformer-CTC (23M params, int8) that runs in 22ms on the Apple Neural Engine and 28ms on Snapdragon 8 Gen 1.

Key optimization: choosing precision layer by layer. Uniform int8 quantization degraded WER by 8% on child speech (higher pitch variance). Hybrid quantization, with fp16 for attention layers and int8 for convolutions, kept the WER delta under 2% while maintaining 25ms inference. In coremltools, the compute_precision argument of ct.convert accepts ct.precision.FLOAT16 for a whole mlprogram, or ct.transform.FP16ComputePrecision(op_selector=...) to choose which ops run in fp16, which is how the attention blocks can be singled out.

Streaming vs. Buffered Inference

Non-streaming models (like Whisper) require the full utterance before transcription. For a 2-second phrase, you wait 2 seconds plus inference time—unacceptable for real-time feedback. Streaming models (RNN-T, CTC) emit partial hypotheses every 30-50ms. The tradeoff: streaming models often have 10-15% higher WER because they lack future context. We mitigate this with a two-pass system: streaming model for live feedback, full-context model for final scoring after the utterance ends.

Post-Processing: Beam Search Budgets

CTC models output per-frame logits; a decoder then has to collapse them into the most likely label sequence, and beam search is the usual choice. Beam width trades accuracy for speed. We profiled beam widths from 1 to 32, with WER reported relative to beam=16:

Beam=1 (greedy): 2ms, +3.2% WER
Beam=4: 5ms, +0.8% WER
Beam=16: 12ms, baseline WER
Beam=32: 23ms, -0.1% WER (not worth it)

We use beam=4 for live transcription and beam=16 for final scoring. The implementation matters: a naive Python-style beam search with list operations takes 18ms; a cache-friendly C++ version with pre-allocated arrays takes 5ms for the same beam width.
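For readers who have not implemented CTC decoding, the sketch below shows greedy decoding next to the textbook CTC prefix beam search. It is not the C++ decoder described above and includes no language model fusion; it is written in Python for readability, and the function names and the two-frame example are made up for illustration.

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_greedy(log_probs, blank=0):
    """Pick the best symbol per frame, merge repeats, drop blanks."""
    out, prev = [], blank
    for frame in log_probs:
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != blank and best != prev:
            out.append(best)
        prev = best
    return tuple(out)


def ctc_beam_search(log_probs, beam_width=4, blank=0):
    """CTC prefix beam search without a language model."""
    # prefix -> (log P(paths ending in blank), log P(paths ending in non-blank))
    beams = {(): (0.0, NEG_INF)}
    for frame in log_probs:
        nxt = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            for sym, lp in enumerate(frame):
                if sym == blank:
                    b, nb = nxt[prefix]
                    nxt[prefix] = (logsumexp(b, p_b + lp, p_nb + lp), nb)
                elif prefix and sym == prefix[-1]:
                    # Repeat with no blank in between: prefix is unchanged.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, logsumexp(nb, p_nb + lp))
                    # Repeat after a blank: prefix grows by one symbol.
                    ext = prefix + (sym,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, logsumexp(nb, p_b + lp))
                else:
                    ext = prefix + (sym,)
                    b, nb = nxt[ext]
                    nxt[ext] = (b, logsumexp(nb, p_b + lp, p_nb + lp))
        ranked = sorted(nxt.items(), key=lambda kv: logsumexp(*kv[1]), reverse=True)
        beams = dict(ranked[:beam_width])
    return max(beams.items(), key=lambda kv: logsumexp(*kv[1]))[0]


# Two frames, two symbols (0 = blank, 1 = "a"), blank slightly favored per frame.
example = [[math.log(0.6), math.log(0.4)]] * 2
print(ctc_greedy(example))
print(ctc_beam_search(example, beam_width=4))
```

In this example greedy decoding returns the empty sequence because blank wins each frame individually, while the beam search returns "a": the three alignments that collapse to "a" carry a combined probability of 0.64 against 0.36 for the all-blank path. That summing over alignments is where the accuracy gap between greedy and wider beams comes from.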

UI Thread Dispatch: Avoiding Main Thread Stalls

Once you have a transcript, you need to update the UI. On iOS, DispatchQueue.main.async adds 1-4ms if the main thread is idle, but 10-50ms if it's busy with layout or animation. We use a dedicated display link callback (CADisplayLink) synced to 60Hz (16.67ms intervals) to batch UI updates. This guarantees updates happen at vsync boundaries, eliminating tearing and reducing perceived latency.

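On Android, Choreographer.postFrameCallback achieves the same effect. We also put the transcription view on a hardware layer with setLayerType(View.LAYER_TYPE_HARDWARE, null), which caches the view's rendering in a GPU texture so it can be composited cheaply. A hardware layer does not skip layout, though: even a simple TextView.setText() can take 8-12ms when it triggers a full layout pass, so giving the view fixed dimensions is worth trying if text updates show up in your traces.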
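One way to batch updates onto the frame callback is a latest-value-wins mailbox: the decoder thread posts transcripts as they arrive, and the frame callback takes only the newest one. The sketch below is in Python to stay platform-neutral; the class and method names are hypothetical, and on a device the take() call would sit inside the CADisplayLink or Choreographer callback.

```python
import threading


class LatestValueMailbox:
    """Holds only the most recent value posted since the last take()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        self._dirty = False

    def post(self, value):
        """Called from the decoder thread; overwrites any unread value."""
        with self._lock:
            self._value = value
            self._dirty = True

    def take(self):
        """Called once per frame; returns (True, value) if something new arrived."""
        with self._lock:
            if not self._dirty:
                return False, None
            self._dirty = False
            return True, self._value


mailbox = LatestValueMailbox()
mailbox.post("k")
mailbox.post("ka")
mailbox.post("kat")          # three partial hypotheses within one frame

changed, text = mailbox.take()
if changed:
    print(text)              # only the newest hypothesis reaches the UI
```

Intermediate hypotheses that arrive between two frames are dropped on purpose, so the main thread performs at most one text update per frame.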

Render Pipeline: Vsync and Triple Buffering

The final stage is getting pixels on screen. At 60Hz, each frame has a 16.67ms budget. If your UI update lands 1ms after vsync, it waits about 15.7ms for the next frame, nearly a full frame of added perceived latency. We use a custom SurfaceView on Android with explicit vsync synchronization to ensure transcription updates land within 2ms of frame start.
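To make the frame-boundary penalty concrete, here is a small sketch. It assumes a fixed refresh rate with vsync ticks at exact multiples of the frame period, which real displays only approximate; the function name is hypothetical.

```python
def wait_for_next_vsync_ms(ready_offset_ms: float, refresh_hz: float = 60.0) -> float:
    """Time an update waits for the next vsync.

    ready_offset_ms is how long after a vsync tick the update became ready.
    """
    period = 1000.0 / refresh_hz
    offset = ready_offset_ms % period
    return 0.0 if offset == 0 else period - offset


print(round(wait_for_next_vsync_ms(1.0), 2))    # just missed the tick
print(round(wait_for_next_vsync_ms(15.0), 2))   # ready shortly before the next tick
```

An update that becomes ready at a random point in the frame waits half a period on average, about 8.3ms at 60Hz, which is why aligning the update with frame start matters.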

On iOS, CAMetalLayer with presentsWithTransaction = false gives similar control. Note that displaySyncEnabled, the property that toggles vsync on CAMetalLayer, is macOS-only; on iOS, presentation is already synchronized to the display. We also pre-render common phoneme visualizations (waveforms, spectrograms) into texture atlases to avoid per-frame draw calls. This dropped render time from 11ms to 4ms on iPhone SE.

Measuring End-to-End Latency

To validate the full pipeline, we built a hardware test rig: a speaker playing reference audio into a microphone, with a high-speed camera (240fps, so about 4.2ms per frame) filming both the audio waveform on an oscilloscope and the phone screen. Frame-by-frame analysis showed:

Before optimization: 127ms average, 89ms best case, 203ms worst case
After optimization: 58ms average, 51ms best case, 78ms worst case

The worst-case improvement came from fixing a bug where the beam search occasionally re-allocated its hypothesis buffer, causing a 40ms stall. Lesson: average latency is misleading; p95 and p99 matter more for perceived responsiveness.
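The gap between the mean and the tail is easy to see with a nearest-rank percentile. The sketch below uses synthetic numbers chosen only to illustrate the effect; they are not measurements from the test rig, and the function name is hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic example: 90 fast responses and 10 that hit an occasional stall.
latencies_ms = [55] * 90 + [95] * 10

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                       # the mean still looks like it meets a 60ms target
print(percentile(latencies_ms, 50))
print(percentile(latencies_ms, 95))  # the tail shows the stall
print(percentile(latencies_ms, 99))
```

In this made-up distribution the mean sits just under 60ms, yet one interaction in ten takes 95ms, which the p95 and p99 values expose and the mean hides.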

Tradeoffs and Future Directions

Hitting 60ms required sacrificing some accuracy. Our streaming model has 2.1% higher WER than the offline version. For speech therapy, that's acceptable—clinicians care more about phoneme-level timing than perfect transcription. For dictation apps, the tradeoff might not work.

Looking ahead, on-device model compilation (Core ML's mlprogram format, TFLite's XNNPACK delegate) will likely push inference under 15ms for models in the 30-50M parameter range. The next bottleneck will shift back to audio capture and feature extraction, areas where OS-level improvements (like Android's AAudio low-latency API) will matter more than application-level optimizations.

For now, the recipe is clear: measure every stage, optimize the slowest, repeat. Latency is a systems problem, not a model problem.