Most mobile speech-to-text implementations record at a fixed 16 kHz sample rate, streaming raw or compressed audio to a cloud STT service. This works—until network conditions degrade, battery drains, or users move between WiFi and cellular. Adaptive bitrate (ABR) streaming, common in video delivery, rarely appears in audio ML pipelines. Yet the same principle applies: detect link quality, adjust encoding parameters, maintain user experience.
This article explores adaptive sample-rate switching for real-time STT on mobile devices. We cover detection heuristics, resampling without artifacts, model compatibility, and measured impact on accuracy and bandwidth. Production data from a speech therapy app with 40,000 daily sessions informed these numbers.
Why Fixed 16 kHz Fails Under Constraint
Speech recognition models trained on 16 kHz audio (Whisper, Conformer, Wav2Vec2) expect that rate. But streaming 16 kHz mono PCM at 16-bit depth consumes 256 kbit/s uncompressed—or 32–64 kbit/s with Opus. On congested LTE or rural 3G, packet loss climbs above 5%, triggering retransmits and jitter. Battery drain accelerates: continuous microphone sampling, encoding, and socket I/O on cellular radios can pull 400–600 mW sustained.
Telephony systems have operated at 8 kHz (narrowband) for decades. An 8 kHz sample rate captures frequencies up to 4 kHz, and most phonemes remain distinguishable below that limit; fricatives and sibilants suffer, but intelligibility holds. Modern STT models tolerate 8 kHz input if the training corpus included narrowband samples or if you apply upsampling with spectral mirroring. The tradeoff: a 50% bandwidth reduction and lower CPU load, but a 2–4% relative word error rate (WER) increase in quiet conditions. Under packet loss >3%, the gap narrows: 8 kHz with fewer lost frames often outperforms 16 kHz with retransmits.
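The bandwidth figures above follow from simple arithmetic on raw PCM. A minimal sketch in Dart, where the helper name pcmBitrateKbps is an assumption made for illustration:

```dart
/// Uncompressed PCM bitrate in kbit/s (1 kbit = 1000 bits).
/// Hypothetical helper, not part of any library.
double pcmBitrateKbps(int sampleRateHz, {int bitDepth = 16, int channels = 1}) {
  return sampleRateHz * bitDepth * channels / 1000;
}

void main() {
  final wide = pcmBitrateKbps(16000); // 256.0
  final narrow = pcmBitrateKbps(8000); // 128.0
  final saving = 1 - narrow / wide; // 0.5
  print('16 kHz: $wide kbit/s, 8 kHz: $narrow kbit/s, '
      'saving: ${saving * 100}%');
}
```

The 50% figure is exact for raw PCM. With a codec such as Opus, the saving depends on the encoder settings rather than on the sample rate alone.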
Detection Heuristics: When to Downshift
Adaptive switching requires a decision function. Three signals matter:
- Round-trip time (RTT): Measure WebSocket ping latency every 5 seconds. Downshift from 16 kHz when RTT exceeds 200 ms; allow an upshift from 8 kHz only once it falls below 150 ms. Rising RTT indicates congestion.
- Packet loss: Track acknowledged vs. sent chunks. Loss >2% over a 10-second window triggers downshift.
- Battery state: On iOS, query UIDevice.current.batteryLevel (set isBatteryMonitoringEnabled to true first; otherwise the level reads -1.0). Below 20%, prefer 8 kHz unless the device is on WiFi and plugged in.
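The packet-loss signal needs a sliding window rather than a running total, so that old losses age out. The following is one possible sketch, assuming each chunk is recorded with a timestamp and an acknowledgement flag; LossWindow and its methods are hypothetical names, not part of any SDK.

```dart
import 'dart:collection';

/// Tracks chunk delivery over a sliding time window (hypothetical helper).
class LossWindow {
  LossWindow({this.window = const Duration(seconds: 10)});

  final Duration window;
  final Queue<_Sample> _samples = Queue<_Sample>();

  /// Record one chunk outcome; `acked` is false when no ack arrived.
  void record(DateTime at, {required bool acked}) {
    _samples.addLast(_Sample(at, acked));
  }

  /// Fraction of chunks lost within `window` before `now`; 0.0 if none sent.
  double lossRate(DateTime now) {
    final cutoff = now.subtract(window);
    // Drop samples that have aged out of the window.
    while (_samples.isNotEmpty && _samples.first.at.isBefore(cutoff)) {
      _samples.removeFirst();
    }
    if (_samples.isEmpty) return 0.0;
    final lost = _samples.where((s) => !s.acked).length;
    return lost / _samples.length;
  }
}

class _Sample {
  _Sample(this.at, this.acked);
  final DateTime at;
  final bool acked;
}

void main() {
  final w = LossWindow();
  final t0 = DateTime.utc(2024, 1, 1);
  // 100 chunks at 100 ms spacing, every 25th one lost: 4% loss.
  for (var i = 0; i < 100; i++) {
    w.record(t0.add(Duration(milliseconds: 100 * i)), acked: i % 25 != 0);
  }
  final rate = w.lossRate(t0.add(const Duration(seconds: 10)));
  print(rate > 0.02 ? 'downshift' : 'hold'); // prints "downshift"
}
```

Passing the current time in explicitly, instead of reading the clock inside lossRate, keeps the tracker easy to unit test.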
A simple state machine suffices: start at 16 kHz, downshift to 8 kHz if any threshold is breached, and upshift only after 30 seconds of stable conditions. Hysteresis prevents oscillation: the recovery thresholds are stricter than the trigger thresholds, so a link hovering near a limit does not flip back and forth. In a Flutter app using platform channels, this logic runs in a Dart isolate, reading native battery and network APIs every 5 seconds. The overhead is negligible: one syscall per poll.
Sample Implementation Sketch
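import 'dart:async';

enum SampleRate { k8, k16 }

abstract class AdaptiveSTTController {
  SampleRate _current = SampleRate.k16;
  Timer? _pollTimer;
  int _lossCount = 0; // chunks sent but never acknowledged
  int _sentCount = 0; // chunks sent
  int _stablePolls = 0; // consecutive healthy polls while at 8 kHz

  // Platform-specific hooks; implement them in the same library.
  double _measureRTT(); // milliseconds
  double _getBatteryLevel(); // 0.0 to 1.0
  void _reconfigureAudioSession(SampleRate rate);
  void _notifySTTService(SampleRate rate);

  void start() {
    _pollTimer = Timer.periodic(const Duration(seconds: 5), (_) {
      final rtt = _measureRTT();
      // Guard against 0/0 (NaN) before any chunk has been sent.
      final loss = _sentCount == 0 ? 0.0 : _lossCount / _sentCount;
      final battery = _getBatteryLevel();

      if (_current == SampleRate.k16) {
        if (rtt > 200 || loss > 0.02 || battery < 0.2) {
          _switchTo(SampleRate.k8);
        }
      } else if (rtt < 150 && loss < 0.01 && battery > 0.3) {
        // Upshift only after 6 healthy polls in a row (30 s at 5 s per poll).
        if (++_stablePolls >= 6) _switchTo(SampleRate.k16);
      } else {
        _stablePolls = 0;
      }
    });
  }

  void stop() => _pollTimer?.cancel();

  void _switchTo(SampleRate rate) {
    _current = rate;
    _stablePolls = 0;
    _reconfigureAudioSession(rate);
    _notifySTTService(rate);
  }
}
Resampling Without Artifacts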
Switching sample rates mid-stream risks clicks, pops, or phase discontinuities. The audio pipeline must resample atomically between encoded chunks, not mid-frame. On iOS, AVAudioConverter handles polyphase resampling with zero-padding; on Android, use Resampler from Oboe or a hand-rolled sinc interpolator.
Key: maintain a small overlap buffer (64 samples) across the transition. When downshifting from 16→8 kHz, the last 64 samples at 16 kHz are resampled to 32 samples at 8 kHz and prepended to the new chunk. This preserves phase continuity. Upshifting reverses the process. The STT model sees a single contiguous stream, unaware of the switch.
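The overlap step on a downshift can be sketched as follows. This is an illustration under stated assumptions, not the production resampler: decimateBy2 uses a crude 3-tap low-pass filter in place of a proper polyphase or sinc design, and both function names are hypothetical.

```dart
const overlapSamples = 64;

/// Halve the sample rate: 3-tap low-pass [0.25, 0.5, 0.25], then keep every
/// second sample. A crude stand-in for a real polyphase resampler.
List<double> decimateBy2(List<double> input) {
  final out = <double>[];
  for (var i = 0; i + 1 < input.length; i += 2) {
    // Repeat the first sample at the left edge, where no predecessor exists.
    final prev = i > 0 ? input[i - 1] : input[i];
    out.add(0.25 * prev + 0.5 * input[i] + 0.25 * input[i + 1]);
  }
  return out;
}

/// Build the first 8 kHz chunk after a downshift: the resampled tail of the
/// 16 kHz stream (64 -> 32 samples) followed by the freshly captured audio.
List<double> firstChunkAfterDownshift(
    List<double> last16k, List<double> new8k) {
  final start = last16k.length > overlapSamples
      ? last16k.length - overlapSamples
      : 0;
  final tail = last16k.sublist(start);
  return [...decimateBy2(tail), ...new8k];
}

void main() {
  // 20 ms of audio at each rate: 320 samples at 16 kHz, 160 at 8 kHz.
  final last16k = List<double>.generate(320, (i) => i.toDouble());
  final new8k = List<double>.filled(160, 0.0);
  final chunk = firstChunkAfterDownshift(last16k, new8k);
  print(chunk.length); // 192 = 32 overlap + 160 new
}
```

On a real device the platform converter would do the filtering; the point here is only the bookkeeping of carrying 64 samples across the boundary and prepending their 32-sample equivalent.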
Measured artifact rate in production: