Jitter Buffer Tuning for Low-Latency Speech Apps

Real-time speech applications—VoIP clients, telemedicine platforms, speech therapy tools—demand sub-200ms end-to-end latency while tolerating unpredictable network jitter. The jitter buffer sits at the heart of this trade-off: too small and late packets cause audible gaps; too large and conversational flow breaks down. Designing an adaptive jitter buffer that responds to network conditions without introducing perceptible delay requires understanding packet arrival statistics, playout scheduling, and human speech perception.

The Jitter Buffer Problem

IP networks deliver packets with variable delay. A packet sent at t=0ms might arrive at t=50ms, while the next arrives at t=120ms. Without buffering, the audio renderer would play samples immediately upon arrival, creating stuttering playback. The jitter buffer absorbs this variance by holding packets briefly before playout, smoothing arrival jitter into a steady stream.

The core challenge is that network conditions change. A Wi-Fi handoff might spike jitter from 10ms to 80ms; once congestion on a cell tower clears, jitter might fall back to 15ms. Static buffers waste latency during good conditions and underrun during bad ones. Adaptive algorithms adjust buffer depth in real time, but naive implementations either oscillate or react too slowly.

Target Delay and Playout Scheduling

Each arriving packet carries a timestamp and sequence number. The jitter buffer maintains a target delay—the time between packet arrival and playout. If a packet arrives at system time T with RTP timestamp R, it should play at T + target_delay, adjusted for clock drift and initial offset.
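
As a rough illustration of this scheduling rule, the Swift sketch below anchors playout to the first packet's arrival and maps later RTP timestamps onto the local clock. The type `PlayoutClock`, its member names, and the 48kHz clock rate are assumptions made for this example, and clock-drift compensation is deliberately left out.

```swift
/// Hypothetical helper: converts RTP timestamps into local playout times.
struct PlayoutClock {
    let clockRate: Double                 // RTP clock rate in Hz (assumed 48 kHz)
    var targetDelay: Double               // seconds between arrival and playout
    private var baseArrival: Double?      // arrival time of the first packet
    private var baseTimestamp: UInt32 = 0 // RTP timestamp of the first packet

    init(clockRate: Double = 48_000, targetDelay: Double = 0.100) {
        self.clockRate = clockRate
        self.targetDelay = targetDelay
    }

    /// Returns the local time (seconds) at which the packet should be played.
    mutating func playoutTime(arrival: Double, rtpTimestamp: UInt32) -> Double {
        let base: Double
        if let existing = baseArrival {
            base = existing
        } else {
            // First packet: its arrival defines the initial offset.
            base = arrival
            baseArrival = arrival
            baseTimestamp = rtpTimestamp
        }
        // Wrapping subtraction keeps the result correct across 32-bit rollover.
        let elapsedTicks = Int32(bitPattern: rtpTimestamp &- baseTimestamp)
        return base + Double(elapsedTicks) / clockRate + targetDelay
    }
}
```

Because later packets are scheduled from their RTP timestamps rather than from their own arrival times, a packet that arrives 50ms late still lands on the same playout grid as long as it beats its deadline.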

A fixed target delay of 100ms works well on stable networks but fails under two scenarios: sudden jitter spikes cause late packets (playout misses), and improved conditions leave 100ms of unnecessary latency. The algorithm must track arrival patterns and adjust the target dynamically.

Measuring Network Jitter

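Effective adaptation requires robust jitter estimation. The RFC 3550 interarrival jitter calculation applies exponential smoothing (gain 1/16) to the absolute difference in transit time between consecutive packets. With S_i the RTP timestamp of packet i and R_i its arrival time in the same units:

D = (R_i - R_(i-1)) - (S_i - S_(i-1))
J = J + (|D| - J) / 16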
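
A minimal Swift sketch of the smoothing update above, assuming send and arrival times have already been converted to seconds. The type `JitterEstimator` and its parameter names are hypothetical; only the 1/16 gain comes from RFC 3550.

```swift
/// Running jitter estimate in the style of RFC 3550, section 6.4.1.
struct JitterEstimator {
    private(set) var jitter: Double = 0      // smoothed jitter, in seconds
    private var previousTransit: Double?     // transit time of the previous packet

    /// `arrival` is the local receive time; `sent` is the RTP timestamp
    /// converted to seconds. Both names are assumptions for this sketch.
    @discardableResult
    mutating func update(arrival: Double, sent: Double) -> Double {
        let transit = arrival - sent
        if let last = previousTransit {
            // |D| is the change in transit time between consecutive packets.
            let d = abs(transit - last)
            jitter += (d - jitter) / 16
        }
        previousTransit = transit
        return jitter
    }
}
```

With the gain fixed at 1/16, a single 50ms step in transit time moves the estimate by only about 3ms, which is why the estimate lags behind abrupt changes.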

This provides a smoothed estimate but reacts slowly to step changes. For speech applications, we need faster response. A percentile-based approach tracks the 95th percentile of recent inter-arrival deltas over a sliding 2-second window. When P95 exceeds current target delay by 20ms, the buffer expands; when it drops below target minus 30ms for 3 seconds, it contracts.

In a speech therapy app handling 20ms audio frames at 48kHz, we maintain a circular buffer of the last 100 inter-arrival measurements (about 2 seconds of audio). Every 500ms, we compute P95 and P50. If P95 > target + 20ms, increment target by 10ms (capped at 200ms). If P50 < target - 30ms for 6 consecutive checks, decrement by 10ms (floored at 40ms). Under these rules the buffer can grow by one 10ms step on every 500ms check but shrinks by one step only after 3 seconds of sustained improvement: fast enough to adapt to handoffs, slow enough to avoid chasing noise.
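
The rule set above could be sketched in Swift roughly as follows. `AdaptiveTarget`, its method names, the 100ms starting target, and the nearest-rank percentile are assumptions for illustration; a plain array stands in for the circular buffer to keep the example short.

```swift
/// Hypothetical controller for the percentile-based rule described above.
struct AdaptiveTarget {
    private(set) var targetMs: Int
    private var window: [Double] = []   // most recent inter-arrival deltas, in ms
    private var calmChecks = 0          // consecutive checks with P50 well below target

    init(initialTargetMs: Int = 100) {
        targetMs = initialTargetMs
    }

    /// Stores one measurement, keeping only the latest 100.
    mutating func record(deltaMs: Double) {
        window.append(deltaMs)
        if window.count > 100 { window.removeFirst() }
    }

    /// Nearest-rank percentile over the current window.
    func percentile(_ p: Double) -> Double? {
        guard !window.isEmpty else { return nil }
        let sorted = window.sorted()
        let rank = Int((p * Double(sorted.count)).rounded(.up))
        return sorted[max(0, min(sorted.count - 1, rank - 1))]
    }

    /// Intended to be called every 500 ms.
    mutating func check() {
        guard let p95 = percentile(0.95), let p50 = percentile(0.50) else { return }
        let target = Double(targetMs)
        if p95 > target + 20 {
            targetMs = min(targetMs + 10, 200)   // expand, capped at 200 ms
            calmChecks = 0
        } else if p50 < target - 30 {
            calmChecks += 1
            if calmChecks >= 6 {                 // 6 checks x 500 ms = 3 s
                targetMs = max(targetMs - 10, 40) // contract, floored at 40 ms
                calmChecks = 0
            }
        } else {
            calmChecks = 0
        }
    }
}
```

Resetting the calm counter whenever the contraction condition fails is one way to enforce "consecutive" checks; a production version might also want hysteresis between the expand and contract thresholds.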

Handling Late Packets

Even with adaptation, packets arrive late. When the playout time passes before a packet arrives, we have three options: skip the frame (silence), conceal the gap with packet loss concealment (PLC), or insert the late packet if the delay is small. For speech, [...]

            [...] = 100 {
                let p95 = arrivalTimes.percentile(0.95)
                if p95 > Double(targetDelayMs) + 20 {
                    targetDelayMs = min(targetDelayMs + 10, 200)
                }
            }
            packets.enqueue(packet)
        }

        func dequeue(playoutTime: Double) -> AudioPacket? {
            guard let packet = packets.peek() else { return nil }
            let scheduledTime = packet.arrivalTime + Double(targetDelayMs) / 1000
            if playoutTime >= scheduledTime {
                return packets.dequeue()
            }
            return nil
        }
    }
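
Since the listing above is incomplete, here is a self-contained sketch in the same spirit that can be compiled on its own. The `AudioPacket` fields, the array-backed queue, and the way inter-arrival deltas are collected are all assumptions for this example rather than a reconstruction of the original type; the sketch also does not reorder packets by sequence number.

```swift
/// Hypothetical packet type; the field names are assumptions.
struct AudioPacket {
    let sequence: UInt16
    let arrivalTime: Double   // seconds on the local clock
    let payload: [UInt8]
}

final class JitterBuffer {
    private(set) var targetDelayMs = 100
    private var packets: [AudioPacket] = []      // FIFO in arrival order
    private var arrivalDeltasMs: [Double] = []   // last 100 inter-arrival gaps
    private var lastArrival: Double?

    func enqueue(_ packet: AudioPacket) {
        if let last = lastArrival {
            arrivalDeltasMs.append((packet.arrivalTime - last) * 1000)
            if arrivalDeltasMs.count > 100 { arrivalDeltasMs.removeFirst() }
        }
        lastArrival = packet.arrivalTime

        // Expand the target once a full window shows P95 well above it.
        if arrivalDeltasMs.count >= 100 {
            let p95 = arrivalDeltasMs.sorted()[94]   // nearest-rank P95 of 100 samples
            if p95 > Double(targetDelayMs) + 20 {
                targetDelayMs = min(targetDelayMs + 10, 200)
            }
        }
        packets.append(packet)
    }

    /// Returns the head packet once its scheduled playout time has arrived.
    func dequeue(playoutTime: Double) -> AudioPacket? {
        guard let packet = packets.first else { return nil }
        let scheduledTime = packet.arrivalTime + Double(targetDelayMs) / 1000
        guard playoutTime >= scheduledTime else { return nil }
        return packets.removeFirst()
    }
}
```

A caller would poll `dequeue(playoutTime:)` from the audio thread and treat a `nil` result as the cue to conceal the gap.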

The Audio Unit callback requests 512 samples every 10.7ms at 48kHz. The jitter buffer dequeues packets when their scheduled playout time arrives, mixing decoded frames into the output buffer. If no packet is ready, PLC generates synthetic audio.
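
One detail worth sketching is the size mismatch between 20ms frames (960 samples at 48kHz) and 512-sample render requests. The Swift example below keeps leftover samples between calls and falls back to a deliberately naive concealment (replaying the last frame at half amplitude) when no frame is due. `RenderFeeder`, the `nextFrame` closure, and the concealment strategy are assumptions for illustration, not a real PLC algorithm or the Audio Unit API.

```swift
/// Hypothetical adapter between decoded frames and fixed-size render requests.
struct RenderFeeder {
    let frameSize = 960                  // 20 ms at 48 kHz
    private var pending: [Float] = []    // decoded samples not yet rendered
    private var lastFrame: [Float] = []  // kept for naive concealment

    /// `nextFrame` returns a decoded frame if one is due for playout, else nil.
    mutating func render(sampleCount: Int, nextFrame: () -> [Float]?) -> [Float] {
        while pending.count < sampleCount {
            if let frame = nextFrame(), !frame.isEmpty {
                lastFrame = frame
                pending += frame
            } else {
                // Placeholder concealment: fade the previous frame, or emit
                // silence if nothing has been decoded yet.
                let fill = lastFrame.isEmpty
                    ? [Float](repeating: 0, count: frameSize)
                    : lastFrame.map { $0 * 0.5 }
                lastFrame = fill
                pending += fill
            }
        }
        let output = Array(pending.prefix(sampleCount))
        pending.removeFirst(sampleCount)
        return output
    }
}
```

Repeated concealment halves the amplitude each time, so a long outage decays toward silence instead of looping the same frame at full volume.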

Measuring Real-World Performance

In production VoIP apps, we instrument three metrics: P95 end-to-end latency, underrun rate (gaps per minute), and late-packet discard rate. On stable Wi-Fi, adaptive buffering achieves 60-80ms latency with