Jitter Buffers for WebRTC: Playout Delay Tuning

WebRTC's promise of sub-second latency hinges on a component most developers treat as a black box: the jitter buffer. When packets arrive out of order or delayed across cellular networks, the jitter buffer reorders and smooths them before playout. Too small a buffer and you hear glitches; too large and conversational interactivity dies. In production P2P voice apps, where users tolerate 150ms one-way latency but not robotic artifacts, tuning playout delay becomes the difference between a five-star review and an uninstall.

Why Jitter Buffers Exist

UDP provides no ordering guarantees. A packet sent at t=0ms might arrive at t=120ms, while one sent at t=20ms arrives at t=80ms, so the second packet reaches the receiver first. The audio decoder expects samples in strict sequence. The jitter buffer sits between the network and the decoder, holding packets until their scheduled playout time. It absorbs jitter, the variance in inter-arrival time, at the cost of added latency.
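The reordering step can be sketched with a priority queue keyed on sequence number. This is a minimal illustration, not NetEQ's implementation: the `Packet` and `ReorderBuffer` names are hypothetical, sequence-number wraparound is ignored, and it assumes sender and receiver timestamps share a clock, which real RTP does not give you.

```python
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Packet:
    seq: int                                # sequence number (wraparound ignored)
    send_ms: int = field(compare=False)     # sender timestamp, assumed on a shared clock
    arrival_ms: int = field(compare=False)  # local arrival time


class ReorderBuffer:
    """Holds packets and releases them in sequence order once their playout time passes."""

    def __init__(self, playout_delay_ms: int) -> None:
        self.playout_delay_ms = playout_delay_ms
        self._heap: List[Packet] = []

    def push(self, pkt: Packet) -> None:
        heapq.heappush(self._heap, pkt)  # heap is ordered by seq only

    def pop_due(self, now_ms: int) -> List[Packet]:
        """Return packets whose playout time (send time + delay) is <= now_ms."""
        due = []
        while self._heap and self._heap[0].send_ms + self.playout_delay_ms <= now_ms:
            due.append(heapq.heappop(self._heap))
        return due


# The two packets from the text: the later-sent one arrives first.
buf = ReorderBuffer(playout_delay_ms=130)
buf.push(Packet(seq=1, send_ms=20, arrival_ms=80))
buf.push(Packet(seq=0, send_ms=0, arrival_ms=120))
in_order = [p.seq for p in buf.pop_due(now_ms=150)]
```

With a 130ms playout delay both packets are still in the buffer when they are due, so they come out in sending order even though they arrived reversed.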

On Wi-Fi, jitter is typically 5–15ms. On LTE, it spikes to 40–80ms during handoffs. 5G mmWave can hit 200ms+ when transitioning to sub-6GHz. A fixed 50ms buffer handles Wi-Fi but underruns on LTE; a 200ms buffer survives LTE but feels laggy on Wi-Fi. Adaptive algorithms adjust the target delay based on observed network conditions.

Playout Delay Components

Three parameters bound the playout delay:

Minimum delay: The floor below which the buffer won't shrink, typically 20–40ms to absorb packet size variation and scheduling jitter in the OS.
Target delay: The operating point, recalculated every 1–5 seconds based on packet loss and jitter statistics.
Maximum delay: The ceiling, often 400–600ms, beyond which the buffer discards old packets to prevent runaway growth.

WebRTC's NetEQ (the audio jitter buffer in libwebrtc, which Chrome uses) defaults to a minimum of 0ms, target of 60ms, and maximum of 2000ms. These bounds are far too permissive for mobile voice. In SafeChat, a P2P encrypted voice app shipping to 80,000+ users across the Middle East, we set minimum=30ms, target=80ms, maximum=300ms. The 80ms target kept 95th-percentile mouth-to-ear latency under 180ms on LTE, while the 30ms floor prevented underruns during brief Wi-Fi congestion.
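One way to carry these three values around is a small validated config object. The sketch below is an assumption for illustration only: `PlayoutDelayConfig` and `clamp` are hypothetical names and do not correspond to any libwebrtc API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlayoutDelayConfig:
    min_ms: int
    target_ms: int
    max_ms: int

    def __post_init__(self) -> None:
        # Reject configurations where the floor, operating point, and ceiling are out of order.
        if not 0 <= self.min_ms <= self.target_ms <= self.max_ms:
            raise ValueError("expected 0 <= min_ms <= target_ms <= max_ms")

    def clamp(self, proposed_target_ms: float) -> float:
        """Keep an adaptive target inside the configured floor and ceiling."""
        return max(self.min_ms, min(self.max_ms, proposed_target_ms))


# Values quoted in the text for the mobile voice deployment.
MOBILE_VOICE = PlayoutDelayConfig(min_ms=30, target_ms=80, max_ms=300)
```

Any adaptive logic can then propose a target freely and pass it through `clamp`, so the bounds are enforced in one place.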

Adaptive Algorithm Mechanics

NetEQ recalculates target delay from an exponentially weighted moving average of how far each inter-arrival gap deviates from the nominal 20ms packet interval. Simplified:

# All values in milliseconds; 20 is the nominal packet interval.
jitter_estimate = 0.95 * jitter_estimate + 0.05 * abs(arrival_delta - 20)
target_delay = min_delay + 4 * jitter_estimate

The multiplier (4×) trades off resilience and latency. Lower values (2–3×) reduce delay but increase underrun probability. Higher values (5–6×) eliminate underruns but feel sluggish. The optimal multiplier depends on how much late-packet loss is acceptable, meaning packets that miss their playout deadline: 4× targets ~0.1% loss, 5× targets ~0.01%.
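The simplified update above can be wrapped in a small class to experiment with different multipliers. This is a sketch of the simplified formula only, not NetEQ's actual estimator; `JitterEstimator` and its parameter names are hypothetical, and the 30ms floor is taken from the SafeChat settings quoted earlier.

```python
class JitterEstimator:
    """EWMA of the absolute deviation of inter-arrival gaps from the nominal frame interval."""

    def __init__(self, min_delay_ms: float = 30.0, multiplier: float = 4.0,
                 alpha: float = 0.05, frame_ms: float = 20.0) -> None:
        self.min_delay_ms = min_delay_ms
        self.multiplier = multiplier
        self.alpha = alpha          # weight of the newest sample (0.05 in the text)
        self.frame_ms = frame_ms    # nominal packet interval (20ms in the text)
        self.jitter_ms = 0.0

    def update(self, arrival_delta_ms: float) -> float:
        """Feed one inter-arrival gap and return the new target delay in ms."""
        deviation = abs(arrival_delta_ms - self.frame_ms)
        self.jitter_ms = (1.0 - self.alpha) * self.jitter_ms + self.alpha * deviation
        return self.target_delay_ms

    @property
    def target_delay_ms(self) -> float:
        return self.min_delay_ms + self.multiplier * self.jitter_ms


est = JitterEstimator()
on_time = est.update(20.0)    # gap equals the frame interval, so jitter stays at zero
after_gap = est.update(60.0)  # a 40ms deviation nudges the estimate up by 0.05 * 40
```

Because each sample carries only 5% weight, a single late packet moves the target by a few milliseconds; it takes a sustained change in arrival pattern to shift it substantially.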

NetEQ also tracks packet loss rate over a 5-second window. If loss exceeds 5%, it bumps target delay by 20ms and holds it for 10 seconds. If loss drops below 2% for 30 seconds, it decays target delay by 10ms. The gap between the two thresholds, together with the hold time, acts as hysteresis to avoid oscillation, and the slow decay prevents the buffer from shrinking too aggressively after a brief network hiccup.
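The bump, hold, and decay rules can be sketched as a small state machine. This is one reading of the behavior described above, not NetEQ source: `LossAdjuster` is a hypothetical name, the loss rate is assumed to be computed elsewhere over the 5-second window, and "hold" is interpreted as blocking both further bumps and decays until it expires.

```python
from typing import Optional


class LossAdjuster:
    """Tracks a loss-driven offset (ms) to add on top of the jitter-based target."""

    def __init__(self, bump_ms: float = 20.0, decay_ms: float = 10.0,
                 high_loss: float = 0.05, low_loss: float = 0.02,
                 hold_s: float = 10.0, calm_s: float = 30.0) -> None:
        self.bump_ms, self.decay_ms = bump_ms, decay_ms
        self.high_loss, self.low_loss = high_loss, low_loss
        self.hold_s, self.calm_s = hold_s, calm_s
        self.offset_ms = 0.0
        self._hold_until_s = float("-inf")
        self._calm_since_s: Optional[float] = None

    def update(self, now_s: float, loss_rate: float) -> float:
        if loss_rate > self.high_loss:
            self._calm_since_s = None
            if now_s >= self._hold_until_s:          # not inside a hold period
                self.offset_ms += self.bump_ms
                self._hold_until_s = now_s + self.hold_s
        elif loss_rate < self.low_loss:
            if self._calm_since_s is None:
                self._calm_since_s = now_s
            calm_long_enough = now_s - self._calm_since_s >= self.calm_s
            if calm_long_enough and now_s >= self._hold_until_s and self.offset_ms > 0:
                self.offset_ms = max(0.0, self.offset_ms - self.decay_ms)
                self._calm_since_s = now_s           # require another calm period per step
        else:
            self._calm_since_s = None                # between thresholds: no change
        return self.offset_ms
```

Calling `update` once per second with the current windowed loss rate is enough; the class keeps its own timers from the `now_s` values it is given.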

Tuning for Mobile Networks

Cellular networks exhibit bimodal jitter: stable for minutes, then a 200ms+ spike during handoff. A reactive algorithm that adjusts every second will bloat the buffer during the spike and take 30+ seconds to shrink back. A better approach:

Detect handoff events: Monitor round-trip time (RTT) from RTCP reports. A sudden 100ms+ RTT jump signals a handoff. Temporarily raise maximum delay to 500ms for 5 seconds, then restore it to 300ms.
Fast decay on stable jitter: If jitter stays below 30ms for 10 consecutive seconds, halve the decay time from 30s to 15s. This lets the buffer shrink quickly after returning to Wi-Fi.
Clamp target during silence: Voice activity detection (VAD) marks silent periods. During silence, freeze target delay at its current value. This prevents the buffer from shrinking during a pause and then underrunning when speech resumes.
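The second and third rules above are easy to express in code. The sketch below is illustrative only: `DecayPolicy` and `next_target_ms` are hypothetical names, the jitter estimate and the VAD flag are assumed to come from elsewhere in the pipeline, and the thresholds are the ones listed above.

```python
from typing import Optional


class DecayPolicy:
    """Chooses the decay interval: 30s normally, 15s after 10s of jitter below 30ms."""

    def __init__(self, normal_decay_s: float = 30.0, fast_decay_s: float = 15.0,
                 stable_jitter_ms: float = 30.0, stable_for_s: float = 10.0) -> None:
        self.normal_decay_s = normal_decay_s
        self.fast_decay_s = fast_decay_s
        self.stable_jitter_ms = stable_jitter_ms
        self.stable_for_s = stable_for_s
        self._stable_since_s: Optional[float] = None

    def decay_interval_s(self, now_s: float, jitter_ms: float) -> float:
        if jitter_ms < self.stable_jitter_ms:
            if self._stable_since_s is None:
                self._stable_since_s = now_s
            if now_s - self._stable_since_s >= self.stable_for_s:
                return self.fast_decay_s
        else:
            self._stable_since_s = None  # any jitter spike restarts the stability timer
        return self.normal_decay_s


def next_target_ms(current_ms: float, proposed_ms: float, speech_active: bool) -> float:
    """Freeze the target during silence; accept the proposed value during speech."""
    return proposed_ms if speech_active else current_ms
```

For example, `next_target_ms(120, 90, speech_active=False)` keeps the target at 120 so the buffer does not shrink during a pause.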

In SafeChat, we added a handoff detector that watches RTCP RTT. When RTT jumped from 40ms to 180ms (a 4.5× increase), we raised maximum delay to 450ms for 8 seconds. This absorbed the handoff spike without permanent latency increase. Post-handoff, the buffer decayed from 240ms back to 80ms over 20 seconds—fast enough that users didn't notice.
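A detector along these lines might look like the following. It is a minimal sketch, not the SafeChat code: `HandoffDetector` is a hypothetical name, RTT samples are assumed to be extracted from RTCP reports elsewhere, and the jump is measured between consecutive samples using the 100ms threshold suggested earlier.

```python
from typing import Optional


class HandoffDetector:
    """Raises the maximum delay for a short window after a sudden RTT jump."""

    def __init__(self, base_max_ms: int = 300, raised_max_ms: int = 450,
                 jump_ms: float = 100.0, raise_for_s: float = 8.0) -> None:
        self.base_max_ms = base_max_ms
        self.raised_max_ms = raised_max_ms
        self.jump_ms = jump_ms
        self.raise_for_s = raise_for_s
        self._last_rtt_ms: Optional[float] = None
        self._raised_until_s = float("-inf")

    def on_rtt(self, now_s: float, rtt_ms: float) -> None:
        """Feed one RTT sample; a jump of jump_ms or more starts the raised window."""
        if self._last_rtt_ms is not None and rtt_ms - self._last_rtt_ms >= self.jump_ms:
            self._raised_until_s = now_s + self.raise_for_s
        self._last_rtt_ms = rtt_ms

    def max_delay_ms(self, now_s: float) -> int:
        """Current ceiling: raised during the window, base value otherwise."""
        return self.raised_max_ms if now_s < self._raised_until_s else self.base_max_ms


det = HandoffDetector()
det.on_rtt(0.0, 40.0)
det.on_rtt(1.0, 180.0)                 # the 40ms to 180ms jump from the text
during = det.max_delay_ms(now_s=5.0)   # inside the 8-second window
after = det.max_delay_ms(now_s=9.0)    # window has expired
```

Comparing consecutive samples keeps the sketch short; a production detector would more likely compare against a smoothed baseline so that one noisy RTCP report does not trigger it.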

Underrun Recovery

When the buffer empties during playout, the decoder starves. NetEQ handles this with packet loss concealment (PLC): it synthesizes audio by extrapolating the last received frame's pitch and energy. PLC is perceptually acceptable for 20–40ms but degrades into robotic droning beyond 60ms.

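The naive fix, jumping target delay up by 50ms, causes a 50ms audio glitch as the buffer refills. A smoother approach is to slow playout by 2–5% until the buffer reaches target, because the buffer only grows when audio is consumed more slowly than it arrives. NetEQ does this with time-stretching: its preemptive-expand operation lengthens the audio to build the buffer up, while its accelerate operation shortens the audio to drain excess delay. At a 2% slowdown, a 40ms gap closes in 2 seconds, which is imperceptible in voice. Beyond 5%, distortion becomes noticeable.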
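The closing time quoted above is just the gap divided by the fractional rate change. A short helper makes it easy to check other combinations; `time_to_close_gap_s` is a hypothetical name used only for this illustration.

```python
def time_to_close_gap_s(gap_ms: float, rate_change: float) -> float:
    """Seconds of time-stretched playout needed to make up gap_ms of buffer.

    rate_change is the fractional playout rate adjustment, e.g. 0.02 for 2%.
    Each second of playout recovers rate_change * 1000 ms of buffer.
    """
    if rate_change <= 0:
        raise ValueError("rate_change must be positive")
    return gap_ms / (rate_change * 1000.0)


slow = time_to_close_gap_s(40.0, 0.02)  # about 2 seconds, matching the text
fast = time_to_close_gap_s(40.0, 0.05)  # under a second, at the edge of audibility
```

The same arithmetic shows why large gaps cannot be closed by stretching alone: a 200ms gap at 2% would take about 10 seconds.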

We implemented a two-tier recovery:

If buffer drops below 10ms, slow playout by 4% until it refills to 30ms (minimum delay).
If buffer drops to zero (hard underrun), raise target delay by 20ms and slow playout by 2% until target is reached.

This recovered from brief underruns without glitches and from hard underruns with a single 20ms stutter—preferable to 100ms of PLC droning.
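The two tiers reduce to a small decision function. This is a sketch under stated assumptions rather than the shipped code: `RecoveryAction` and `plan_recovery` are hypothetical names, and the rate adjustment is expressed as a playout slowdown, since the buffer only refills when audio is consumed more slowly than it arrives.

```python
from dataclasses import dataclass


@dataclass
class RecoveryAction:
    target_bump_ms: float  # added to the target delay immediately
    slowdown: float        # fractional playout slowdown used to refill (0.0 = none)
    refill_to_ms: float    # buffer level at which the slowdown stops


def plan_recovery(buffer_ms: float, target_ms: float,
                  min_delay_ms: float = 30.0, low_water_ms: float = 10.0) -> RecoveryAction:
    """Pick a recovery tier from the current buffer level."""
    if buffer_ms <= 0.0:
        # Hard underrun: raise the target by 20ms and refill gently at 2%.
        return RecoveryAction(target_bump_ms=20.0, slowdown=0.02,
                              refill_to_ms=target_ms + 20.0)
    if buffer_ms < low_water_ms:
        # Near underrun: refill at 4% until the minimum delay is restored.
        return RecoveryAction(target_bump_ms=0.0, slowdown=0.04,
                              refill_to_ms=min_delay_ms)
    # Healthy buffer: no action needed.
    return RecoveryAction(target_bump_ms=0.0, slowdown=0.0, refill_to_ms=buffer_ms)


hard = plan_recovery(buffer_ms=0.0, target_ms=80.0)
soft = plan_recovery(buffer_ms=6.0, target_ms=80.0)
```

Keeping the decision separate from the audio path makes the thresholds easy to test and tune without touching the time-stretching code.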

Measuring Success

Three metrics matter:

95th-percentile playout delay: Should stay under 150ms for conversational use. Median delay is misleading because it hides tail behavior.
Underrun rate: Percentage of 20ms frames that arrive late. Target