Packet Loss Concealment: WebRTC Audio at 8% Drop

The 5% Threshold and Why It Matters

WebRTC audio codecs like Opus are remarkably resilient, but once packet loss climbs above 5%, perceptual quality degrades rapidly. In production P2P voice applications, network conditions frequently spike to 8-12% loss during handoffs, congestion events, or poor last-mile connectivity. Without packet loss concealment (PLC), users hear stuttering, robotic artifacts, or complete dropouts that destroy conversation flow.

The challenge: maintain intelligible speech when roughly 1 in 12 packets (8%) never arrives, using only the receiver-side audio buffer and codec state. Retransmission is not a practical option: latency budgets for real-time voice cap round-trip time at 150-200ms, leaving no room for TCP-style recovery.

Codec-Native PLC: Opus FEC and In-Band Redundancy

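Opus includes in-band forward error correction (FEC) that embeds a lower-bitrate copy of the previous frame inside the current packet. When a packet is lost, the decoder extracts this redundant copy from the next successfully received packet (in libopus, by calling opus_decode on that packet with decode_fec set to 1). Waiting for that next packet adds one frame of latency (typically 20ms). The recovered frame is not bit-exact, because the redundant copy is coded at a lower bitrate than the original, but it is decoded from the real signal rather than synthesized. Note that in-band FEC applies only to Opus's LPC (SILK) layer, so it is unavailable in CELT-only mode.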

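Enabling FEC in libopus means calling opus_encoder_ctl with OPUS_SET_INBAND_FEC(1) and with OPUS_SET_PACKET_LOSS_PERC set to the expected loss rate. Redundancy typically costs 20-30% of the bitstream. libopus generally stays within the configured target bitrate, so that cost comes out of the primary frame's budget unless the target is raised. Keeping the quality of 32kbps voice therefore means configuring roughly 40kbps, which is acceptable for most mobile networks but problematic on metered or congested links.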

The tradeoff: FEC only helps if the next packet arrives. In burst-loss scenarios (common during Wi-Fi roaming), consecutive packets disappear and FEC data is lost alongside primary frames. Empirical testing on mobile networks shows FEC effectiveness drops below 60% once loss becomes bursty with runs of 3+ consecutive drops.
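
The burst-loss limitation follows directly from the one-frame lookback. The sketch below is a toy model, not a codec: it assumes each packet carries a redundant copy of only the immediately preceding frame, as described above, and the function names are hypothetical.

```python
def classify_frames(received):
    """Label each frame by how the receiver can reconstruct it.

    received[i] is True if packet i arrived. Packet i is assumed to carry
    frame i plus a low-bitrate redundant copy of frame i-1.
    """
    labels = []
    for i, arrived in enumerate(received):
        if arrived:
            labels.append("primary")   # decoded normally
        elif i + 1 < len(received) and received[i + 1]:
            labels.append("fec")       # recovered from the next packet
        else:
            labels.append("plc")       # nothing left but concealment
    return labels


def fec_recovery_rate(received):
    """Fraction of lost frames recovered from the next packet's FEC data."""
    lost = received.count(False)
    if lost == 0:
        return 1.0
    return classify_frames(received).count("fec") / lost


# Both patterns lose 3 of 9 packets; only the spacing differs.
isolated = [True, False, True, True, False, True, True, False, True]
bursty = [True, False, False, False, True, True, True, True, True]
print(fec_recovery_rate(isolated))  # 1.0
print(fec_recovery_rate(bursty))    # 0.3333333333333333
```

In a run of consecutive losses, only the last lost frame has a surviving successor, so every earlier frame in the run falls through to concealment.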

Waveform Extrapolation: Phase Vocoder Stretching

When FEC fails or isn't available, the decoder must synthesize missing audio from the last successfully decoded frame. The simplest approach—repeating the final 20ms buffer—creates obvious periodicity artifacts. A more sophisticated method uses phase vocoder time-stretching to extend the tail of the previous frame while preserving pitch and formant structure.

The algorithm operates in the frequency domain: apply a short-time Fourier transform (STFT) with 50% overlap, duplicate magnitude bins while advancing phase estimates linearly based on instantaneous frequency, then inverse-transform. This stretches the waveform by 1.5-2× without pitch shift, masking the gap until the next packet arrives.

Implementation requires maintaining a 40ms sliding window (two 20ms frames) and computing a 512-point FFT, which spans 32ms of audio at a 16kHz sample rate. On mobile, this fits comfortably in L1 cache and executes in under 2ms on ARM Cortex-A55 cores. Phase vocoder PLC produces noticeably smoother output than frame repetition, especially for voiced speech segments where pitch continuity matters.
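
The STFT steps described above can be sketched in pure Python. This is a textbook phase vocoder under stated assumptions, not the implementation measured here: the 50% overlap is applied on the synthesis side, the analysis hop is derived from the stretch factor, a Hann window is used for analysis only, and the names pv_stretch, fft, ifft and wrap are hypothetical. A production version would use an optimized FFT.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def ifft(spec):
    """Inverse FFT via the conjugate trick."""
    n = len(spec)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spec])]


def wrap(phase):
    """Wrap a phase value into [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi


def pv_stretch(x, stretch=2.0, n_fft=512):
    """Time-stretch x (len(x) >= n_fft) by about `stretch`, keeping pitch."""
    hop_s = n_fft // 2                       # synthesis hop: 50% overlap
    hop_a = max(1, int(round(hop_s / stretch)))  # analysis hop
    half = n_fft // 2
    win = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    n_frames = 1 + (len(x) - n_fft) // hop_a
    out = [0.0] * (hop_s * (n_frames - 1) + n_fft)
    prev_phase = [0.0] * (half + 1)
    syn_phase = [0.0] * (half + 1)
    for m in range(n_frames):
        seg = x[m * hop_a:m * hop_a + n_fft]
        spec = fft([s * w for s, w in zip(seg, win)])
        full = [0j] * n_fft
        for k in range(half + 1):
            mag, phase = abs(spec[k]), cmath.phase(spec[k])
            if m == 0:
                syn_phase[k] = phase
            else:
                # Deviation from the bin's nominal advance gives the
                # instantaneous frequency in radians per sample.
                expected = 2 * math.pi * k * hop_a / n_fft
                dev = wrap(phase - prev_phase[k] - expected)
                inst = 2 * math.pi * k / n_fft + dev / hop_a
                syn_phase[k] += hop_s * inst
            prev_phase[k] = phase
            full[k] = cmath.rect(mag, syn_phase[k])  # keep magnitude
            if 0 < k < half:
                full[n_fft - k] = full[k].conjugate()  # real output
        frame = ifft(full)
        for n in range(n_fft):
            out[m * hop_s + n] += frame[n].real  # overlap-add
    return out
```

For concealment, only the portion of the stretched output that extends past the original tail would be played into the gap. The first and last half-window of the output are tapered by the Hann window and would need a crossfade in practice.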

Handling Unvoiced Consonants

Phase vocoder stretching fails for unvoiced phonemes like /s/, /t/, /k/ because these lack periodic structure. A hybrid approach detects voicing from two features of the last decoded frame: zero-crossing rate (ZCR, the fraction of adjacent sample pairs that change sign) and spectral flatness (the ratio of the geometric to the arithmetic mean of the power spectrum). If the frame is unvoiced (ZCR > 0.3 and flatness > 0.6), switch to comfort noise generation instead: generate white noise, shape it with the spectral envelope from the last frame's LPC coefficients, and crossfade over 5ms to avoid clicks.
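
The voicing test can be sketched as follows, using the thresholds from the text. Two details are assumptions: ZCR is normalized per sample pair, and flatness is computed on band-averaged power (16 bands) rather than raw DFT bins, because the raw periodogram of white noise has a flatness of only about 0.56 and would never clear a 0.6 threshold. The function names are hypothetical, and the naive DFT is for readability only.

```python
import cmath
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)


def spectral_flatness(frame, n_bands=16):
    """Geometric mean over arithmetic mean of band-averaged power."""
    n = len(frame)
    power = []
    for k in range(1, n // 2 + 1):  # skip the DC bin
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2)
    width = len(power) // n_bands
    # Small floor keeps log() finite for empty bands.
    bands = [sum(power[b * width:(b + 1) * width]) / width + 1e-12
             for b in range(n_bands)]
    geo = math.exp(sum(math.log(p) for p in bands) / n_bands)
    return geo / (sum(bands) / n_bands)


def is_unvoiced(frame, zcr_min=0.3, flatness_min=0.6):
    """True if the frame should take the comfort-noise path."""
    return (zero_crossing_rate(frame) > zcr_min
            and spectral_flatness(frame) > flatness_min)


# 20ms frames at an assumed 16kHz sample rate.
tone = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(320)]
random.seed(0)
hiss = [random.gauss(0.0, 1.0) for _ in range(320)]
print(is_unvoiced(tone), is_unvoiced(hiss))
```

A 200Hz tone stands in for voiced speech and Gaussian noise for a fricative; real speech frames sit between these extremes, so the thresholds would need tuning against recorded material.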

This dual-mode PLC preserves both tonal speech and fricatives. In A/B testing on SafeChat's WebRTC stack, users rated hybrid PLC as "acceptable" up to 8% loss, versus 5% for phase vocoder alone.
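
The comfort-noise branch of the hybrid approach might look like the sketch below. It is an illustration under assumptions, not SafeChat's code: the LPC order (8), the 16kHz sample rate, and the choice to crossfade against a time-reversed copy of the previous frame's tail are all assumed, since the text specifies only LPC shaping and a 5ms crossfade. All function names are hypothetical.

```python
import math
import random


def autocorrelation(frame, order):
    """Unnormalized autocorrelation for lags 0..order."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1*z^-1 + ...; returns (a, residual_energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # perfectly predictable signal; stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def comfort_noise(last_frame, sample_rate=16000, order=8, fade_ms=5,
                  rng=random):
    """Synthesize one frame of noise shaped like last_frame's envelope."""
    n = len(last_frame)
    r = autocorrelation(last_frame, order)
    if r[0] <= 0.0:
        return [0.0] * n  # silent input: conceal with silence
    a, err = levinson_durbin(r, order)
    gain = math.sqrt(max(err, 0.0) / n)  # match residual power per sample
    noise = []
    for t in range(n):
        y = gain * rng.gauss(0.0, 1.0)   # white excitation
        for j in range(1, order + 1):    # all-pole filter 1/A(z)
            if t - j >= 0:
                y -= a[j] * noise[t - j]
        noise.append(y)
    fade_len = min(n, sample_rate * fade_ms // 1000)
    out = noise[:]
    for t in range(fade_len):
        w = t / fade_len
        # Mirrored tail keeps the waveform continuous at the frame boundary.
        out[t] = (1.0 - w) * last_frame[n - 1 - t] + w * noise[t]
    return out
```

Because the excitation gain is derived from the LPC residual energy, the synthesized frame lands at roughly the same level as the frame it replaces. A real concealer would also attenuate successive concealed frames so that a long outage fades to silence rather than sustaining noise.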

Adaptive Jitter Buffer: Trading Latency for Resilience

PLC only buys time—it cannot recover indefinitely. The jitter buffer must adapt its target depth based on observed loss patterns. A fixed 60ms buffer works well at