Convolution reverb, the gold standard for realistic acoustic simulation, has historically been too expensive for real-time mobile audio. A 2-second impulse response at 48 kHz yields 96,000 samples; naive time-domain convolution needs one multiply-accumulate per tap for every output sample, or O(N×M) for an N-sample block and an M-tap response, which makes sub-10 ms latency impractical on battery-constrained devices. Yet users of clinical hearing aids like HearingAid Pro expect transparent environmental processing with spatial realism. The solution lies in precomputed impulse responses, frequency-domain convolution, and careful partitioning.

Why Convolution Reverb Matters in Hearing Aids

Traditional algorithmic reverbs (Schroeder, Freeverb) use cascaded comb and allpass filters to approximate room acoustics. They're cheap (under 5% CPU) but sound metallic and fail to reproduce real-world decay patterns. For hearing aid users adjusting to amplification, unnatural reverb tail artifacts trigger cognitive fatigue. Convolution reverb instead uses measured room impulse responses: you record a space's acoustic signature (typically with a sine sweep or MLS burst), deconvolve it, and use the resulting IR as a filter kernel. The output closely reproduces how the source would sound in that space.

The challenge: a 1.5-second IR at 48 kHz is 72,000 taps. Direct convolution at a 256-sample block size costs 72,000 × 256 ≈ 18.4M MACs per block. At 187.5 blocks/second, that's about 3.5 billion MACs/second: half the thermal budget of an iPhone 12's DSP alone, before any gain compensation or feedback suppression.
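
The arithmetic behind that estimate can be reproduced in a few lines. This is only a back-of-envelope sketch using the figures quoted above (48 kHz, 1.5-second IR, 256-sample blocks); the variable names are illustrative and nothing here is measured on a device.

```python
# Back-of-envelope cost of direct (time-domain) convolution.
# All inputs are the figures quoted in the text, not measurements.
SAMPLE_RATE = 48_000      # Hz
IR_SECONDS = 1.5
BLOCK_SIZE = 256          # samples per audio callback

taps = int(SAMPLE_RATE * IR_SECONDS)           # one tap per IR sample
macs_per_block = taps * BLOCK_SIZE             # every output sample touches every tap
blocks_per_second = SAMPLE_RATE / BLOCK_SIZE
macs_per_second = macs_per_block * blocks_per_second

print(f"taps:            {taps:,}")
print(f"MACs per block:  {macs_per_block:,}")
print(f"blocks/second:   {blocks_per_second}")
print(f"MACs per second: {macs_per_second:,.0f}")
```

Note that the per-second cost does not depend on the block size: it is simply taps × sample rate, so shrinking the block only changes how the work is sliced.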

Offline IR Preprocessing Pipeline

The first optimization happens before the app ships. Raw impulse responses from measurement rigs contain DC offset, pre-ring (causality violations from time-alignment errors), and out-of-band noise. A preprocessing pipeline cleans and optimizes:

  • Trim pre-ring: Cross-correlate with the original stimulus, find the true t=0, and truncate everything before it.
  • Fade tails: Apply an exponential decay window starting at the -60 dB point to suppress the quantization noise floor.
  • Resample to target rate: If the device audio graph runs at 48 kHz, sinc-resample the IR once offline rather than on every frame.
  • Normalize energy: Scale so RMS power matches a reference level to prevent volume jumps when switching IRs.

For a concert hall IR, this typically reduces 4 seconds of raw capture to 1.2 seconds of usable kernel, saving 70% of the memory and compute. Store as 16-bit PCM; 24-bit offers no perceptual benefit for reverb tails below -40 dB.
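
The trimming, fading, and normalization steps of the pipeline above might look roughly like the sketch below. It rests on several assumptions that are not from the original pipeline: pre-ring is removed with a simple onset threshold (a stand-in for cross-correlating against the stimulus, which requires the sweep itself), the -60 dB point is located by Schroeder backward integration, and the fade length and target RMS are arbitrary illustrative values. Resampling is left out because it needs a proper sinc or polyphase resampler. All function names are hypothetical.

```python
import math
import random

def trim_pre_ring(ir, threshold_db=-40.0):
    """Drop samples before the first one within threshold_db of the peak.
    Assumption: a stand-in for the stimulus cross-correlation step."""
    peak = max(abs(s) for s in ir)
    floor = peak * 10.0 ** (threshold_db / 20.0)
    onset = next(i for i, s in enumerate(ir) if abs(s) >= floor)
    return ir[onset:]

def decay_point(ir, drop_db=-60.0):
    """Index where the remaining energy falls drop_db below the total
    (Schroeder backward integration of the squared IR)."""
    total = sum(s * s for s in ir)
    limit = total * 10.0 ** (drop_db / 10.0)
    remaining = total
    for i, s in enumerate(ir):
        if remaining <= limit:
            return i
        remaining -= s * s
    return len(ir)

def fade_tail(ir, start, fade_len=256, fade_db=-60.0):
    """Keep ir[:start], fade exponentially by fade_db over fade_len samples, then cut."""
    end = min(len(ir), start + fade_len)
    out = list(ir[:start])
    for i in range(start, end):
        gain = 10.0 ** (fade_db * (i - start) / fade_len / 20.0)
        out.append(ir[i] * gain)
    return out

def normalize_rms(ir, target_rms=0.1):
    """Scale so the RMS level equals target_rms (an arbitrary reference here)."""
    rms = math.sqrt(sum(s * s for s in ir) / len(ir))
    return [s * target_rms / rms for s in ir] if rms > 0 else list(ir)

def preprocess(ir):
    ir = trim_pre_ring(ir)
    ir = fade_tail(ir, decay_point(ir))
    return normalize_rms(ir)

# Synthetic test IR: 100 samples of faint pre-ring, then decaying noise.
random.seed(0)
raw = [1e-4 * random.uniform(-1, 1) for _ in range(100)]
raw += [random.uniform(-1, 1) * math.exp(-i / 400.0) for i in range(8000)]
clean = preprocess(raw)
print(len(raw), "->", len(clean), "samples")
```

Because RMS depends on the kernel length, normalization runs last, after trimming and fading have fixed the final length.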

Frequency-Domain Convolution with Overlap-Add

The breakthrough is the convolution theorem: convolution in time equals multiplication in frequency. FFT the input block, FFT the IR once at init, complex-multiply, IFFT back. For an N-sample block and M-tap IR, this is O((N+M) log(N+M)) instead of O(N×M). At block size 512 and IR length 48,000, time-domain costs about 24.6M MACs per block; by the same estimate, the FFT route costs on the order of (N+M) log2(N+M) ≈ 48,512 × 15.6 ≈ 755K operations, roughly a 30× reduction before constant factors.
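
To make the theorem concrete, the sketch below convolves two short sequences both ways and prints the results side by side. It deliberately uses a naive O(n²) DFT so that it has no dependencies; that is an illustration choice, not how a real app would do it (a platform FFT would be used instead). Both inputs are zero-padded to len(x) + len(h) - 1 samples so the spectral product corresponds to ordinary linear convolution. The function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) DFT, for illustration only; real code would call an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def direct_convolve(x, h):
    """Textbook time-domain linear convolution."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def spectral_convolve(x, h):
    """Zero-pad both signals to len(x)+len(h)-1, multiply spectra, invert."""
    n = len(x) + len(h) - 1
    X = dft(list(x) + [0.0] * (n - len(x)))
    H = dft(list(h) + [0.0] * (n - len(h)))
    y = dft([a * b for a, b in zip(X, H)], inverse=True)
    return [v.real for v in y]

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, 0.25, 0.125]
print(direct_convolve(x, h))
print([round(v, 6) for v in spectral_convolve(x, h)])
```

The two printed lists should agree to within floating-point rounding.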

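The catch: FFT convolution is circular. You need overlap-add (OLA) or overlap-save to handle block boundaries. OLA zero-pads each input block before the FFT so the result cannot wrap around, then adds the tail that spills past the block into the next block's output. Partitioned convolution additionally splits the IR into L-sample chunks, pairs chunk k with the input block from k blocks earlier, and sums the products. For mobile, uniform partitioning works well: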

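IR_chunks = [IR[0:512], IR[512:1024], IR[1024:1536], ...]
H = [FFT(chunk + 512 zeros) for chunk in IR_chunks]   # precomputed once, 1024-point
X_history = []                                        # input spectra, newest first
overlap_buffer = zeros(512)
for each 512-sample input block:
  X_history.insert(0, FFT(input + 512 zeros))         # one forward FFT per block
  X_history = X_history[0:len(H)]                     # keep one spectrum per chunk
  Y = zeros(1024)
  for k in range(len(X_history)):
    Y += X_history[k] * H[k]                          # chunk k meets the block from k blocks ago
  result = IFFT(Y)                                    # one inverse FFT per block
  output = result[0:512] + overlap_buffer             # add the tail left by the previous block
  overlap_buffer = result[512:1024]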

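Precompute FFT(chunk) once at app launch and cache in memory. For a 1-second IR at 512-sample partitions, that's 94 chunks. Each zero-padded 1024-point real FFT holds 513 complex bins, about 4 KB in single precision, so the cache is roughly 94 × 4 KB ≈ 380 KB. Acceptable on modern devices.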
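
A runnable sketch of uniformly partitioned overlap-add follows, checked against direct convolution on random data. Several things here are assumptions for the sake of a self-contained example: the `PartitionedConvolver` class name, the tiny block size, and the pure-Python recursive FFT, which stands in for whatever optimized FFT the platform provides. The key detail is the frequency-domain delay line: IR chunk k is multiplied by the spectrum of the input block from k blocks ago, the products are summed, and a single inverse FFT is taken per block.

```python
import cmath
import random
from collections import deque

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(X):
    """Inverse FFT with the 1/n normalization applied."""
    n = len(X)
    return [v / n for v in fft(X, inverse=True)]

class PartitionedConvolver:
    """Uniformly partitioned overlap-add convolver (block size B, a power of two)."""

    def __init__(self, ir, block_size):
        self.B = B = block_size
        # Split the IR into B-sample chunks, zero-pad each to 2B, FFT once.
        self.H = []
        for start in range(0, len(ir), B):
            chunk = list(ir[start:start + B])
            self.H.append(fft(chunk + [0.0] * (2 * B - len(chunk))))
        # Frequency-domain delay line: spectra of the most recent input blocks.
        zero = [0j] * (2 * B)
        self.X = deque([zero] * len(self.H), maxlen=len(self.H))
        self.overlap = [0.0] * B

    def process(self, block):
        """Consume exactly B input samples and return B output samples."""
        B = self.B
        self.X.appendleft(fft(list(block) + [0.0] * B))   # newest spectrum first
        Y = [0j] * (2 * B)
        for Xk, Hk in zip(self.X, self.H):                # chunk k, block from k blocks ago
            for i in range(2 * B):
                Y[i] += Xk[i] * Hk[i]
        y = [v.real for v in ifft(Y)]                     # one inverse FFT per block
        out = [y[i] + self.overlap[i] for i in range(B)]  # add the previous block's tail
        self.overlap = y[B:]                              # keep this block's tail
        return out

random.seed(1)
B = 8
ir = [random.uniform(-1, 1) for _ in range(20)]           # 3 chunks, the last one short
signal = [random.uniform(-1, 1) for _ in range(5 * B)]
conv = PartitionedConvolver(ir, B)
streamed = []
for start in range(0, len(signal), B):
    streamed += conv.process(signal[start:start + B])

# Reference: direct convolution, truncated to the streamed length.
reference = [sum(ir[j] * signal[i - j] for j in range(len(ir)) if i - j >= 0)
             for i in range(len(signal))]
print("max error:", max(abs(a - b) for a, b in zip(streamed, reference)))
```

This sketch stores full complex spectra (2B bins per chunk) for simplicity; a real-input FFT would keep only B + 1 bins and roughly halve both the cache size and the multiply count.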

Partition Size Selection

Smaller partitions reduce latency but increase FFT overhead. Larger partitions amortize FFT cost but delay the tail. A hybrid approach uses small partitions (256 samples) for the first 100ms of IR—capturing early reflections with