Mel-frequency cepstral coefficients (MFCCs) remain the workhorse feature vector for speech recognition, speaker verification, and acoustic analysis in mobile audio apps. The canonical pipeline—preemphasis, windowing, FFT, mel-filterbank, log, DCT—typically runs frame-by-frame at 10–25ms intervals. On a modern ARM SoC, a naive implementation consumes 1.2–1.8ms per 25ms frame for 13 coefficients at 16kHz, leaving scant budget for downstream inference or UI rendering. For real-time speech therapy apps like KidzCare or hearing aid DSP in HearingAid Pro, every microsecond counts.

The Redundant Memory Round-Trip

Standard MFCC libraries execute the FFT and DCT as separate passes. After the FFT produces complex spectrum bins, the code writes mel-weighted log-magnitudes to a scratch buffer, and the DCT then reads that buffer back. On cores such as the ARM Cortex-A76 (64KB L1 data cache) or Apple's A-series, this round-trip can evict hot loop data and stall the pipeline. Profiling with Instruments on an iPhone 13 Pro attributes 38% of total MFCC cycles to memory traffic between the FFT and DCT stages.

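The insight: mel-filterbank summation and DCT basis multiplication are both linear operations over frequency bins, so their weights can be multiplied into a single matrix ahead of time. A kernel built on that matrix streams FFT outputs directly into DCT accumulators, eliminating the intermediate buffer. One caveat: the canonical pipeline takes the log after the mel summation, and the log is not linear. The fused kernel has to take the log per bin, before the combined weights are applied, so it computes a mel-weighted sum of log-magnitudes rather than the log of a mel-weighted sum, and its output approximates the canonical MFCCs instead of reproducing them exactly.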
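
To make the fusion concrete, the two computations can be written side by side. The symbols are our own notation, not taken from any library: |X_k| is the magnitude of FFT bin k, W_{mk} the triangular mel weight of band m at bin k, D_{cm} the DCT-II basis, M the number of mel bands, and K the number of bins.

```latex
\begin{align}
c^{\mathrm{std}}_{c}   &= \sum_{m=0}^{M-1} D_{cm}\,
                          \log\!\Big(\sum_{k=0}^{K-1} W_{mk}\,|X_k|\Big), \\
c^{\mathrm{fused}}_{c} &= \sum_{k=0}^{K-1}\Big(\sum_{m=0}^{M-1} D_{cm}\,W_{mk}\Big)\log|X_k|
                        = \sum_{k=0}^{K-1} A_{ck}\,\log|X_k|,
                          \qquad A = D\,W .
\end{align}
```

Only the second form collapses into a single matrix A, because the log sits outside both sums' weights. The two expressions are not equal in general, since the log does not distribute over the inner sum of the first form.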

Fused Kernel Design

The fused approach precomputes a mel_dct_matrix of shape [num_coeffs, fft_bins] that combines mel-triangular weights and DCT-II basis functions. For 13 MFCCs, 512-point FFT, and 40 mel bands, this matrix is 13×257 floats (13.4KB, fits comfortably in L1). The kernel looks like:

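for (int c = 0; c < num_coeffs; ++c) {
  float acc = 0.0f;
  for (int k = 0; k < fft_bins; ++k) {
    /* magnitude of complex FFT bin k */
    float mag = sqrtf(fft_real[k] * fft_real[k] + fft_imag[k] * fft_imag[k]);
    acc += mel_dct_matrix[c][k] * logf(mag + epsilon);
  }
  mfcc[c] = acc;  /* coefficient c */
}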

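The inner loop is SIMD-friendly: a NEON fused multiply-add (vfmaq_f32) handles four float32 bins per instruction. The log cannot be folded into the precomputed matrix, because it applies to the magnitude rather than the weight. It does not depend on c, though, so it only needs to be evaluated once per bin and reused for every coefficient. The epsilon addition can be dropped from the hot path when magnitudes are guaranteed to be strictly positive.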

ARM NEON Implementation

On ARMv8-A, we use float32x4_t vectors. The FFT outputs interleaved real/imaginary pairs; we deinterleave with vld2q_f32, form r²+i² with vmulq_f32 for r² and vmlaq_f32 to add i², then take vsqrtq_f32. The log is approximated with a polynomial minimax fit (max error 0.0003 over [1e-6, 100]): a full logf costs about 12 cycles per element, while the polynomial costs about 3.

For the DCT accumulation, we unroll the outer loop by 4 and use vfmaq_laneq_f32 to broadcast each log-magnitude across four coefficient accumulators. This reduces loop overhead and keeps four independent accumulation chains in flight, which hides the multi-cycle latency of each dependent vfma. On Cortex-A76, the fused kernel processes 257 bins in 1,840 cycles (about 0.77 µs at 2.4GHz), versus 980 cycles for FFT + 1,120 for a separate DCT (2,100 total), which is roughly a 12% reduction by these counts.

Memory Layout and Prefetching

The mel-DCT matrix is stored row-major to maximize spatial locality during the inner loop. We insert __builtin_prefetch hints two cache lines ahead (128 bytes on A76, which has 64-byte lines) to hide DRAM latency. The FFT output is a contiguous array ordered by frequency bin, so access is sequential. Aligning the matrix to 64 bytes makes it start on a cache-line boundary; because a row of 257 floats (1,028 bytes) is not a multiple of 64, each row also needs padding to 272 floats (1,088 bytes) if every row is to start on a line boundary.
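
The layout described above might be set up as follows. This is a sketch with hypothetical helper names (padded_stride, alloc_aligned_matrix, row_dot), assuming a 64-byte cache line, C11 aligned_alloc, and a GCC or Clang compiler for __builtin_prefetch. The prefetch distance is illustrative and would need tuning on the target core.

```c
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Row stride in floats, rounded up so every row starts on a cache line. */
size_t padded_stride(size_t cols)
{
    size_t per_line = CACHE_LINE / sizeof(float);
    return (cols + per_line - 1) / per_line * per_line;
}

/* Allocates a zeroed rows x padded_stride(cols) matrix aligned to a cache line.
 * Returns NULL on failure; release with free(). */
float *alloc_aligned_matrix(size_t rows, size_t cols)
{
    /* bytes is a multiple of CACHE_LINE, as C11 aligned_alloc expects */
    size_t bytes = rows * padded_stride(cols) * sizeof(float);
    float *m = aligned_alloc(CACHE_LINE, bytes);
    if (m) memset(m, 0, bytes);
    return m;
}

/* Dot product of one matrix row with x, hinting two cache lines ahead. */
float row_dot(const float *row, const float *x, size_t cols)
{
    float acc = 0.0f;
    for (size_t k = 0; k < cols; ++k) {
#if defined(__GNUC__) || defined(__clang__)
        /* once per 16 floats (one line), prefetch 32 floats = 128 bytes ahead */
        if ((k & 15) == 0 && k + 32 < cols)
            __builtin_prefetch(row + k + 32);
#endif
        acc += row[k] * x[k];
    }
    return acc;
}
```

Row c of the matrix then starts at m + c * padded_stride(257), that is, every 272 floats. The padding costs 15 unused floats per row in exchange for line-aligned row starts.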

For apps processing multiple audio channels (stereo hearing aids, multi-mic beamforming), we batch the fused kernel across channels with OpenMP #pragma omp simd, which vectorizes the channel loop within a single thread (distributing channels across cores would additionally need #pragma omp parallel for). Two-channel MFCC extraction drops from 3.6ms to 0.92ms on a quad-core A76 cluster, which is critical when running alongside real-time WDRC (wide dynamic range compression) or noise reduction.

Numerical Stability

Fusing the log into the accumulation loop risks catastrophic cancellation: the combined weights are signed, so large log-magnitude terms of opposite sign can nearly cancel in float32. We clamp the mel-DCT matrix entries to [-1e6, 1e6] during precomputation and add a floor of 1e-10 to magnitudes before the log. This maintains 23-bit mantissa precision across the dynamic range of speech (20–80 dB SPL). Validation against reference librosa outputs shows mean absolute error