SIMD Convolution for On-Device STT: 4× Faster

The Convolution Bottleneck in Mobile Speech Recognition

Modern on-device speech-to-text engines (whether Whisper-tiny quantized for mobile or custom acoustic models) spend 40–60% of their feature extraction time performing 1D convolutions over audio frames. A typical preprocessing pipeline computes mel-spectrograms via short-time Fourier transform (STFT), then applies learned convolutional filters to extract phonetic features before feeding those features to an encoder-decoder or CTC model.

On a mid-range Android device processing 16kHz mono audio with a 25ms frame and 10ms stride, scalar convolution of a 512-sample window against 64 filters of length 9 takes roughly 180ms per second of audio. That is 18% of the real-time budget spent before the encoder and decoder have run, which leaves too little headroom for low-latency transcription without vectorization. The culprit: naive nested loops that fail to exploit the data-level parallelism that ARM and x86 SIMD instruction sets are built for.

Shipping KidzCare—a speech therapy app requiring sub-100ms latency for live phoneme feedback—forced a deep dive into vectorized DSP. The solution: rewriting the convolution kernel with ARM NEON intrinsics on Android and Apple Accelerate on iOS, achieving 4× throughput and enabling true real-time transcription on devices as old as iPhone X.

Scalar Convolution: The Baseline

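A 1D convolution slides a kernel h[k] of length K over an input signal x[n] of length N, computing output y[n] = Σ h[k] × x[n-k], summed over k = 0 … K-1, for each position. The naive C++ implementation below assumes x points K-1 samples into a zero-padded buffer, so that x[n - k] is a valid read even when n < k: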

for (int n = 0; n < N; ++n) {
  float sum = 0.0f;
  for (int k = 0; k < K; ++k) {
    sum += h[k] * x[n - k];
  }
  y[n] = sum;
}

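For 64 filters and 512 samples, this performs 64 × 512 × 9 = 294,912 multiply-accumulate operations per window. On a Snapdragon 865 running at 2.84GHz, scalar floating-point throughput is roughly 1.6 GFLOPS per core, and the scalar kernel takes ~180ms per second of audio. Each input sample contributes to up to K outputs, but the scalar loop reloads it every time, and compilers generally will not vectorize the floating-point reduction in the inner loop unless reassociation is allowed (for example with -ffast-math), because reordering float additions changes the result.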
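As a reference point for the vectorized versions, here is a minimal self-contained sketch of the scalar baseline. The names conv1d_scalar and mac_count are hypothetical, and the boundary rule (samples before the start of x count as zero) is an assumption rather than something the original kernel specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Causal 1D convolution: y[n] = sum_k h[k] * x[n - k].
// Assumption: x[m] is treated as 0 for m < 0 (implicit zero history).
std::vector<float> conv1d_scalar(const std::vector<float>& x,
                                 const std::vector<float>& h) {
    assert(!h.empty());
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {
        float sum = 0.0f;
        // Clamp the tap count so x[n - k] never reads before x[0].
        const std::size_t k_end = (n + 1 < h.size()) ? n + 1 : h.size();
        for (std::size_t k = 0; k < k_end; ++k) {
            sum += h[k] * x[n - k];
        }
        y[n] = sum;
    }
    return y;
}

// Multiply-accumulate count for one window: one MAC per (filter, sample, tap).
constexpr long long mac_count(long long filters, long long samples, long long taps) {
    return filters * samples * taps;
}
```

Calling mac_count(64, 512, 9) reproduces the 294,912 figure for one window; at a 10ms stride there are about 100 such windows per second of audio.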

NEON Vectorization: Four Samples at Once

ARM NEON provides 128-bit vector registers holding four 32-bit floats. The key insight: load four consecutive samples of x into a vector, broadcast each kernel coefficient to a vector, multiply, and accumulate. The kernel becomes:

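float32x4_t acc = vdupq_n_f32(0.0f);        // lanes accumulate y[n], y[n+1], y[n+2], y[n+3]
for (int k = 0; k < K; ++k) {
  float32x4_t xvec = vld1q_f32(&x[n - k]);  // loads x[n-k] .. x[n-k+3]
  float32x4_t hvec = vdupq_n_f32(h[k]);     // broadcast h[k] to all four lanes
  acc = vmlaq_f32(acc, xvec, hvec);         // acc += xvec * hvec, lane by lane
}
vst1q_f32(&y[n], acc);  // store four distinct outputs; the outer loop advances n by 4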

This processes four output samples per iteration of the outer loop. Throughput jumps to ~6.4 GFLOPS on the same core, cutting latency to ~45ms. Note that vmlaq_f32 is a multiply-accumulate that is not guaranteed to be fused; on ARMv8, vfmaq_f32 is the fused variant, which rounds once per operation instead of twice and so reduces rounding error.
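The lane layout is easy to get wrong, so the sketch below wraps the four-wide kernel in a function that can be checked against known outputs. The names conv_block4 and conv1d_blocked are hypothetical. On targets without NEON the code falls back to four scalar accumulators that stand in for the four lanes, and it assumes the signal length is a multiple of four and that history before the first sample is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Computes four consecutive outputs; x points at the first of the four samples.
// Requires K-1 readable samples before x and 3 readable samples after it.
inline void conv_block4(const float* x, const float* h, int K, float* y) {
#if defined(__ARM_NEON)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int k = 0; k < K; ++k) {
        acc = vmlaq_f32(acc, vld1q_f32(x - k), vdupq_n_f32(h[k]));
    }
    vst1q_f32(y, acc);  // one store, four different output samples
#else
    // Portable stand-in: each array slot plays the role of one vector lane.
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int k = 0; k < K; ++k) {
        for (int lane = 0; lane < 4; ++lane) {
            acc[lane] += h[k] * x[lane - k];
        }
    }
    for (int lane = 0; lane < 4; ++lane) {
        y[lane] = acc[lane];
    }
#endif
}

// Causal convolution of a signal whose length is a multiple of four.
std::vector<float> conv1d_blocked(const std::vector<float>& x,
                                  const std::vector<float>& h) {
    assert(!h.empty() && x.size() % 4 == 0);
    const int K = static_cast<int>(h.size());
    // Prepend K-1 zeros so that every x[n - k] read stays inside the buffer.
    std::vector<float> padded(static_cast<std::size_t>(K - 1), 0.0f);
    padded.insert(padded.end(), x.begin(), x.end());
    std::vector<float> y(x.size());
    for (std::size_t n = 0; n < x.size(); n += 4) {
        conv_block4(padded.data() + (K - 1) + n, h.data(), K, y.data() + n);
    }
    return y;
}
```

Each lane ends up holding a finished output sample, so no horizontal reduction is needed at the end of the tap loop.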

Handling Boundary Conditions

When N is not a multiple of four, fewer than four samples remain at the end of the signal. A tail loop can handle the last 0–3 samples with scalar code. Alternatively, pad x with zeros up to the next multiple of four. For 512 samples, padding adds negligible overhead and simplifies dispatch logic. In practice, zero-padding is often preferred because it removes the separate tail path and keeps the kernel a single loop with no special cases.
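The two boundary strategies reduce to a little index arithmetic, sketched here with hypothetical helper names (round_up_to_4, pad_to_multiple_of_4, split_for_tail). This is an illustration of the bookkeeping only, not a full kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Smallest multiple of four that is >= n.
constexpr std::size_t round_up_to_4(std::size_t n) {
    return (n + 3) & ~std::size_t{3};
}

// Padding strategy: extend the signal with zeros so the vector loop covers it all.
std::vector<float> pad_to_multiple_of_4(std::vector<float> x) {
    x.resize(round_up_to_4(x.size()), 0.0f);
    return x;
}

// Tail strategy: split N into a part handled four at a time and 0-3 leftovers.
struct Split {
    std::size_t vector_part;  // samples processed by the SIMD loop
    std::size_t tail;         // samples left for the scalar loop
};

constexpr Split split_for_tail(std::size_t n) {
    return Split{n & ~std::size_t{3}, n & std::size_t{3}};
}
```

For a 510-sample signal, padding runs the vector loop over 512 samples, while the tail strategy vectorizes 508 samples and finishes the last 2 in scalar code.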

Apple Accelerate: vDSP_conv on iOS

iOS provides the Accelerate framework, which wraps hand-tuned NEON assembly in a high-level API. The vDSP_conv function performs single-precision convolution with minimal overhead:

vDSP_conv(x, 1, h + K - 1, -1, y, 1, N, K);

The third and fourth arguments point at the last kernel coefficient and step backwards through it with a stride of -1, which selects convolution rather than correlation; the strides of 1 indicate contiguous memory. The input buffer must hold at least N + K - 1 samples. On an A14 Bionic, this achieves ~50ms for the same workload, comparable to hand-rolled NEON but with far less code to maintain. Accelerate also takes care of low-level tuning internally, making it the pragmatic choice for iOS-only codebases.
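To make the indexing concrete, here is a portable reference model of what the call computes, following the formula in Apple's documentation, C[n] = Σ A[n + p] × F[p] for p = 0 … P-1, with the strides applied. The function names conv_reference and convolve_like_vdsp are hypothetical, and this is a model for reasoning and testing, not Apple's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference model of the documented vDSP_conv formula with strides.
void conv_reference(const float* A, std::ptrdiff_t IA,
                    const float* F, std::ptrdiff_t IF,
                    float* C, std::ptrdiff_t IC,
                    std::size_t N, std::size_t P) {
    for (std::size_t n = 0; n < N; ++n) {
        float sum = 0.0f;
        for (std::size_t p = 0; p < P; ++p) {
            const std::ptrdiff_t a = static_cast<std::ptrdiff_t>(n + p) * IA;
            const std::ptrdiff_t f = static_cast<std::ptrdiff_t>(p) * IF;
            sum += A[a] * F[f];  // with IF = -1, F walks backwards from the last tap
        }
        C[static_cast<std::ptrdiff_t>(n) * IC] = sum;
    }
}

// Mirrors the call shape used in the article: kernel reversed via pointer + stride.
// x_padded must contain K-1 history samples followed by the N samples to filter.
std::vector<float> convolve_like_vdsp(const std::vector<float>& x_padded,
                                      const std::vector<float>& h,
                                      std::size_t N) {
    const std::size_t K = h.size();
    assert(K > 0 && x_padded.size() >= N + K - 1);
    std::vector<float> y(N);
    conv_reference(x_padded.data(), 1, h.data() + K - 1, -1, y.data(), 1, N, K);
    return y;
}
```

With K-1 leading zeros in x_padded, output n lines up with the causal definition y[n] = Σ h[k] × x[n-k]; without that padding, the outputs are shifted by K-1 samples relative to the input.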

Multi-Filter Parallelism

Speech models often apply 64–128 filters in parallel. Each filter is independent, so the outer loop over filters is trivially parallelizable. Using Grand Central Dispatch on iOS:

dispatch_apply(num_filters, dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0), ^(size_t i) {
  vDSP_conv(x, 1, filters[i] + K - 1, -1, outputs[i], 1, N, K);
});

On a 6-core A15 (two performance and four efficiency cores), this spreads the filters across all available cores, yielding ~12ms total latency for 64 filters. Each filter reads 2KB of input and writes 2KB of output, totaling 256KB per frame. At 50GB/s of DRAM bandwidth, that traffic takes only about 5µs per frame, so raw bandwidth is not the limit at this size; the remaining time goes to compute, dispatch overhead, and cache misses.
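The traffic figures can be checked with a few lines of arithmetic. This sketch uses hypothetical helper names and assumes float32 samples, decimal gigabytes, and that every filter re-reads the full input frame.

```cpp
#include <cassert>
#include <cstddef>

// Bytes moved per frame if each filter reads the whole input and writes a full output.
constexpr std::size_t bytes_per_frame(std::size_t filters, std::size_t samples) {
    return filters * (samples * sizeof(float)    // input read
                      + samples * sizeof(float)  // output written
                     );
}

// Time to move `bytes` at a sustained rate of `bytes_per_second`.
constexpr double transfer_seconds(double bytes, double bytes_per_second) {
    return bytes / bytes_per_second;
}
```

For 64 filters and 512 samples, bytes_per_frame gives 262,144 bytes (256KB), and transfer_seconds at 50e9 bytes per second works out to roughly 5.2 microseconds per frame.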

Cache-Aware Tiling for Large Models

When the combined working set of filter coefficients and outputs exceeds L2 cache capacity (typically 2–4MB on mobile SoCs), performance degrades due to cache thrashing. A tiled approach processes filters in blocks that fit in L2:

const int tile_size = 16; // chosen so one tile's working set fits in a 512KB L2 budget
for (int tile = 0; tile < num_filters; tile += tile_size) {
  const int tile_end = std::min(tile + tile_size, num_filters); // needs <algorithm>
  for (int i = tile; i < tile_end; ++i) {
    vDSP_conv(x, 1, filters[i] + K - 1, -1, outputs[i], 1, N, K);
  }
}

This keeps x hot in cache across the 16 filters of a tile, so the input is fetched from DRAM once per tile rather than once per filter, saving up to 15 of every 16 input fetches. On a Pixel 7 Pro with 3MB L2, tiling cuts latency from 65ms to 38ms for 128 filters. The optimal tile size depends on filter length and L2 size; empirical tuning is essential.
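Tiling only changes behavior when it reorders work, so the sketch below shows one possible loop nest for a batch of frames: filter tiles outermost, frames in the middle, filters within the tile innermost. The batch of frames, the names conv_valid and conv_tiled, and the plain scalar kernel are all assumptions for illustration; a production version would call the vectorized kernel instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Signal = std::vector<float>;

// Plain convolution over fully overlapping positions only:
// out[n] = sum_k h[k] * x[n + K - 1 - k], for n = 0 .. N - K.
Signal conv_valid(const Signal& x, const Signal& h) {
    assert(!h.empty() && x.size() >= h.size());
    const std::size_t K = h.size();
    Signal out(x.size() - K + 1, 0.0f);
    for (std::size_t n = 0; n < out.size(); ++n) {
        for (std::size_t k = 0; k < K; ++k) {
            out[n] += h[k] * x[n + K - 1 - k];
        }
    }
    return out;
}

// Returns out[frame][filter]. Within a tile, each frame is touched once and reused
// by every filter in the tile, and the tile's coefficients stay resident across frames.
std::vector<std::vector<Signal>> conv_tiled(const std::vector<Signal>& frames,
                                            const std::vector<Signal>& filters,
                                            std::size_t tile_size) {
    assert(tile_size > 0);
    std::vector<std::vector<Signal>> out(frames.size(),
                                         std::vector<Signal>(filters.size()));
    for (std::size_t tile = 0; tile < filters.size(); tile += tile_size) {
        const std::size_t tile_end = std::min(tile + tile_size, filters.size());
        for (std::size_t f = 0; f < frames.size(); ++f) {
            for (std::size_t i = tile; i < tile_end; ++i) {
                out[f][i] = conv_valid(frames[f], filters[i]);
            }
        }
    }
    return out;
}
```

The tile size affects only the traversal order, never the results, which makes it safe to tune empirically per device.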

Quantized Convolution: INT8 for 2× Speedup

For models trained with quantization-aware techniques, filters and activations are 8-bit integers. ARM NEON provides vmlal_s8 for 8×8→16-bit multiply-accumulate, processing eight samples per instruction. The kernel:

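int16x8_t acc = vdupq_n_s16(0);          // lanes accumulate y[n] .. y[n+7]
for (int k = 0; k < K; ++k) {
  int8x8_t xvec = vld1_s8(&x_q[n - k]);  // x_q[n-k] .. x_q[n-k+7]
  int8x8_t hvec = vdup_n_s8(h_q[k]);     // broadcast h_q[k]
  acc = vmlal_s8(acc, xvec, hvec);       // NOTE: 16-bit lanes can overflow for large |x|, |h|
}
// Widen to 32-bit, then rescale each lane: (acc * scale) >> shift (y assumed int32_t*)
const int32x4_t rshift = vdupq_n_s32(-shift);  // negative left shift == right shift
int32x4_t lo = vmulq_n_s32(vmovl_s16(vget_low_s16(acc)), scale);
int32x4_t hi = vmulq_n_s32(vmovl_s16(vget_high_s16(acc)), scale);
vst1q_s32(&y[n], vshlq_s32(lo, rshift));      // y[n]   .. y[n+3]
vst1q_s32(&y[n + 4], vshlq_s32(hi, rshift));  // y[n+4] .. y[n+7]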

This doubles throughput to ~12.8 GOPS on Snapdragon 8 Gen 2, cutting latency to ~23ms. Quantization introduces