Vectorized SIMD Convolution for Mobile CV Filters

Convolution sits at the heart of nearly every computer vision pipeline: edge detection, blurs, sharpening, custom feature extraction. On mobile, a naive spatial convolution over a 1080p frame with a 5×5 kernel can easily burn 40–50ms of wall-clock time, obliterating any hope of real-time processing. SIMD (Single Instruction, Multiple Data) code, whether hand-written ARM NEON intrinsics on Android or Apple's Accelerate vImage on iOS, can collapse that latency to under 7ms, but only if you structure the inner loop correctly and respect cache-line boundaries.

Why Scalar Convolution Fails on Mobile

A textbook 2D convolution visits every pixel, multiplies the surrounding window by the kernel, and accumulates the result. For a 1920×1080 image and a 5×5 kernel, that's about 52 million multiply-accumulate (MAC) operations, or roughly 104 million FLOPs. On a modern ARM Cortex-A76 core at 2.4 GHz, scalar floating-point throughput hovers around 4 FLOPs per cycle under ideal conditions, so you need ~26 million cycles, or about 11ms, even if every instruction were perfectly pipelined and cache-hot. In practice, cache misses, branch mispredictions, and memory bandwidth stalls push that to 35–50ms.

The culprit: each pixel read is a separate load, and the kernel coefficients live in a tiny array that gets re-read for every output pixel. Mobile ARM cores have 128-bit NEON units that process four floats per instruction, but scalar code uses only one of those four lanes.
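For reference, the scalar baseline described above can be sketched as follows. The function name conv2d_scalar and the "valid" output convention (no border handling, and no kernel flip, so strictly a correlation) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: "valid" 2D correlation of a planar float image with a
 * ksize x ksize kernel. The output is (width - ksize + 1) wide and
 * (height - ksize + 1) tall, so no border handling is needed here. */
static void conv2d_scalar(const float *src, size_t width, size_t height,
                          const float *kernel, size_t ksize, float *dst)
{
    assert(ksize >= 1 && width >= ksize && height >= ksize);
    const size_t out_w = width - ksize + 1;
    const size_t out_h = height - ksize + 1;

    for (size_t y = 0; y < out_h; ++y) {
        for (size_t x = 0; x < out_w; ++x) {
            float acc = 0.0f;
            /* One load and one MAC per tap: ksize * ksize of each per pixel. */
            for (size_t ky = 0; ky < ksize; ++ky)
                for (size_t kx = 0; kx < ksize; ++kx)
                    acc += src[(y + ky) * width + (x + kx)] *
                           kernel[ky * ksize + kx];
            dst[y * out_w + x] = acc;
        }
    }
}
```

Every iteration of the innermost loop handles exactly one tap of one output pixel, which is the pattern the vectorized versions replace.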

ARM NEON: Four-Wide Float32 Lanes

NEON provides float32x4_t registers that hold four single-precision floats. A multiply-accumulate becomes vmlaq_f32(acc, vec_a, vec_b), which computes acc + vec_a * vec_b across all four lanes. Note that vmlaq_f32 is not guaranteed to be fused; vfmaq_f32 is the fused variant, which rounds once and maps to a single FMLA instruction on AArch64. For a 5×5 kernel, you can unroll the horizontal pass: load four adjacent pixels into a NEON register, broadcast each kernel coefficient into another register, multiply, and accumulate into a running sum.

Here's the critical insight: instead of processing one pixel at a time, you process four horizontally adjacent pixels per iteration. A 1920-pixel row becomes 480 iterations of four-wide SIMD, plus a scalar tail for the remainder. The kernel coefficients are replicated across all four lanes using vdupq_n_f32(coeff), so every pixel in the vector gets multiplied by the same weight. This transforms 25 scalar multiplies per output pixel into ~7 NEON instructions—a 3.5× reduction in instruction count, and closer to 6× in wall-clock time once you account for memory bandwidth reuse.
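A minimal sketch of that inner loop, producing one output row, might look like this. The function name conv_row_x4 and its parameter layout are hypothetical; the NEON path is guarded by __ARM_NEON so the same file still builds (scalar only) on a non-ARM host.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Computes out_w outputs of one row of a "valid" ksize x ksize correlation.
 * src points at the first pixel of the top row of the window; stride is the
 * input row length in floats and must be at least out_w + ksize - 1. */
static void conv_row_x4(const float *src, size_t stride,
                        const float *kernel, size_t ksize,
                        float *dst, size_t out_w)
{
    assert(stride >= out_w + ksize - 1);
    size_t x = 0;
#if defined(__ARM_NEON)
    /* Four horizontally adjacent outputs per iteration. */
    for (; x + 4 <= out_w; x += 4) {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (size_t ky = 0; ky < ksize; ++ky) {
            const float *row = src + ky * stride + x;
            for (size_t kx = 0; kx < ksize; ++kx) {
                float32x4_t px = vld1q_f32(row + kx);          /* 4 pixels  */
                float32x4_t w  = vdupq_n_f32(kernel[ky * ksize + kx]);
                acc = vmlaq_f32(acc, px, w);                    /* acc += px*w */
            }
        }
        vst1q_f32(dst + x, acc);
    }
#endif
    /* Scalar tail (and the whole row when NEON is unavailable). */
    for (; x < out_w; ++x) {
        float acc = 0.0f;
        for (size_t ky = 0; ky < ksize; ++ky)
            for (size_t kx = 0; kx < ksize; ++kx)
                acc += src[ky * stride + x + kx] * kernel[ky * ksize + kx];
        dst[x] = acc;
    }
}
```

The loads at row + kx are deliberately unaligned: each kernel column shifts the four-pixel window by one float. A tuned version would hoist the coefficient broadcasts out of the x loop so they are not repeated for every group of four pixels.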

Register Pressure and Loop Unrolling

AArch64 NEON has 32 128-bit registers (V0–V31, also written Q0–Q31); 32-bit ARMv7 has only 16 (Q0–Q15). A naive implementation might load five rows of four pixels each (five registers), five kernel coefficients (five more), and one accumulator, for 11 registers in total. But you also need temporaries for intermediate products. Unrolling the kernel loop by two or four reduces loop overhead but increases register pressure. On a Snapdragon 8 Gen 2, I measured a 12% speedup by unrolling the 5×5 kernel into ten vmlaq_f32 calls per output pixel, at the cost of spilling two registers to the stack. Profiling with perf showed the spill cost was negligible compared to the saved branch mispredictions.

Apple Accelerate: vImage Convolution

Apple's Accelerate framework provides vImageConvolve_PlanarF, a heavily optimized convolution routine that runs on the CPU's vector units; GPU convolution is a separate path through Metal Performance Shaders. Apple does not document the implementation, so its per-cycle throughput can only be inferred from measurement. For a 5×5 kernel on an A16 Bionic, latency for 1080p drops to 4–6ms, and the API handles image borders through edge flags such as kvImageEdgeExtend (clamp), kvImageBackgroundColorFill (constant fill), and kvImageTruncateKernel, without manual branching.

The tradeoff: vImage is available only on Apple platforms, and its internals are opaque. You sacrifice portability and fine-grained control over memory layout. For cross-platform Flutter or React Native apps, you need a custom NEON kernel or a third-party library like OpenCV's cv::filter2D, which internally uses NEON but adds ~800KB to your binary.

Separable Kernels: Gaussian Blur in 2×5 Instead of 5×5

A Gaussian blur kernel is separable: a 5×5 matrix factors into the outer product of a 5×1 column vector and a 1×5 row vector. You convolve horizontally with the row vector, then vertically with the column vector: two passes of 5 multiplies each instead of one pass of 25. NEON shines here: the horizontal pass loads four pixels, multiplies by five broadcasted coefficients, and writes four outputs. The vertical pass reads from the transposed intermediate buffer, again four-wide. Net result: 2×5 = 10 four-wide NEON multiply-accumulates (40 scalar MACs) per four output pixels, versus 25 (100 scalar MACs) for the non-separable case. On a Pixel 7 Pro, separable Gaussian blur measured 9ms versus 23ms for the naive approach.
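The two-pass structure can be sketched as below. This is one possible arrangement, not necessarily the one measured above: it vectorizes along x in both passes, so the intermediate buffer is left untransposed. The names axpy_row and conv_separable are hypothetical, and the "valid" output convention is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* dst[i] += w * src[i] for n floats, four lanes at a time where NEON exists. */
static void axpy_row(float *dst, const float *src, float w, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t wv = vdupq_n_f32(w);
    for (; i + 4 <= n; i += 4)
        vst1q_f32(dst + i,
                  vmlaq_f32(vld1q_f32(dst + i), vld1q_f32(src + i), wv));
#endif
    for (; i < n; ++i)
        dst[i] += w * src[i];
}

/* Separable "valid" correlation. tmp must hold out_w * height floats and
 * dst must hold out_w * out_h floats, where out_w = width - ksize + 1 and
 * out_h = height - ksize + 1. */
static void conv_separable(const float *src, size_t width, size_t height,
                           const float *krow, const float *kcol, size_t ksize,
                           float *tmp, float *dst)
{
    assert(ksize >= 1 && width >= ksize && height >= ksize);
    const size_t out_w = width - ksize + 1;
    const size_t out_h = height - ksize + 1;

    /* Horizontal pass: each tap adds a shifted copy of the input row. */
    memset(tmp, 0, out_w * height * sizeof(float));
    for (size_t y = 0; y < height; ++y)
        for (size_t k = 0; k < ksize; ++k)
            axpy_row(tmp + y * out_w, src + y * width + k, krow[k], out_w);

    /* Vertical pass: each tap adds a whole row from ksize rows of tmp. */
    memset(dst, 0, out_w * out_h * sizeof(float));
    for (size_t y = 0; y < out_h; ++y)
        for (size_t k = 0; k < ksize; ++k)
            axpy_row(dst + y * out_w, tmp + (y + k) * out_w, kcol[k], out_w);
}
```

Accumulating through memory keeps the sketch short; a tuned kernel would hold the accumulator in a register across all five taps and store it once.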

Memory Layout: Planar vs Interleaved

Mobile camera APIs often deliver frames in NV21 (a Y plane followed by interleaved VU) or RGBA interleaved format. NEON convolution is most efficient on planar data (separate R, G, B planes) because you can load contiguous floats without stride. Converting RGBA to planar costs ~2ms for 1080p using vld4_u8 (deinterleave four channels), but you pay that only once per frame. If you're chaining multiple convolutions (edge detection → blur → sharpening), keeping intermediate buffers planar pays off immediately.
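As an illustration, deinterleaving with vld4_u8 might look like the following. The function name rgba_to_planar is hypothetical, and the sketch stops at 8-bit planes; widening to float is a separate step that is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Splits n_pixels of interleaved RGBA bytes into four separate planes. */
static void rgba_to_planar(const uint8_t *rgba, size_t n_pixels,
                           uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *a)
{
    assert(rgba && r && g && b && a);
    size_t i = 0;
#if defined(__ARM_NEON)
    /* vld4_u8 reads 32 bytes (8 pixels) and deinterleaves them so that
     * val[0] holds eight R bytes, val[1] eight G bytes, and so on. */
    for (; i + 8 <= n_pixels; i += 8) {
        uint8x8x4_t px = vld4_u8(rgba + 4 * i);
        vst1_u8(r + i, px.val[0]);
        vst1_u8(g + i, px.val[1]);
        vst1_u8(b + i, px.val[2]);
        vst1_u8(a + i, px.val[3]);
    }
#endif
    /* Scalar tail (and the whole buffer when NEON is unavailable). */
    for (; i < n_pixels; ++i) {
        r[i] = rgba[4 * i + 0];
        g[i] = rgba[4 * i + 1];
        b[i] = rgba[4 * i + 2];
        a[i] = rgba[4 * i + 3];
    }
}
```

The 128-bit variant vld4q_u8 handles 16 pixels per call and is usually the better choice on AArch64; the 64-bit form is used here only because it is the one named above.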

For applications like real-time document scanning or barcode detection, luminance (Y channel) is often sufficient. Dropping chroma and operating on a single plane cuts memory bandwidth by 3× and cache footprint by the same factor. In a recent OCR preprocessing pipeline, switching from RGB to Y-only convolution reduced frame latency from 18ms to 6ms, enabling 120fps on a mid-range Snapdragon 7 Gen 1.

Boundary Handling Without Branches

Edge pixels require special treatment: clamp, mirror, or zero-pad. A naive if (x < 0 || x >= width) inside the hot loop kills SIMD efficiency. Instead, pre-pad the input buffer by kernel_radius on all sides, filling with clamped or mirrored values. This shifts complexity to a one-time setup phase and keeps the inner loop branchless. For a 5×5 kernel, you pad two pixels on each edge, increasing the working buffer from 1920×1080 to 1924×1084. That is about 0.6% memory overhead, trivial compared to the 5× speedup from eliminating branches.
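A clamp-to-edge version of that pre-padding step could be sketched like this. The function name pad_clamp is hypothetical, and the sketch assumes tightly packed rows in both buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies a width x height float plane into a buffer that is
 * (width + 2*radius) x (height + 2*radius), replicating edge pixels
 * into the border. dst must be large enough for the padded plane. */
static void pad_clamp(const float *src, size_t width, size_t height,
                      size_t radius, float *dst)
{
    assert(width >= 1 && height >= 1);
    const size_t pw = width + 2 * radius;
    const size_t ph = height + 2 * radius;

    for (size_t y = 0; y < ph; ++y) {
        /* Clamp the source row index to [0, height - 1]. */
        size_t sy;
        if (y < radius)
            sy = 0;
        else if (y - radius >= height)
            sy = height - 1;
        else
            sy = y - radius;

        const float *srow = src + sy * width;
        float *drow = dst + y * pw;

        for (size_t x = 0; x < radius; ++x)
            drow[x] = srow[0];                          /* left border  */
        memcpy(drow + radius, srow, width * sizeof(float));
        for (size_t x = 0; x < radius; ++x)
            drow[radius + width + x] = srow[width - 1]; /* right border */
    }
}
```

With the border filled in once, a "valid" convolution over the padded buffer yields an output the same size as the original image, and the hot loop never tests coordinates.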

Cache-Line Alignment

ARM CPUs fetch memory in 64-byte cache lines. Misaligned NEON loads can straddle two cache lines, doubling latency. Aligning each row's start address to 64 bytes (via posix_memalign or malloc with manual offset) ensures that a vld1q_f32 at any 16-byte-multiple offset within the row stays inside a single cache line; loads at the shifted kernel offsets (x+1, x+2, and so on) can still cross one. On a Dimensity 9200, aligned loads measured 1.2ns versus 2.8ns for misaligned, enough to shave 1–2ms off a full-frame convolution. The cost: up to 63 bytes of padding per row, or at most ~68KB across 1080 rows, which is acceptable on modern devices with 6–12GB RAM.
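One way to get 64-byte-aligned rows is to round the row stride up to a multiple of 64 bytes and allocate the whole plane with posix_memalign. The helper names below are hypothetical, and the 64-byte line size is taken from the text rather than queried from the device.

```c
#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Row length in bytes, rounded up to the next multiple of 64. */
static size_t aligned_stride_bytes(size_t width_floats)
{
    return (width_floats * sizeof(float) + 63) & ~(size_t)63;
}

/* Allocates a float plane whose base address and every row start are
 * 64-byte aligned. Writes the row stride (in floats) to *stride_floats.
 * Returns NULL on failure; release the buffer with free(). */
static float *alloc_aligned_plane(size_t width, size_t height,
                                  size_t *stride_floats)
{
    assert(stride_floats != NULL);
    const size_t stride_bytes = aligned_stride_bytes(width);
    void *p = NULL;
    if (posix_memalign(&p, 64, stride_bytes * height) != 0)
        return NULL;
    *stride_floats = stride_bytes / sizeof(float);
    return (float *)p;
}
```

For the padded 1924-float row used earlier, this gives a stride of 7744 bytes (1936 floats), which is 48 bytes of padding per row. An unpadded 1920-float row is already 7680 bytes, an exact multiple of 64, so it needs none.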

Real-World Integration: Flutter Plugin Architecture

Shipping SIMD convolution in a Flutter app requires a platform channel or FFI. For maximum performance, use dart:ffi to call a native C library compiled with -march=armv8-a+simd. Pass pixel data as a Uint8List backed by native memory (via malloc and Pointer), avoiding Dart's garbage-collected heap. On a Galaxy S23, FFI overhead measured ~50μs per call—negligible compared to the 6ms convolution itself.

For iOS, wrap the vImage call in a Swift method and expose it via a method channel. This adds ~200μs of serialization overhead, but vImage's 4ms runtime still dominates. In a recent barcode scanner project, the end-to-end pipeline (camera frame → NEON edge detection → Dart UI update) ran at 90fps on an iPhone 13, with convolution consuming just 6ms of the 11ms budget.

Benchmarking and Profiling

Always measure on-device with realistic workloads. Android Studio's CPU Profiler and Xcode Instruments reveal hotspots, but for SIMD code, perf stat on a rooted Android device reads the hardware performance counters directly: cycles, instructions per cycle, cache miss rates and, where the PMU exposes them, SIMD instruction counts. I routinely see scalar convolution at 0.8 IPC (instructions per cycle) and NEON at 2.2 IPC. That is a 2.75× gain in IPC on top of each NEON instruction doing four lanes of work, and finishing sooner lets the core drop back to idle earlier, which helps battery life.
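Alongside the profilers, a small in-process timing harness is often enough for A/B comparisons between kernel variants. The sketch below is an assumption about how one might structure it, not the harness behind the numbers in this article; median_runtime_ms and count_call are hypothetical names, and count_call is only a stand-in workload.

```c
#define _POSIX_C_SOURCE 200112L  /* for clock_gettime */
#include <assert.h>
#include <stdlib.h>
#include <time.h>

static double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

static int cmp_double(const void *a, const void *b)
{
    const double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Runs fn(ctx) warmup times untimed, then runs times timed, and returns
 * the median in milliseconds. samples must have room for runs doubles. */
static double median_runtime_ms(void (*fn)(void *), void *ctx,
                                int warmup, int runs, double *samples)
{
    assert(fn != NULL && runs > 0 && samples != NULL);
    for (int i = 0; i < warmup; ++i)
        fn(ctx);
    for (int i = 0; i < runs; ++i) {
        const double t0 = now_ms();
        fn(ctx);
        samples[i] = now_ms() - t0;
    }
    qsort(samples, (size_t)runs, sizeof(double), cmp_double);
    return samples[runs / 2];
}

/* Stand-in workload: counts how often it was called. */
static void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

The median is less sensitive than the mean to a stray scheduler hiccup. Because sorting discards the order of the samples, log them unsorted as well if you want to see timings drift as the device heats up.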

Thermal throttling matters: a 6ms convolution at full clock can balloon to 12ms after 30 seconds of sustained load as the SoC throttles from 2.8 GHz to 1.6 GHz. Design your pipeline to stay under thermal limits, or batch-process frames during idle periods.

When Not to Vectorize

SIMD wins for large, regular convolutions but loses for tiny kernels (3×3) or irregular access patterns (sparse convolutions, dilated kernels). A 3×3 kernel has only 9 multiplies per pixel; the overhead of loading and shuffling NEON registers can exceed the scalar cost. Similarly, if your kernel has many zero coefficients (e.g., Sobel), a sparse multiply-accumulate loop with explicit zero-checks often beats blind SIMD. Profile first, optimize second.
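One reading of the sparse loop is sketched below. It is a variation on the idea rather than a literal transcription: the zero check runs once at setup, so the hot loop only ever visits non-zero taps. The type tap_t and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One non-zero kernel coefficient and its offset inside the window. */
typedef struct {
    size_t dy, dx;
    float w;
} tap_t;

/* Collects the non-zero taps of a ksize x ksize kernel into taps (which
 * must have room for ksize * ksize entries) and returns how many exist. */
static size_t build_taps(const float *kernel, size_t ksize, tap_t *taps)
{
    size_t n = 0;
    for (size_t ky = 0; ky < ksize; ++ky)
        for (size_t kx = 0; kx < ksize; ++kx)
            if (kernel[ky * ksize + kx] != 0.0f) {
                taps[n].dy = ky;
                taps[n].dx = kx;
                taps[n].w = kernel[ky * ksize + kx];
                ++n;
            }
    return n;
}

/* "Valid" correlation that touches only the taps collected above. */
static void conv2d_sparse(const float *src, size_t width, size_t height,
                          const tap_t *taps, size_t n_taps, size_t ksize,
                          float *dst)
{
    assert(ksize >= 1 && width >= ksize && height >= ksize);
    const size_t out_w = width - ksize + 1;
    const size_t out_h = height - ksize + 1;
    for (size_t y = 0; y < out_h; ++y)
        for (size_t x = 0; x < out_w; ++x) {
            float acc = 0.0f;
            for (size_t t = 0; t < n_taps; ++t)
                acc += src[(y + taps[t].dy) * width + x + taps[t].dx] *
                       taps[t].w;
            dst[y * out_w + x] = acc;
        }
}
```

For a 3×3 Sobel kernel this drops three of the nine taps, leaving six multiply-accumulates per pixel.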

For applications requiring sub-5ms latency (AR filters, real-time video effects), consider offloading to the GPU via Metal or Vulkan compute shaders. A 5×5 convolution in Metal can hit 2ms on an A16 GPU, but the CPU↔GPU transfer adds 1–2ms. SIMD on the CPU avoids that roundtrip and keeps data in the same memory domain as downstream processing.