Vectorized PPG Signal Processing: NEON vs Metal

Photoplethysmography (PPG) signals captured from smartphone cameras require aggressive preprocessing before clinical features can be extracted. At 60 frames per second, a typical pipeline must perform DC removal, bandpass filtering, motion artifact suppression, and peak detection, all within the roughly 16.7ms frame interval to avoid frame drops. This article examines two vectorization approaches on iOS: ARM NEON intrinsics in C++ and Metal compute shaders, with real performance data from a production glucose monitoring application.

The PPG Preprocessing Pipeline

A raw PPG signal from the camera sensor arrives as a single-channel intensity stream. The preprocessing chain typically includes:

DC offset removal: High-pass IIR filter (0.5Hz cutoff) to eliminate baseline drift
Bandpass filtering: Butterworth 0.5-5Hz to isolate cardiac frequency band
Motion artifact suppression: Adaptive filtering using accelerometer reference
Peak detection: First derivative zero-crossing with amplitude thresholding
Feature extraction: Inter-beat interval, pulse amplitude, waveform morphology
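
The peak detection step can be sketched in a few lines of scalar C++. This is a minimal illustration rather than the production detector: the function name and the min_amplitude parameter are assumptions, and a real pipeline would typically add a refractory period between beats.

```cpp
#include <cstddef>
#include <vector>

// Returns indices of local maxima: samples where the first difference changes
// from positive to non-positive and the value reaches `min_amplitude`.
// The function name and threshold parameter are illustrative assumptions.
std::vector<std::size_t> detect_peaks(const std::vector<float>& signal,
                                      float min_amplitude) {
  std::vector<std::size_t> peaks;
  for (std::size_t i = 1; i + 1 < signal.size(); ++i) {
    const float rising = signal[i] - signal[i - 1];   // slope into sample i
    const float falling = signal[i + 1] - signal[i];  // slope out of sample i
    if (rising > 0.0f && falling <= 0.0f && signal[i] >= min_amplitude) {
      peaks.push_back(i);
    }
  }
  return peaks;
}
```

Inter-beat intervals then follow from the differences between consecutive peak indices divided by the sample rate.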

On an iPhone 13 Pro, unoptimized scalar code processes this pipeline at approximately 180 samples/second, far below the 3600 samples/second needed for real-time 60Hz operation when a 60-sample analysis window is reprocessed on every frame (60 × 60). Vectorization is mandatory, not optional.

ARM NEON Implementation

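NEON provides 128-bit registers that can hold four 32-bit floats. The DC removal filter is a first-order IIR with difference equation y[n] = x[n] - x[n-1] + 0.995*y[n-1]. Because each output depends on the previous one, the recurrence cannot be spread across four consecutive samples of one stream. It does vectorize lane-wise when each lane carries an independent channel, with one time step computed per call:

#include <arm_neon.h>

// One time step of the DC-removal filter for four independent channels.
// Lane i holds channel i's current input, previous input, and previous output.
float32x4_t dc_remove_neon(float32x4_t input, float32x4_t prev_input, float32x4_t prev_output) {
  float32x4_t diff = vsubq_f32(input, prev_input);        // x[n] - x[n-1]
  float32x4_t scaled = vmulq_n_f32(prev_output, 0.995f);  // 0.995 * y[n-1]
  return vaddq_f32(diff, scaled);                         // y[n]
}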
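
For a single stream, the recurrence has to be evaluated in time order, with the previous input and output carried from one block to the next. The following scalar sketch shows that state handling and can serve as a reference when validating a vectorized version. The struct and function names are illustrative, and the cutoff helper relies on the approximation f_c ≈ (1 - r) * fs / (2π), which holds only for r close to 1.

```cpp
#include <cstddef>

// Filter memory carried between blocks (names are illustrative).
struct DcBlockerState {
  float prev_x = 0.0f;
  float prev_y = 0.0f;
};

// Scalar reference: y[n] = x[n] - x[n-1] + r * y[n-1], applied in time order.
void dc_remove_scalar(const float* in, float* out, std::size_t count,
                      float r, DcBlockerState& state) {
  for (std::size_t n = 0; n < count; ++n) {
    const float y = in[n] - state.prev_x + r * state.prev_y;
    state.prev_x = in[n];
    state.prev_y = y;
    out[n] = y;
  }
}

// Approximate -3 dB cutoff in Hz; valid only for r close to 1.
float dc_blocker_cutoff_hz(float r, float sample_rate_hz) {
  const float two_pi = 6.2831853f;
  return (1.0f - r) * sample_rate_hz / two_pi;
}
```

Under this approximation, r = 0.995 at 60Hz gives a cutoff near 0.05Hz, about a tenth of the 0.5Hz listed for the DC removal stage. A coefficient closer to 0.95 would match 0.5Hz, so either figure may be nominal.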

Processing 256-sample blocks with NEON intrinsics achieved 8200 samples/second on A15 Bionic—a 45× improvement over scalar code. However, this approach has significant drawbacks:

State management complexity: IIR filters maintain per-channel state that must be carefully threaded through vectorized loops
Edge case handling: Block boundaries require scalar fallback for samples that don't align to 4-wide vectors
Register pressure: Bandpass filtering (second-order sections) consumes 8-12 registers, causing spills on complex pipelines
Portability: NEON intrinsics lock you to ARM; porting to x86 requires SSE/AVX reimplementation
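
The block-boundary issue can be illustrated with the feed-forward part of the DC filter, the first difference d[n] = x[n] - x[n-1], which has no recurrence and therefore does vectorize across time. The sketch below is portable C++ in which the inner four-lane loop stands in for NEON loads, vsubq_f32, and stores. It is meant to show the main-body and scalar-tail structure, not to be a tuned implementation, and the function name is an assumption.

```cpp
#include <cstddef>

// First difference d[n] = x[n] - x[n-1], four samples per step with a scalar
// loop for the remainder. `prev_x` is the last sample of the previous block.
// The 4-lane inner loop is plain C++ standing in for NEON intrinsics.
void first_difference_blocked(const float* x, float* d, std::size_t count,
                              float prev_x) {
  std::size_t n = 0;
  // Main body: whole groups of four samples.
  for (; n + 4 <= count; n += 4) {
    const float shifted[4] = {prev_x, x[n], x[n + 1], x[n + 2]};  // x[n-1..n+2]
    for (int lane = 0; lane < 4; ++lane) {
      d[n + lane] = x[n + lane] - shifted[lane];
    }
    prev_x = x[n + 3];
  }
  // Scalar tail: the 0 to 3 samples that do not fill a vector.
  for (; n < count; ++n) {
    d[n] = x[n] - prev_x;
    prev_x = x[n];
  }
}
```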

The real challenge emerges with adaptive filtering. Motion artifact suppression correlates accelerometer data with PPG signal using a normalized least mean squares (NLMS) algorithm. The weight update—w[n+1] = w[n] + μ * e[n] * x[n] / ||x[n]||²—requires vector normalization and element-wise multiplication that straddles multiple NEON instructions, making the inner loop difficult to optimize without assembly.
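
A scalar reference for the NLMS update makes the data flow explicit and gives an optimized version something to be checked against. This is a sketch under assumptions: the tap count, step size mu, and regularization term eps are illustrative parameters, and the error signal e[n] is taken as the artifact-reduced PPG output, as is usual in adaptive noise cancellation.

```cpp
#include <cstddef>
#include <vector>

// Scalar NLMS reference. `reference` is the accelerometer channel and
// `primary` the PPG samples. Returns the error signal e[n]; the first
// taps - 1 samples are passed through unfiltered.
// Tap count, mu, and eps are illustrative parameters.
std::vector<float> nlms_scalar(const std::vector<float>& primary,
                               const std::vector<float>& reference,
                               std::size_t taps, float mu, float eps) {
  std::vector<float> error(primary);
  if (taps == 0 || reference.size() != primary.size()) return error;
  std::vector<float> w(taps, 0.0f);
  for (std::size_t n = taps - 1; n < primary.size(); ++n) {
    float estimate = 0.0f;  // w . x
    float norm = 0.0f;      // ||x||^2
    for (std::size_t k = 0; k < taps; ++k) {
      const float xk = reference[n - k];
      estimate += w[k] * xk;
      norm += xk * xk;
    }
    const float e = primary[n] - estimate;
    error[n] = e;
    // w[n+1] = w[n] + mu * e[n] * x[n] / (||x[n]||^2 + eps)
    const float step = mu * e / (norm + eps);
    for (std::size_t k = 0; k < taps; ++k) {
      w[k] += step * reference[n - k];
    }
  }
  return error;
}
```

The two inner loops over the taps are the parts that vectorize. The outer loop over n stays sequential because each weight update feeds the next sample's estimate.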

Metal Compute Shader Approach

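Metal compute shaders offer a different tradeoff. The entire preprocessing pipeline compiles to GPU-resident code executed by hundreds of parallel threads. Because the filter state links consecutive samples, threads cannot each take a different sample of the same stream. As written, the kernel below assigns one thread per independent stream and advances every stream by one sample per dispatch:

#include <metal_stdlib>
using namespace metal;

// Per-stream filter memory: previous input and previous output.
struct FilterState {
  float prev_x;
  float prev_y;
};

kernel void preprocess_ppg(
  device const float* raw [[buffer(0)]],      // newest sample of each stream
  device float* filtered [[buffer(1)]],
  device FilterState* state [[buffer(2)]],
  uint gid [[thread_position_in_grid]]        // stream index
) {
  float x = raw[gid];
  float y = x - state[gid].prev_x + 0.995f * state[gid].prev_y;
  filtered[gid] = y;
  state[gid].prev_x = x;
  state[gid].prev_y = y;
}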

Metal's advantage is implicit SIMD: the GPU automatically packs threads into 32-wide SIMD groups, achieving near-optimal occupancy without manual vectorization. Measured throughput on iPhone 13 Pro: 24,000 samples/second, roughly 3× the NEON figure.

The adaptive filtering kernel benefits even more dramatically. The reductions in the NLMS algorithm (the squared norm and the filter dot product) map naturally to Metal's thread hierarchy:

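#include <metal_stdlib>
using namespace metal;

// Sketch of NLMS on one threadgroup: thread `lid` owns filter tap `lid`, and
// samples are processed in time order because each update feeds the next.
// Assumes a single threadgroup whose size equals the tap count (a power of
// two) and 2 * taps floats of threadgroup memory bound at index 0.
kernel void nlms_filter(
  device const float* ppg [[buffer(0)]],
  device const float* accel [[buffer(1)]],
  device float* output [[buffer(2)]],
  device float* weights [[buffer(3)]],
  constant uint& count [[buffer(4)]],   // number of samples (added parameter)
  threadgroup float* shared [[threadgroup(0)]],
  uint lid [[thread_position_in_threadgroup]],
  uint taps [[threads_per_threadgroup]]
) {
  float w = weights[lid];
  for (uint n = taps - 1; n < count; ++n) {
    float x = accel[n - lid];
    shared[lid] = x * x;          // partial ||x||^2
    shared[taps + lid] = w * x;   // partial w . x
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Parallel reduction of both sums in shared memory
    for (uint stride = taps / 2; stride > 0; stride >>= 1) {
      if (lid < stride) {
        shared[lid] += shared[lid + stride];
        shared[taps + lid] += shared[taps + lid + stride];
      }
      threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    float norm = shared[0];
    float error = ppg[n] - shared[taps];
    threadgroup_barrier(mem_flags::mem_threadgroup);  // reads done before next write
    w += 0.01f * error * x / (norm + 1e-6f);
    if (lid == 0) output[n] = error;  // artifact-reduced sample
  }
  weights[lid] = w;
}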

This achieves 52,000 samples/second, over 6× the NEON figure, by exploiting threadgroup shared memory for efficient reductions of the norm and the filter dot product.
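
The stride-halving reduction used in the kernel can be checked on the CPU before it is moved to the GPU. In the sketch below, each pass of the outer loop corresponds to the work done between two threadgroup barriers, and the vector length plays the role of the threadgroup size. The helper name is an assumption. The point it illustrates is that the starting stride must be half the threadgroup size, which in turn must be a power of two.

```cpp
#include <cstddef>
#include <vector>

// CPU emulation of a threadgroup tree reduction. `values.size()` stands in
// for the threadgroup size and is assumed to be a power of two.
float tree_reduce_sum(std::vector<float> values) {
  for (std::size_t stride = values.size() / 2; stride > 0; stride >>= 1) {
    // On the GPU, every "thread" with lid < stride runs this in parallel.
    for (std::size_t lid = 0; lid < stride; ++lid) {
      values[lid] += values[lid + stride];
    }
    // A threadgroup barrier would sit here.
  }
  return values.empty() ? 0.0f : values[0];
}
```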

Tradeoffs and Decision Criteria

Metal's performance advantage comes with costs:

Latency: Command buffer submission and GPU scheduling add 1.2-1.8ms overhead per dispatch. For sub-millisecond processing, this dominates total time.
Power consumption: GPU wake-up from idle state draws 80-120mW for 10-15ms. Continuous processing mitigates this, but bursty workloads suffer.
Thermal envelope: Sustained GPU usage contributes to SoC thermals. On iPhone 12, Metal caused throttling after 8.5 minutes of continuous processing versus 12 minutes for NEON, so a 10-minute PPG session throttles under Metal but not under NEON.
Debugging complexity: Metal shader debugging requires Xcode GPU frame capture. NEON code debugs with standard LLDB.
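
The effect of the fixed dispatch overhead on small batches can be estimated with simple arithmetic. The helper below is a back-of-the-envelope model, not a measurement. It assumes the throughput figures quoted earlier are kernel-only rates that exclude submission overhead, which the source does not state.

```cpp
// Effective throughput in samples/second when every dispatch pays a fixed
// overhead. `kernel_rate` is the kernel-only rate in samples/second; treating
// the quoted throughput figures as kernel-only is an assumption.
double effective_throughput(double batch_samples, double kernel_rate,
                            double overhead_seconds) {
  const double total_seconds = overhead_seconds + batch_samples / kernel_rate;
  return batch_samples / total_seconds;
}
```

With 1.5ms of overhead and a 52,000 samples/second kernel, a 300-sample batch works out to roughly 41,000 samples/second under this model, while a one-sample dispatch falls below 700 samples/second. This is why per-frame GPU dispatch is unattractive for the low-latency stages.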

NEON remains superior for:

Ultra-low latency requirements (<5ms end-to-end)
Battery-constrained scenarios (background processing)
Simple filter chains without complex data dependencies
Cross-platform codebases targeting Android (via NEON) and iOS

Metal wins for:

High-throughput batch processing (>10k samples/sec)
Complex algorithms with extensive parallelism (NLMS, wavelet transforms)
Sustained foreground processing where GPU remains active
iOS-exclusive applications willing to optimize per-platform

Hybrid Architecture in Production

The glucose monitoring system Omar Abu Sharifa shipped uses a hybrid approach: NEON for real-time DC removal and bandpass filtering (executing in the camera callback thread with <2ms latency), then Metal for batch motion artifact suppression and feature extraction on 5-second windows. This achieves 99.7% frame capture rate while maintaining <15% average CPU utilization and <8% GPU utilization.
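
The hand-off between the per-frame path and the batch path can be sketched as a window accumulator. The class below is hypothetical and not taken from the shipped system. It assumes one filtered sample per frame, so a 5-second window at 60 frames per second holds 300 samples, and it omits the thread-safe hand-off a real camera callback would need.

```cpp
#include <cstddef>
#include <vector>

// Collects per-frame samples from the real-time path and reports when a full
// window is ready for batch processing. Class and method names are
// hypothetical; thread synchronization is deliberately left out.
class WindowAccumulator {
 public:
  explicit WindowAccumulator(std::size_t window_samples)
      : window_samples_(window_samples) {
    buffer_.reserve(window_samples);
  }

  // Appends one filtered sample. Returns true when `window` has been filled
  // with a complete window, after which accumulation restarts from empty.
  bool push(float sample, std::vector<float>& window) {
    buffer_.push_back(sample);
    if (buffer_.size() < window_samples_) return false;
    window.swap(buffer_);  // hand the full window to the caller
    buffer_.clear();
    buffer_.reserve(window_samples_);
    return true;
  }

 private:
  std::size_t window_samples_;
  std::vector<float> buffer_;
};
```

The caller would run the NEON stages on each frame, push the result, and submit the returned window to the batch stage whenever push() returns true.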

Peak detection—the final stage—runs on CPU using NEON-accelerated derivative computation but scalar logic for threshold comparison. This hybrid stage processes at 18,000 samples/second, limited by branch mispredictions in the threshold logic rather than arithmetic throughput.

Implementation Notes

When implementing NEON pipelines, preallocate filter state buffers aligned to 16-byte boundaries using posix_memalign(). Unaligned loads incur 20-30% penalties on A-series chips. For Metal, use MTLResourceStorageModeShared on unified memory architectures to avoid unnecessary copies between CPU and GPU address spaces.
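
A minimal sketch of the aligned allocation follows. The helper name is illustrative. posix_memalign() returns 0 on success, and the buffer is released with free().

```cpp
#include <cstddef>
#include <cstdlib>   // free
#include <stdlib.h>  // posix_memalign (POSIX)

// Allocates `count` floats on a 16-byte boundary; returns nullptr on failure.
// The caller releases the buffer with free(). The helper name is illustrative.
float* alloc_aligned_floats(std::size_t count) {
  void* ptr = nullptr;
  if (posix_memalign(&ptr, 16, count * sizeof(float)) != 0) {
    return nullptr;
  }
  return static_cast<float*>(ptr);
}
```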

Profile with Instruments' Metal System Trace and CPU Profiler simultaneously. GPU-bound workloads show high occupancy (>80%) in the Metal trace; CPU-bound workloads show stalls in command buffer encoding. Knowing which side is the bottleneck guides optimization priorities.

For production deployment, implement fallback paths: MTLCreateSystemDefaultDevice() can return nil where Metal is unsupported, such as on older devices, and Low Power Mode reduces GPU performance. The NEON path serves as both optimization and compatibility layer.