Shader-Based PPG Filtering: GPU DSP at 240fps

Photoplethysmography (PPG) signal extraction from smartphone cameras demands low-latency filtering pipelines that can keep pace with 120–240fps capture rates while preserving battery life. Traditional CPU-based DSP approaches—even with SIMD vectorization—struggle to maintain this cadence without thermal throttling or draining battery within minutes. The solution lies in offloading the entire filter chain to the GPU via compute shaders, exploiting parallel execution across thousands of threads and leveraging zero-copy texture memory.

This article dissects a production Metal-based PPG filtering architecture deployed in a mobile glucose monitoring app, detailing shader design, memory layout, numerical stability considerations, and the power-performance tradeoffs that emerge when you treat signal processing as a massively parallel graphics problem.

Why GPU for PPG Signal Processing

PPG extraction involves capturing video frames from the camera, isolating the red or green channel, computing spatial averages across regions of interest (fingertip, forehead), then applying a cascade of filters: DC removal, bandpass (0.5–4Hz for heart rate), notch filters for powerline interference, and optional adaptive smoothing. At 240fps, this means processing 240 frames per second—each requiring multiple filtering stages—while the UI remains responsive and the device stays cool.

CPU implementations hit three bottlenecks: memory bandwidth (copying frame buffers), thermal limits (sustained high clock speeds), and thread contention (competing with UI and OS tasks). GPUs solve all three: frame data already resides in GPU texture memory, thousands of shader cores distribute the workload, and the entire pipeline executes in a dedicated command queue isolated from the main thread.

In benchmarks on an iPhone 14 Pro, a Metal compute shader pipeline processed 240fps PPG at 3.2ms per frame (including texture upload and readback) with 1.8W power draw. The equivalent CPU SIMD path took 8.1ms at 3.4W, thermal-throttling after 90 seconds. The GPU approach sustained full throughput indefinitely at 65°C case temperature.

Shader Architecture: Threadgroup Memory as Circular Buffer

The core challenge is state management. IIR filters (Butterworth, Chebyshev) require previous input and output samples—typically stored in a circular buffer. GPUs lack persistent per-frame state, so we encode filter state in a dedicated texture that persists across compute dispatches.

Each compute shader dispatch processes one frame. The shader reads the current frame's spatial average (pre-computed by a separate reduction shader), fetches the last N samples from a 1D state texture, applies the filter coefficients, writes the output, and updates the state texture by shifting samples. This mimics a CPU ring buffer but with texture atomics ensuring coherence.

kernel void ppg_bandpass(
    texture2d<float, access::read> inFrame [[texture(0)]],
    texture1d<float, access::read_write> state [[texture(1)]],
    device float* output [[buffer(0)]],
    uint2 gid [[thread_position_in_grid]])
{
    // Spatial average (simplified)
    float sample = inFrame.read(gid).r;
    
    // Fetch last 4 samples (2nd-order IIR)
    float x1 = state.read(0).r;
    float x2 = state.read(1).r;
    float y1 = state.read(2).r;
    float y2 = state.read(3).r;
    
    // Butterworth coefficients (0.5–4Hz @ 240fps)
    float b0 = 0.0012, b1 = 0.0024, b2 = 0.0012;
    float a1 = -1.9556, a2 = 0.9604;
    
    float y0 = b0*sample + b1*x1 + b2*x2 - a1*y1 - a2*y2;
    
    // Shift state
    state.write(float4(sample, x1, y0, y1), 0);
    output[0] = y0;
}

This design trades texture memory (negligible—16 floats per filter stage) for parallelism. Each frame's processing is independent once state is loaded, allowing the GPU to pipeline multiple frames in flight.

Numerical Stability: Half-Precision Pitfalls

Mobile GPUs favor 16-bit half-precision floats for power efficiency. PPG signals have large DC offsets (raw pixel values 0–255) but tiny AC components (±0.5 after normalization). Half-precision's 10-bit mantissa causes catastrophic cancellation when subtracting nearly equal values in DC removal stages.

The fix: segregate precision tiers. Use half for spatial averaging (error ±0.01 acceptable), float for IIR state and coefficients (error must be