Composable Audio Graphs: DSP Pipeline Design

The Problem with Hardcoded DSP Chains

Most mobile audio applications wire DSP stages together at compile time: a highpass filter feeds a compressor, which feeds a limiter, output done. This works until product needs shift—suddenly you need adaptive noise cancellation that conditionally inserts stages, or A/B testing that swaps filter topologies without shipping new binaries. Hardcoded chains become unmaintainable fast.

The alternative: treat audio processing as a directed acyclic graph (DAG) where nodes are processing units and edges are sample buffers. This pattern, borrowed from tools like Max/MSP and JUCE's AudioProcessorGraph, brings runtime flexibility without sacrificing real-time guarantees. Shipping HearingAid Pro—an AirPods DSP product—required exactly this: users toggled frequency shaping, compression, and spatial audio independently, demanding a graph that could reconfigure in under 10ms without glitches.

Graph Topology and Execution Order

A valid audio graph is a DAG where each node declares input/output channel counts and sample rate constraints. Feedback cycles are occasionally useful (reverb tails, for example), but each one must be broken by an explicit delay node so that the part of the graph scheduled within a single block stays acyclic. The runtime must:

Topologically sort nodes to determine execution order
Allocate intermediate buffers sized to the graph's max block size
Handle fan-out (one node feeding multiple downstream consumers) with buffer copies or shared immutable views
Detect and reject invalid graphs before entering the real-time thread
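
The ordering and rejection steps above can be sketched with Kahn's algorithm, one standard way to do both in a single pass. This is a minimal sketch, not the engine's actual code: the adjacency-map representation and the name topologicalOrder are assumptions made for illustration.

```dart
import 'dart:collection';

/// Returns node ids in execution order, or null if the graph has a cycle.
/// `edges` maps each node id to the ids it feeds (hypothetical representation).
List<String>? topologicalOrder(Map<String, List<String>> edges) {
  // Count incoming edges for every node that appears as a source or a target.
  final inDegree = <String, int>{};
  for (final entry in edges.entries) {
    inDegree.putIfAbsent(entry.key, () => 0);
    for (final target in entry.value) {
      inDegree[target] = (inDegree[target] ?? 0) + 1;
    }
  }

  // Start from nodes with no inputs (typically the hardware input node).
  final ready =
      Queue<String>.of(inDegree.keys.where((id) => inDegree[id] == 0));
  final order = <String>[];

  while (ready.isNotEmpty) {
    final id = ready.removeFirst();
    order.add(id);
    for (final target in edges[id] ?? const <String>[]) {
      final remaining = inDegree[target]! - 1;
      inDegree[target] = remaining;
      if (remaining == 0) ready.add(target);
    }
  }

  // Any node left unvisited sits on a cycle, so the graph is rejected.
  return order.length == inDegree.length ? order : null;
}
```

Because the function runs during setup, its allocations are harmless; only the resulting order needs to be handed to the real-time side.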

Topological sort is straightforward: Kahn's algorithm runs in O(V+E) and fits comfortably in a non-realtime setup phase. The tricky part is ensuring buffer lifetimes don't overlap incorrectly. A node can't overwrite its input buffer if another node downstream still needs it. Conservative approach: each edge gets a dedicated buffer. Optimized: reuse buffers whose lifetimes don't overlap. Once the execution order is fixed, each buffer's lifetime is an interval from its writer to its last reader, so this is interval-graph coloring, which a greedy pass over the nodes in execution order solves with the minimum number of buffers.
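
The buffer-reuse idea can be sketched as a greedy pass over the execution order. This is an illustrative sketch under simplifying assumptions: each node has exactly one output, nodes are identified by their position in the execution order, and the function name assignBuffers and the lastReader input are hypothetical.

```dart
/// Assigns a buffer index to each node's output, reusing buffers whose last
/// reader has already run. `lastReader[i]` is the execution-order position of
/// the final node that reads node i's output (use i itself if nothing reads it).
List<int> assignBuffers(List<int> lastReader) {
  final assignment = <int>[];
  final busyUntil = <int>[]; // busyUntil[b]: last step that reads buffer b

  for (var step = 0; step < lastReader.length; step++) {
    // A buffer is reusable only if its final reader ran strictly before this
    // step, so a node never writes into a buffer it is still reading.
    var chosen = busyUntil.indexWhere((last) => last < step);
    if (chosen == -1) {
      busyUntil.add(0);
      chosen = busyUntil.length - 1;
    }
    busyUntil[chosen] = lastReader[step];
    assignment.add(chosen);
  }
  return assignment;
}
```

For a three-edge chain where each output is read only by the next node, assignBuffers([1, 2, 3]) alternates between two buffers instead of allocating three. With fan-out, such as [2, 2, 3], the first buffer stays busy longer and a third buffer is needed.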

Concrete Example: Four-Node Chain

Consider a graph: Input → Highpass(200Hz) → Compressor(3:1) → Limiter(-1dBFS) → Output. At 48kHz, 128-sample blocks, the runtime allocates three intermediate float buffers (128 samples each). On each audio callback:

Input node copies hardware buffer to Buffer_A
Highpass reads Buffer_A, writes Buffer_B
Compressor reads Buffer_B, writes Buffer_C
Limiter reads Buffer_C, writes hardware output buffer

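Buffering latency: one block of 128 samples (2.67ms at 48kHz). That block period is also the deadline for the whole graph: all four process() calls together must finish well inside 2.67ms. Budgeting ~200µs per node keeps the chain near 0.8ms, which leaves headroom for scheduling jitter on mobile ARM cores.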
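
The per-callback sequence above can be sketched as a small runner that allocates one buffer per edge up front. This is a sketch only: ChainRunner and the ProcessFn typedef are hypothetical names standing in for the real node objects.

```dart
import 'dart:typed_data';

/// Hypothetical stand-in for a node's process() method.
typedef ProcessFn = void Function(
    Float32List input, Float32List output, int blockSize);

class ChainRunner {
  final List<ProcessFn> stages;
  final int blockSize;
  final List<Float32List> _buffers; // one buffer per edge between stages

  ChainRunner(this.stages, this.blockSize)
      : assert(stages.isNotEmpty),
        _buffers =
            List.generate(stages.length - 1, (_) => Float32List(blockSize));

  /// Called once per audio callback; the runner itself allocates nothing here.
  void render(Float32List hardwareIn, Float32List hardwareOut) {
    var source = hardwareIn;
    for (var i = 0; i < stages.length; i++) {
      // The last stage writes straight into the hardware output buffer.
      final isLast = i == stages.length - 1;
      final target = isLast ? hardwareOut : _buffers[i];
      stages[i](source, target, blockSize);
      source = target;
    }
  }
}
```

With four stages (input copy, highpass, compressor, limiter) the constructor creates exactly the three intermediate buffers described above.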

Node Interface and Type Safety

In Dart/Flutter, define a base class:

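import 'dart:typed_data';

abstract class AudioNode {
  /// Channel counts, checked when nodes are connected.
  int get inputChannels;
  int get outputChannels;

  /// Processes blockSize samples from input into output; must not allocate.
  void process(Float32List input, Float32List output, int blockSize);

  /// (Re)computes rate-dependent state such as filter coefficients.
  void configure(int sampleRate);
}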

Concrete nodes implement process() with actual DSP. For example, a simple highpass IIR filter:

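import 'dart:math';
import 'dart:typed_data';

class HighpassNode extends AudioNode {
  final double cutoffHz;
  // Transposed direct form II state: two values carry both x and y history.
  double _z1 = 0.0, _z2 = 0.0;
  // Biquad coefficients, normalized by a0 in configure().
  double _b0 = 1.0, _b1 = 0.0, _b2 = 0.0, _a1 = 0.0, _a2 = 0.0;

  HighpassNode(this.cutoffHz, int sampleRate) {
    configure(sampleRate);
  }

  @override
  int get inputChannels => 1;
  @override
  int get outputChannels => 1;

  @override
  void configure(int sampleRate) {
    // RBJ cookbook highpass with Q = 0.707 (Butterworth response)
    final w0 = 2 * pi * cutoffHz / sampleRate;
    final cosW0 = cos(w0);
    final alpha = sin(w0) / (2 * 0.707);
    final a0 = 1 + alpha; // every coefficient is divided by a0
    _b0 = (1 + cosW0) / 2 / a0;
    _b1 = -(1 + cosW0) / a0;
    _b2 = _b0;
    _a1 = -2 * cosW0 / a0;
    _a2 = (1 - alpha) / a0;
  }

  @override
  void process(Float32List input, Float32List output, int blockSize) {
    for (int i = 0; i < blockSize; i++) {
      final x = input[i];
      final y = _b0 * x + _z1;
      _z1 = _b1 * x - _a1 * y + _z2; // feedback uses the output y, not x
      _z2 = _b2 * x - _a2 * y;
      output[i] = y;
    }
  }
}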

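Type safety here means channel counts are checked when the graph is built, not when audio is already flowing. A graph builder validates each connection: a stereo output can't feed a mono input without an explicit downmix node. Because channel counts are runtime values in this interface, the check runs at graph-construction time rather than in the compiler, but it still catches wiring errors before the graph reaches the audio thread, which is critical when changes deploy to production without audio QA.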
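
A connection check of this kind might look like the following sketch. NodeSpec and GraphBuilder are hypothetical names used only for illustration; a real builder would hold the node objects themselves.

```dart
/// Minimal channel-count description of a node (hypothetical helper type).
class NodeSpec {
  final String name;
  final int inputChannels;
  final int outputChannels;
  const NodeSpec(this.name, this.inputChannels, this.outputChannels);
}

class GraphBuilder {
  final List<String> connections = [];

  /// Records source -> target, or throws if the channel counts disagree.
  void connect(NodeSpec source, NodeSpec target) {
    if (source.outputChannels != target.inputChannels) {
      throw ArgumentError(
          '${source.name} outputs ${source.outputChannels} channel(s) but '
          '${target.name} expects ${target.inputChannels}; '
          'insert an explicit up/downmix node');
    }
    connections.add('${source.name}->${target.name}');
  }
}
```

Throwing at connect() time keeps the failure on the control side, where it can be reported to the user or to a test, long before any audio callback runs.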

Runtime Reconfiguration Without Glitches

Users toggle effects mid-playback. Naive approach: lock the audio thread, rebuild the graph, unlock. Result: audible clicks or dropouts. Better: double-buffered graphs. Maintain two graph instances—active (running in audio thread) and staging (being modified in UI thread). When reconfiguration completes:

Allocate new buffers for staging graph
Copy per-node state (filter memories and coefficients, compressor envelopes) from active to staging for nodes that exist in both graphs
Atomic pointer swap: staging becomes active
Deallocate old graph on a background thread

The audio thread picks up the new pointer at the start of its next callback, so the swap always lands on a block boundary (callbacks arrive every ~2.67ms at 128-sample blocks and 48kHz). Copying state prevents discontinuities: a compressor mid-attack should resume at the same envelope level, not reset to zero.
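
The swap sequence can be sketched as follows. This is a single-isolate illustration of the control flow only: Dart isolates do not share mutable objects, so in a real engine the pending pointer would be an atomic exchanged in native code. GraphInstance, GraphHost, and the per-node state map are hypothetical.

```dart
/// Hypothetical graph instance: only the state relevant to the swap is shown.
class GraphInstance {
  final Map<String, double> nodeState; // e.g. compressor envelope per node id
  GraphInstance(this.nodeState);
}

class GraphHost {
  GraphInstance _active;
  GraphInstance? _pending; // written by the control side, consumed by audio

  GraphHost(this._active);
  GraphInstance get active => _active;

  /// Control side: publish a fully built staging graph.
  void publish(GraphInstance staging) {
    _pending = staging;
  }

  /// Audio side: call at the top of each callback, before processing.
  /// Returns the retired graph so the caller can dispose of it elsewhere.
  GraphInstance? swapIfPending() {
    final next = _pending;
    if (next == null) return null;
    _pending = null;
    // Carry over running state for nodes present in both graphs so that
    // envelopes and filter memories continue instead of restarting.
    _active.nodeState.forEach((id, value) {
      if (next.nodeState.containsKey(id)) next.nodeState[id] = value;
    });
    final retired = _active;
    _active = next;
    return retired;
  }
}
```

Copying state at the moment of the swap, rather than earlier on the control side, means the staging graph inherits the values the active graph actually reached in its last block.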

Parameter Smoothing

Changing a filter cutoff from 200Hz to 2kHz in a single step produces an audible click, and coarse stepped changes produce zipper noise. Solution: parameter smoothing, which interpolates each change over N samples (typically 32-64). Each node stores target and current values and moves the current value toward the target on every sample until the ramp completes. This adds O(parameters) work per sample while a ramp is active but eliminates the artifacts.
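
A linear ramp is one simple way to implement this. The sketch below assumes per-sample updates; the class name SmoothedParameter is hypothetical, and other curves (one-pole, exponential) are equally common.

```dart
/// Linear ramp from the current value to a target over a fixed sample count.
class SmoothedParameter {
  double _current;
  double _target;
  double _step = 0.0;
  int _remaining = 0;

  SmoothedParameter(double initial)
      : _current = initial,
        _target = initial;

  double get current => _current;

  /// Starts a ramp that reaches `target` after `rampSamples` calls to next().
  void setTarget(double target, int rampSamples) {
    _target = target;
    if (rampSamples <= 0) {
      _current = target;
      _remaining = 0;
      return;
    }
    _remaining = rampSamples;
    _step = (target - _current) / rampSamples;
  }

  /// Advances one sample and returns the smoothed value.
  double next() {
    if (_remaining > 0) {
      _current += _step;
      _remaining--;
      if (_remaining == 0) _current = _target; // remove accumulated rounding
    }
    return _current;
  }
}
```

For a filter, it is often safer to smooth the user-facing parameter (the cutoff) and recompute coefficients from the smoothed value than to interpolate the coefficients directly, since intermediate coefficient sets are not guaranteed to describe a well-behaved filter.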

Fan-Out and Parallel Branches

Real graphs have fan-out: one node feeds multiple consumers. Example: send pre-compressor signal to a spectrum analyzer while post-compressor goes to output. Two strategies:

Buffer duplication: Copy the source buffer for each consumer. Simple, cache-unfriendly for large block sizes.
Immutable views: Nodes declare read-only inputs. Source buffer isn't copied; multiple nodes read the same memory. Requires discipline—no node can write to an input buffer.

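For graphs with extensive fan-out (think modular synth with 20+ parallel analyzers), immutable views remove one buffer copy per extra consumer, so copy traffic shrinks roughly in proportion to the fan-out. Trade-off: stricter API contracts, and corrupted input for sibling consumers (or data races, if branches run in parallel) when a node cheats and mutates its input.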

Latency and Block Size Trade-Offs

Smaller blocks (64 samples) reduce latency but increase callback frequency: more per-callback overhead, less time per node. Larger blocks (512 samples) amortize overhead but add 10ms+ latency, unacceptable for live monitoring. Mobile targets: 128-256 samples balance latency (~3-5ms) and CPU budget.
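
The trade-off is simple arithmetic, sketched here with two hypothetical helper functions:

```dart
/// Duration of one block in milliseconds.
double blockLatencyMs(int blockSize, int sampleRate) =>
    blockSize * 1000.0 / sampleRate;

/// Number of audio callbacks per second at a given block size.
double callbacksPerSecond(int blockSize, int sampleRate) =>
    sampleRate / blockSize;
```

At 48kHz this gives about 1.33ms for 64 samples, 2.67ms for 128, 5.33ms for 256, and 10.67ms for 512, with 750, 375, 187.5, and 93.75 callbacks per second respectively. These figures cover one block of buffering only; hardware and OS layers add their own latency on top.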

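iOS Core Audio defaults to a large buffer (512 samples or more, depending on device and route); request a shorter one with AVAudioSession's setPreferredIOBufferDuration(_:), and treat the value as a preference the system may not honor exactly. Android AAudio, opened in its low-latency performance mode, typically runs bursts of 48-192 samples on devices with a low-latency path. Flutter's audio plugins (like flutter_sound) often force 1024+ samples, so custom platform channels with native audio engine integration become necessary for sub-5ms latency.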

Memory and SIMD Considerations

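Dart's Float32List is heap-allocated; for real-time, pre-allocate all buffers during graph setup. Never allocate in process(): an allocation can trigger garbage collection and stall the callback. On ARM, NEON intrinsics accelerate buffer operations (copy, mix, multiply) by handling four floats per instruction. The gain over a scalar loop can approach 4x for arithmetic-bound kernels, though for a plain copy it depends on memory bandwidth and on whether the compiler already auto-vectorizes the loop: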

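// NEON buffer copy (C++, called from Dart via FFI; requires an ARM target)
#include <arm_neon.h>

void copyBufferNEON(const float* src, float* dst, int count) {
  int i = 0;
  // Copy four floats per iteration with 128-bit loads and stores
  for (; i + 4 <= count; i += 4) {
    float32x4_t vec = vld1q_f32(&src[i]);
    vst1q_f32(&dst[i], vec);
  }
  // Scalar tail for counts that are not a multiple of four
  for (; i < count; i++) dst[i] = src[i];
}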

Dart FFI bridges to native C++ for critical paths. Graph logic stays in Dart (easy to test, modify); hot loops drop to native. This hybrid keeps development velocity high while hitting performance targets.

Testing and Debugging

Unit test nodes individually: feed known input (sine wave, impulse), verify output (FFT, peak detection). Integration tests run full graphs offline, comparing output to reference recordings (perceptual audio quality metrics like PESQ for speech, PEAQ for music).
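
A node-level test along these lines might look like the sketch below. The DC blocker is a hypothetical node chosen because its expected behavior is easy to state; the tolerances are illustrative assumptions, not recommended thresholds.

```dart
import 'dart:math';
import 'dart:typed_data';

/// Test signal: sine wave at `freqHz` with unit amplitude.
Float32List sine(double freqHz, int sampleRate, int length) {
  final out = Float32List(length);
  for (var i = 0; i < length; i++) {
    out[i] = sin(2 * pi * freqHz * i / sampleRate);
  }
  return out;
}

/// Root-mean-square level of a buffer.
double rms(Float32List buffer) {
  var sum = 0.0;
  for (final s in buffer) {
    sum += s * s;
  }
  return sqrt(sum / buffer.length);
}

/// Hypothetical node under test: one-pole DC blocker,
/// y[n] = x[n] - x[n-1] + r * y[n-1].
class DcBlocker {
  final double r;
  double _x1 = 0.0, _y1 = 0.0;
  DcBlocker([this.r = 0.995]);

  void process(Float32List input, Float32List output, int blockSize) {
    for (var i = 0; i < blockSize; i++) {
      final y = input[i] - _x1 + r * _y1;
      _x1 = input[i];
      _y1 = y;
      output[i] = y;
    }
  }
}

/// Returns true when the node removes DC but passes a 1kHz tone.
bool dcBlockerBehaves() {
  const n = 4800; // 100ms at 48kHz
  final out = Float32List(n);

  // Known input 1: constant offset. The output should decay toward zero.
  final dc = Float32List(n)..fillRange(0, n, 1.0);
  DcBlocker().process(dc, out, n);
  final dcRemoved = out[n - 1].abs() < 1e-3;

  // Known input 2: 1kHz sine. The level should stay close to 1/sqrt(2).
  DcBlocker().process(sine(1000, 48000, n), out, n);
  final tonePassed = (rms(out) - 1 / sqrt2).abs() < 0.05;

  return dcRemoved && tonePassed;
}
```

Each check uses a fresh node instance so that state left over from one test signal cannot leak into the next.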

Debugging real-time code is hard—print statements cause dropouts. Instead: log to a lock-free ring buffer, drain on a background thread. Record input/output buffers to disk for post-mortem analysis. Tools like Superpowered's AudioLogger or custom solutions work well.
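
The ring-buffer logging idea can be sketched as a fixed-capacity queue of numeric events. This shows the data structure only: within one Dart isolate there is no true concurrency, and across real threads the head and tail indices would need atomic loads and stores in native code. RingLog and its event-code convention are hypothetical.

```dart
import 'dart:typed_data';

/// Fixed-capacity single-producer/single-consumer log of (code, value) pairs.
class RingLog {
  final Int32List _codes;
  final Float64List _values;
  int _head = 0; // next slot to write (producer)
  int _tail = 0; // next slot to read (consumer)

  // One extra slot distinguishes "full" from "empty".
  RingLog(int capacity)
      : _codes = Int32List(capacity + 1),
        _values = Float64List(capacity + 1);

  /// Audio side: returns false (drops the entry) when full, never blocks.
  bool tryWrite(int code, double value) {
    final next = (_head + 1) % _codes.length;
    if (next == _tail) return false;
    _codes[_head] = code;
    _values[_head] = value;
    _head = next;
    return true;
  }

  /// Background side: hands every pending entry to `sink`, returns the count.
  int drain(void Function(int code, double value) sink) {
    var count = 0;
    while (_tail != _head) {
      sink(_codes[_tail], _values[_tail]);
      _tail = (_tail + 1) % _codes.length;
      count++;
    }
    return count;
  }
}
```

Logging numeric codes instead of strings keeps the audio side free of allocation and formatting; the drain side maps codes to readable messages.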

Production Lessons

Shipping a graph-based DSP engine in HearingAid Pro revealed:

Users create invalid graphs: Provide a validation layer that simulates execution, catches buffer underruns, infinite loops (via cycle detection).
Mobile thermal throttling: On sustained load (30+ minutes), iOS/Android reduce CPU clocks. Graph must adapt: drop non-critical nodes (spectrum analyzers, visual feedback) to stay within thermal budget.
State persistence: When app backgrounds, serialize graph topology and node parameters to JSON, restore on resume. Stateful nodes (reverbs with long tails) need explicit reset methods.
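
For the state-persistence point, a round trip through JSON might look like the sketch below. The snapshot layout (a version field, a map of node parameters, and a list of edge pairs) is an assumed format for illustration, not the product's actual schema.

```dart
import 'dart:convert';

/// Hypothetical snapshot of graph topology and node parameters.
class GraphSnapshot {
  final Map<String, Map<String, double>> nodes; // node id -> parameters
  final List<List<String>> edges; // [sourceId, targetId] pairs

  GraphSnapshot(this.nodes, this.edges);

  String toJson() =>
      jsonEncode({'version': 1, 'nodes': nodes, 'edges': edges});

  factory GraphSnapshot.fromJson(String source) {
    final root = jsonDecode(source) as Map<String, dynamic>;
    final nodes = <String, Map<String, double>>{};
    (root['nodes'] as Map<String, dynamic>).forEach((id, params) {
      // JSON numbers may decode as int or double, so normalize to double.
      nodes[id] = (params as Map<String, dynamic>)
          .map((key, value) => MapEntry(key, (value as num).toDouble()));
    });
    final edges = [
      for (final edge in root['edges'] as List<dynamic>)
        [for (final id in edge as List<dynamic>) id as String]
    ];
    return GraphSnapshot(nodes, edges);
  }
}
```

Only parameters are persisted here. Running state such as reverb tails is deliberately left out, which is why restored nodes need an explicit reset before processing resumes.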

When Not to Use Graphs

Fixed, simple pipelines (e.g., just a limiter) don't justify graph overhead. Hardcode those—less complexity, easier to optimize. Graphs shine when:

Users configure DSP (plugins, modular effects)
A/B testing requires runtime topology changes
Conditional processing based on signal analysis (adaptive noise gates, dynamic EQ)

When one of these applies, the flexibility usually outweighs the roughly 5-10% CPU overhead of graph traversal.

Future Directions

GPU-accelerated DSP via Metal/Vulkan compute shaders can parallelize node execution—process independent branches simultaneously. Latency remains bound by longest path, but throughput increases. WebAssembly SIMD (wasm-simd) enables graph engines that run identically in browsers and native apps, useful for web-based DAWs or plugin sandboxes.

Composable audio graphs transform rigid DSP pipelines into flexible, testable, runtime-reconfigurable systems. The architecture scales from simple filter chains to complex adaptive processors, maintaining real-time guarantees while enabling rapid product iteration—essential for modern audio applications.