Zero-Copy Audio Routing: CoreAudio → ML Pipeline

Real-time audio ML applications—speech therapy apps, hearing aids, voice trainers—face a brutal constraint: total glass-to-glass latency must stay under 10–15ms to avoid perceptible lag. Yet the naive approach of capturing audio in CoreAudio callbacks, copying buffers to an ML inference queue, and copying results back burns 3–6ms in pure memory operations. For apps like KidzCare or HearingAid Pro processing live speech at 48kHz with 128-sample frames (2.67ms per frame), every microsecond counts.

This article dissects zero-copy audio routing patterns on iOS, showing how shared memory pools and careful AVAudioEngine tap placement cut latency by 40–60% while maintaining real-time safety guarantees.

The Memcpy Tax in Traditional Pipelines

A typical audio ML pipeline on iOS looks like this:

AVAudioEngine input tap delivers AVAudioPCMBuffer on the real-time thread
Callback copies float samples into a ring buffer or dispatch queue
Background thread dequeues, copies again into ML model input tensor
Inference runs (ONNX Runtime, Core ML, custom DSP)
Output tensor copied to audio output buffer
Render callback copies to CoreAudio output

Each copy touches 512–2048 bytes per frame. At 48kHz with 128-sample frames, that's 375 frames/sec × 4 copies × 512 bytes = 768KB/sec of pure memory bandwidth, plus cache pollution. On A-series chips with 8–16MB L2 cache shared across efficiency cores, this evicts working set data and adds 200–800μs per frame in memory stalls.
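The arithmetic behind the 768KB/sec figure can be checked at compile time. This is a minimal sketch that assumes mono Float32 samples; the constant and function names are illustrative and not taken from any shipping code.

```cpp
#include <cassert>
#include <cstddef>

// Stream parameters quoted in the text: 48 kHz, 128-sample frames.
// Mono Float32 (4 bytes per sample) is an assumption of this sketch.
constexpr std::size_t kSampleRate     = 48000;
constexpr std::size_t kFrameSamples   = 128;
constexpr std::size_t kBytesPerSample = sizeof(float);
constexpr std::size_t kCopiesPerFrame = 4;  // the four copy steps listed above

constexpr std::size_t framesPerSecond() { return kSampleRate / kFrameSamples; }
constexpr std::size_t bytesPerFrame()   { return kFrameSamples * kBytesPerSample; }
constexpr std::size_t copyBytesPerSecond() {
    return framesPerSecond() * kCopiesPerFrame * bytesPerFrame();
}

static_assert(framesPerSecond() == 375, "48000 / 128 frames per second");
static_assert(bytesPerFrame() == 512, "128 samples x 4 bytes");
static_assert(copyBytesPerSecond() == 768000, "375 x 4 x 512 bytes per second");
```

The raw byte count is small; the cost the article attributes to it comes from where the copies happen (the real-time thread) and what they evict from cache, not from bandwidth alone.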

Profiling HearingAid Pro's initial implementation with Instruments showed 34% of real-time thread time in memcpy and vDSP_mmov. Unacceptable for a hearing aid where every millisecond of delay degrades speech intelligibility.

Shared Buffer Pools: Pre-Allocated, Lock-Free

The core insight: audio buffers and ML tensors can share the same memory if you control allocation and ensure single-writer/single-reader access patterns. We allocate a circular pool of AudioBufferList-compatible regions at app launch:

#include <atomic>
#include <cstdint>

struct SharedAudioBuffer {
  float* samples;               // 16-byte aligned
  uint32_t frameCount;
  uint64_t timestamp;           // mach_absolute_time()
  std::atomic<uint32_t> state;  // 0=free, 1=writing, 2=reading (integer width assumed)
};

SharedAudioBuffer pool[8]; // 8× 128-frame buffers = 4KB total

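The input tap callback claims a buffer via compare-and-swap, writes directly to samples, then atomically transitions state to reading. The ML thread spins on available buffers (with exponential backoff to avoid burning CPU) and wraps the float* in an Ort::Value using Ort::Value::CreateTensor, the C++ wrapper around the C API's CreateTensorWithDataAsOrtValue. That call takes an OrtMemoryInfo describing plain CPU memory, so ONNX Runtime borrows the pointer instead of allocating from its arena. The ML thread then runs inference and flips state back to free.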
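The claim, publish, consume, and release cycle described above might look like the following. This is a sketch under stated assumptions, not the app's implementation: the names `Slot`, `BufferPool`, `claimForWrite`, `publish`, `nextReadable`, and `release` are hypothetical, samples are stored inline rather than behind a `float*`, and exactly one producer thread and one consumer thread are assumed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t kFree = 0, kWriting = 1, kReading = 2;  // states from the struct above
constexpr std::size_t kPoolSize = 8;
constexpr std::size_t kFrameSamples = 128;

struct Slot {
    alignas(16) float samples[kFrameSamples];
    std::atomic<uint32_t> state{kFree};
};

struct BufferPool {
    Slot slots[kPoolSize];
    std::size_t writeIndex = 0;  // touched only by the producer (audio callback)
    std::size_t readIndex = 0;   // touched only by the consumer (ML thread)

    // Producer: claim the next slot in ring order with one CAS.
    // Returns nullptr when the pool is full; never blocks or allocates.
    Slot* claimForWrite() {
        Slot& s = slots[writeIndex % kPoolSize];
        uint32_t expected = kFree;
        if (!s.state.compare_exchange_strong(expected, kWriting,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            return nullptr;
        }
        ++writeIndex;
        return &s;
    }

    // Producer: the release store makes the sample writes visible to the
    // consumer's acquire load in nextReadable().
    void publish(Slot* s) {
        assert(s != nullptr);
        s->state.store(kReading, std::memory_order_release);
    }

    // Consumer: take the oldest published slot, or nullptr if none is ready.
    Slot* nextReadable() {
        Slot& s = slots[readIndex % kPoolSize];
        if (s.state.load(std::memory_order_acquire) != kReading) return nullptr;
        ++readIndex;
        return &s;
    }

    // Consumer: hand the slot back once inference no longer needs the pointer.
    void release(Slot* s) {
        assert(s != nullptr);
        s->state.store(kFree, std::memory_order_release);
    }
};
```

Walking the slots in ring order keeps frames in FIFO order without a separate queue. The backoff loop around `nextReadable()` is left out of the sketch.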

Critical: the buffer must remain valid for the entire inference duration. ONNX Runtime and Core ML both support external memory tensors, but you must guarantee the pointer stays live. We use a two-phase commit: the render callback checks if output is ready before releasing the input buffer, ensuring the ML thread never reads freed memory.

AVAudioEngine Tap Placement

Where you tap matters. Tapping inputNode gives you raw ADC samples but forces you to handle format conversion (typically 16-bit int → float). Tapping after a mixer node lets CoreAudio handle conversion, but adds one extra buffer copy internally.

For HearingAid Pro, we tap the input node and convert with vDSP_vflt16 (16-bit int to float, NEON-accelerated on ARM), writing directly into the shared buffer. This saves one allocation and keeps the hot path under 40μs on iPhone 12.
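The conversion step can be illustrated with a portable stand-in. On Apple platforms the loop below would be replaced by vDSP_vflt16, which converts without scaling, followed by a scale such as vDSP_vsmul. The function name `convertInt16ToFloat` and the normalization to [-1.0, 1.0) are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts signed 16-bit PCM to normalized floats and writes them straight
// into `dst`, which is assumed to point at a pre-allocated shared buffer.
// No intermediate buffer is allocated or copied.
inline void convertInt16ToFloat(const int16_t* src, float* dst, std::size_t count) {
    assert(src != nullptr && dst != nullptr);
    constexpr float kScale = 1.0f / 32768.0f;  // maps int16 range onto [-1.0, 1.0)
    for (std::size_t i = 0; i < count; ++i) {
        dst[i] = static_cast<float>(src[i]) * kScale;
    }
}
```

Because the destination is the shared slot itself, the converted samples are already where the ML thread will read them.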

Output Path: In-Place Tensor Mutation

The output side is trickier. Most ML models allocate their own output tensors. To avoid a copy, we either:

1. Preallocate output tensors in the shared pool and pass them as Ort::IoBinding targets (ONNX Runtime 1.10+). This works if your model supports external output buffers.
2. Use Metal shaders for post-processing (gain, EQ, limiting) that write directly to AudioBufferList.mBuffers[0].mData. CoreAudio's render callback can then reference the same pointer.

We chose option 2 for HearingAid Pro because our DSP chain (noise suppression → dynamic range compression → frequency shaping) is faster in Metal than CPU SIMD, and Metal can write to IOSurface-backed buffers that CoreAudio maps without copying.

Real-Time Thread Safety

The audio render thread runs under the Mach THREAD_TIME_CONSTRAINT_POLICY, and its budget is one I/O buffer period: 2.67ms for 128 frames at 48kHz (2.9ms at 44.1kHz). You cannot:

Allocate or free memory
Take locks (even spinlocks risk priority inversion)
Call Objective-C methods (potential retain/release)
Touch Swift copy-on-write containers

Our shared buffer pool is plain C structs, the atomics use memory_order_acquire/memory_order_release, and the render callback only reads pointers. It performs a single memcpy only when the ML thread hasn't finished (graceful degradation: we repeat the last frame, which is inaudible for a single 2.67ms frame).
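The fallback decision can be made with a single atomic exchange and no locks or allocation. The sketch below is illustrative only: `RenderState`, `Picked`, and `pickFrame` are hypothetical names, and it assumes the ML thread publishes a pointer into the shared pool and that the slot holding the last good frame is not recycled until a newer frame replaces it.

```cpp
#include <atomic>
#include <cassert>

struct RenderState {
    std::atomic<const float*> processed{nullptr};  // published by the ML thread (release store)
    const float* lastGood = nullptr;               // touched only by the render thread
};

struct Picked {
    const float* samples;  // frame to play; nullptr before any frame has arrived
    bool repeated;         // true when falling back to the previous frame
};

// Render-thread side: take the fresh frame if one was published, otherwise
// repeat the last good one. Only pointers are read here; the caller does the
// single fallback memcpy when `repeated` is true.
inline Picked pickFrame(RenderState& st) {
    const float* fresh = st.processed.exchange(nullptr, std::memory_order_acquire);
    if (fresh != nullptr) {
        st.lastGood = fresh;
        return Picked{fresh, false};
    }
    return Picked{st.lastGood, true};
}
```

A caller would treat a null `samples` pointer as silence and count each `repeated` result as a glitch for logging outside the real-time thread.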

Latency Breakdown: Before and After

Measured on iPhone 13 Pro, 48kHz, 128-frame buffers, ONNX Runtime 1.14 with the CoreML execution provider:

Stage              Naive (μs)   Zero-Copy (μs)
Input copy         420          0
Queue handoff      180          35 (CAS spin)
Tensor wrap        310          8
Inference          1850         1850
Output copy        390          0
Render callback    220          220
Total              3370         2113

Net savings: 1.26ms per frame, or 37% reduction. Combined with other optimizations (quantized models, Metal pre-processing), HearingAid Pro achieves 4.2ms glass-to-glass latency, well below the 10ms threshold where users report discomfort.
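The totals and the percentage follow directly from the per-stage numbers. A small sketch that recomputes them, using only the measurements quoted above (the struct and function names are illustrative):

```cpp
#include <cassert>

struct StageCost {
    const char* name;
    int naiveUs;
    int zeroCopyUs;
};

// Per-stage measurements as reported in the breakdown above.
static const StageCost kStages[] = {
    {"Input copy",      420,  0},
    {"Queue handoff",   180,  35},
    {"Tensor wrap",     310,  8},
    {"Inference",       1850, 1850},
    {"Output copy",     390,  0},
    {"Render callback", 220,  220},
};

inline int totalNaiveUs() {
    int sum = 0;
    for (const StageCost& s : kStages) sum += s.naiveUs;
    return sum;
}

inline int totalZeroCopyUs() {
    int sum = 0;
    for (const StageCost& s : kStages) sum += s.zeroCopyUs;
    return sum;
}

// Savings as a percentage of the naive total.
inline double percentReduction() {
    const int naive = totalNaiveUs();
    assert(naive > 0);
    return 100.0 * (naive - totalZeroCopyUs()) / naive;
}
```

Inference is untouched at 1850μs in both columns, so the 1257μs saved comes entirely from the copy, handoff, and tensor-wrap stages.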

Edge Cases and Failure Modes

Buffer starvation: If ML inference takes longer than the frame period, the pool empties. We reserve 2 buffers as emergency fallback—render callback uses the last successfully processed frame and logs a glitch. In production, this happens