Ring Buffer Audio I/O: Lock-Free DSP in Swift

Real-time audio on iOS demands sub-10ms latency and zero thread contention. When your render callback fires every 5.8ms at 44.1kHz (256 samples), a single mutex lock can blow your budget. Lock-free ring buffers—circular queues with atomic indices—solve this by letting producer and consumer threads coordinate without blocking.

This article walks through implementing a production-grade ring buffer in Swift for AVAudioEngine DSP chains, drawing on patterns used in apps like HearingAid Pro where AirPods transparency mode meets custom EQ and compression.

Why Ring Buffers for Audio

Audio I/O sits at the intersection of hard real-time constraints and multi-threaded chaos. Your render thread cannot afford priority inversion: if it blocks waiting for a mutex held by a lower-priority thread, you get dropouts (xruns). The OS may preempt that holder, and your audio callback misses its deadline.

A lock-free ring buffer gives you:

Bounded wait-free reads/writes: O(1) operations with no syscalls
Cache-friendly sequential access: contiguous memory, predictable prefetch
Single-producer single-consumer (SPSC) design: with exactly one writer and one reader, no CAS loops are needed, just atomic loads/stores

Trade-off: you sacrifice dynamic resizing. Capacity is fixed at init, and overflow means dropped samples. In practice, a 2048-sample buffer at 48kHz gives you about 42.7ms of headroom, enough to smooth jitter from file I/O or network fetch.
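The headroom figure is just sample count divided by sample rate. A minimal sketch of that arithmetic, where the helper name milliseconds(forSamples:atSampleRate:) is hypothetical and not part of any Apple API:

```swift
// Hypothetical helper: converts a sample count to milliseconds at a given sample rate.
func milliseconds(forSamples samples: Int, atSampleRate sampleRate: Double) -> Double {
    return Double(samples) / sampleRate * 1000.0
}

// 2048 samples at 48 kHz: the buffer headroom discussed above.
let headroom = milliseconds(forSamples: 2048, atSampleRate: 48_000)

// 256 samples at 44.1 kHz: the render callback period from the introduction.
let callbackPeriod = milliseconds(forSamples: 256, atSampleRate: 44_100)
```

Here headroom comes out at roughly 42.7ms and callbackPeriod at roughly 5.8ms. If the buffer keeps one slot empty to tell full from empty, the usable headroom is one sample shorter.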

Memory Ordering Fundamentals

Swift's ManagedAtomic (from swift-atomics) exposes memory ordering semantics critical for correctness. A naïve implementation using plain integers is a data race: both the compiler and the ARM64 CPU are free to reorder its loads and stores, so the consumer can observe an advanced index before the samples it covers and read stale data.

Key invariants:

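Write index: the producer advances it after storing samples, using a .releasing store so the sample writes become visible first. The consumer reads it with an .acquiring load.
Read index: the consumer advances it after copying samples out, using a .releasing store. The producer reads it with an .acquiring load, so it never overwrites a slot that is still being read.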

Example skeleton in Swift:

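import Atomics

// SPSC only: exactly one thread may call write(), and exactly one may call read().
final class LockFreeRingBuffer<T> {
    private let buffer: UnsafeMutablePointer<T>
    private let capacity: Int // one slot stays empty, so capacity - 1 elements are usable
    private let writeIndex = ManagedAtomic<Int>(0)
    private let readIndex = ManagedAtomic<Int>(0)

    // Illustrative initializer: pre-fills storage so every slot is initialized memory.
    init(capacity: Int, repeating fill: T) {
        precondition(capacity > 1, "capacity must be at least 2")
        self.capacity = capacity
        buffer = .allocate(capacity: capacity)
        buffer.initialize(repeating: fill, count: capacity)
    }

    deinit {
        buffer.deinitialize(count: capacity)
        buffer.deallocate()
    }

    func write(_ value: T) -> Bool {
        let w = writeIndex.load(ordering: .relaxed) // only the producer modifies w
        let r = readIndex.load(ordering: .acquiring)
        let next = (w + 1) % capacity
        guard next != r else { return false } // full
        buffer[w] = value
        writeIndex.store(next, ordering: .releasing) // publish the sample
        return true
    }

    func read() -> T? {
        let r = readIndex.load(ordering: .relaxed) // only the consumer modifies r
        let w = writeIndex.load(ordering: .acquiring)
        guard r != w else { return nil } // empty
        let value = buffer[r]
        readIndex.store((r + 1) % capacity, ordering: .releasing) // free the slot
        return value
    }
}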

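The .acquiring load of readIndex in write pairs with the consumer's .releasing store: once the producer sees an advanced read index, the consumer has finished reading the slots it freed. Likewise, the .releasing store of writeIndex publishes the sample before the new index becomes visible to the consumer's .acquiring load. Each pair forms a happens-before edge without locks.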

Bulk Transfers for AVAudioPCMBuffer

Audio callbacks deal in chunks: 128, 256, or 512 frames at a time. Looping over write() 256 times is wasteful—cache misses, repeated modulo ops, atomic overhead. Instead, expose bulk methods:

extension LockFreeRingBuffer {
    // Copies up to `count` elements from `source`; returns how many were written.
    func writeBulk(_ source: UnsafePointer<T>, count: Int) -> Int {
        let w = writeIndex.load(ordering: .relaxed)
        let r = readIndex.load(ordering: .acquiring)
        // Free slots, keeping one empty to distinguish full from empty.
        let available = (r > w) ? (r - w - 1) : (capacity - w + r - 1)
        let toWrite = min(count, available)

        if toWrite == 0 { return 0 }

        // First chunk: from w up to the end of the storage.
        let firstChunk = min(toWrite, capacity - w)
        buffer.advanced(by: w).update(from: source, count: firstChunk)

        // Second chunk: the remainder wraps to the start.
        if toWrite > firstChunk {
            let wrap = toWrite - firstChunk
            buffer.update(from: source.advanced(by: firstChunk), count: wrap)
        }

        writeIndex.store((w + toWrite) % capacity, ordering: .releasing)
        return toWrite
    }
}

This handles the wrap-around in one shot: copy up to the end of the buffer, then copy the remainder to the start. For a 2048-sample buffer and 256-frame writes, you wrap every 8 callbacks—minimal overhead.

Mirror this for readBulk, and your AVAudioSourceNode can pull directly into its output buffer's floatChannelData.
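As a rough sketch of what the mirrored read path might look like, the two-chunk copy can be written as a free function over plain index values. The name ringRead and its signature are hypothetical; inside the class, r and w would instead come from readIndex.load(ordering: .relaxed) and writeIndex.load(ordering: .acquiring), and the returned index would be published with a .releasing store to readIndex.

```swift
// Sketch only: `ringRead` is a hypothetical free function, not part of the class above.
// It takes snapshot values of the indices instead of atomics so it can run standalone.
func ringRead(from ring: UnsafePointer<Float>, capacity: Int,
              readIndex r: Int, writeIndex w: Int,
              into destination: UnsafeMutablePointer<Float>,
              count: Int) -> (copied: Int, newReadIndex: Int) {
    // Samples currently readable, accounting for wrap-around.
    let available = (w >= r) ? (w - r) : (capacity - r + w)
    let toRead = min(count, available)
    if toRead == 0 { return (0, r) }

    // First chunk: from r up to the end of the storage.
    let firstChunk = min(toRead, capacity - r)
    destination.update(from: ring.advanced(by: r), count: firstChunk)

    // Second chunk: whatever wrapped around to the start of the storage.
    if toRead > firstChunk {
        destination.advanced(by: firstChunk).update(from: ring, count: toRead - firstChunk)
    }
    return (toRead, (r + toRead) % capacity)
}

// Usage: capacity 8, r = 6, w = 2, so the readable samples sit at indices 6, 7, 0, 1.
let ring: [Float] = [10, 11, 12, 13, 14, 15, 16, 17]
var out = [Float](repeating: 0, count: 4)
let result = ring.withUnsafeBufferPointer { src in
    out.withUnsafeMutableBufferPointer { dst in
        ringRead(from: src.baseAddress!, capacity: 8, readIndex: 6, writeIndex: 2,
                 into: dst.baseAddress!, count: 4)
    }
}
// out is now [16, 17, 10, 11]; result.copied is 4 and result.newReadIndex is 2.
```

In a render callback the destination would be the channel pointer from floatChannelData, and any shortfall (copied < count) would typically be filled with silence.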

Latency Budget Analysis

At 48kHz, 256 samples = 5.33ms. Your render callback must:

Read 256 samples from the ring buffer: ~0.5μs (L1 cache hit)
Apply DSP (EQ, compression): 1-3ms depending on filter order
Write to output: ~0.2μs

Total: up to ~3ms, dominated by the DSP, leaving about 2.3ms of slack for OS jitter. If your producer thread (e.g., file decoder) falls behind, the buffer drains. Monitor availableForRead and emit telemetry when it drops below 512 samples (10.7ms); that's your early warning for underrun.
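The fill-level check described above might look something like the following. Both function names are hypothetical, and r and w stand for snapshot values of the read and write indices:

```swift
// Hypothetical helper: number of samples the consumer can read right now.
func availableForRead(readIndex r: Int, writeIndex w: Int, capacity: Int) -> Int {
    return (w >= r) ? (w - r) : (capacity - r + w)
}

// Low-water mark from the text: 512 samples, roughly 10.7 ms at 48 kHz.
func isNearUnderrun(available: Int, lowWaterMark: Int = 512) -> Bool {
    return available < lowWaterMark
}

// Example: the write index has wrapped, the read index has not.
let fill = availableForRead(readIndex: 1900, writeIndex: 200, capacity: 2048) // 348
let warn = isNearUnderrun(available: fill)                                     // true
```

The check itself is cheap enough for the render thread, but the telemetry call is not: a common approach is to set a flag or counter there and let another thread do the reporting.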

In HearingAid Pro, we size the buffer to 4096 samples (85ms at 48kHz). This absorbs spikes from CoreML inference (20-30ms on A15) without dropout. The trade-off: up to 85ms of added end-to-end latency when the buffer is full, acceptable for hearing aid processing but too high for live monitoring.

Avoiding False Sharing

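Cache lines are 64 bytes on most ARM64 cores, but Apple silicon typically reports 128 bytes (check sysctl hw.cachelinesize). If the storage for writeIndex and readIndex sits in the same line, every producer store invalidates the consumer's cached copy, forcing a round-trip through the shared cache. Padding stored properties does not help here: ManagedAtomic is a class, so the properties hold references and the atomic storage lives in separate heap objects. One option is UnsafeAtomic over a single allocation, with the two indices 128 bytes apart:

// Assumes 8-byte Int storage: slot 16 starts 128 bytes after slot 0, so the two
// indices can never share a 128-byte cache line, whatever the allocation's alignment.
private let slots = UnsafeMutablePointer<UnsafeAtomic<Int>.Storage>.allocate(capacity: 17)
private var writeIndex: UnsafeAtomic<Int> { UnsafeAtomic(at: slots) }
private var readIndex: UnsafeAtomic<Int> { UnsafeAtomic(at: slots + 16) }
// In init: slots.initialize(to: .init(0)); (slots + 16).initialize(to: .init(0))
// In deinit: slots.deallocate()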

This keeps indices on separate cache lines, cutting cross-core traffic by ~40% in Instruments profiles. For multi-channel audio (stereo, 5.1), interleave samples by frame rather than by channel to maintain spatial locality.
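To make the frame-interleaved layout concrete, here is a small sketch that packs planar channel arrays into one interleaved array. The interleave function is a hypothetical helper for illustration; it allocates, so it belongs on the producer side rather than in the render callback.

```swift
// Hypothetical helper: packs planar channels into one frame-interleaved array,
// so frame i occupies the contiguous slots [i * channels ..< (i + 1) * channels].
func interleave(_ channels: [[Float]]) -> [Float] {
    guard let frameCount = channels.first?.count else { return [] }
    precondition(channels.allSatisfy { $0.count == frameCount }, "channel lengths differ")
    var out = [Float]()
    out.reserveCapacity(frameCount * channels.count)
    for frame in 0..<frameCount {
        for channel in channels {
            out.append(channel[frame])
        }
    }
    return out
}

let left: [Float] = [0.1, 0.2, 0.3]
let right: [Float] = [-0.1, -0.2, -0.3]
let interleaved = interleave([left, right])
// interleaved is [0.1, -0.1, 0.2, -0.2, 0.3, -0.3]
```

With this layout, one bulk write moves whole frames, and the consumer reads both channels of a frame from adjacent memory.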

Graceful Overflow Handling

When writeBulk returns less than requested, you have three options:

Drop samples: acceptable for live input (mic), unacceptable for playback
Block producer: defeats the lock-free promise, but necessary if you can't lose data
Dynamic buffer resize: requires allocating a new buffer and atomically swapping pointers—complex, rarely needed

For file playback, we pre-buffer 2 seconds (96,000 samples at 48kHz) before starting the render callback. If the decoder thread stalls, the buffer drains, but we have 2 seconds to recover. Emit a Sentry breadcrumb at 25% capacity so you can correlate underruns with device thermals or background task suspension.
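The pre-buffer and watermark rules could be captured in a small policy type along these lines. PlaybackGate and its property names are hypothetical, and the 131,072-sample capacity is an assumed value chosen only so that the 96,000-sample pre-buffer fits:

```swift
// Hypothetical policy type; the names and the capacity value are illustrative assumptions.
struct PlaybackGate {
    let capacity: Int            // ring buffer capacity in samples
    let prebufferSamples: Int    // fill level required before rendering starts
    let warningFraction: Double  // e.g. 0.25 for the 25% watermark

    // True once enough audio is queued to start the render callback.
    func canStartPlayback(fill: Int) -> Bool {
        return fill >= prebufferSamples
    }

    // True when the fill level has dropped below the warning watermark.
    func shouldWarn(fill: Int) -> Bool {
        return Double(fill) < Double(capacity) * warningFraction
    }
}

let gate = PlaybackGate(capacity: 131_072, prebufferSamples: 96_000, warningFraction: 0.25)
let readyAtOneSecond = gate.canStartPlayback(fill: 48_000)   // false: only 1 s buffered
let readyAtTwoSeconds = gate.canStartPlayback(fill: 96_000)  // true: 2 s buffered
let lowFill = gate.shouldWarn(fill: 20_000)                  // true: below 25% of capacity
```

Keeping the thresholds in one value type makes it easy to log them alongside the breadcrumb, so an underrun report shows which settings were in effect.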

Testing Lock-Free Correctness

Concurrency bugs are Heisenbugs: they vanish under the debugger's serialization. Use ThreadSanitizer (TSan) in Xcode, but also stress-test with:

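// SPSC: exactly one producer thread and one consumer thread.
let total = 10_000_000 // below 2^24, so every Float(i) is exact
let producer = Thread {
    var i = 0
    while i < total { if buffer.write(Float(i)) { i += 1 } }
}
producer.start()
var expected = 0
while expected < total {
    guard let v = buffer.read() else { continue }
    precondition(v == Float(expected), "lost, duplicated, or reordered sample")
    expected += 1
}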

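Run this for at least 10 million ops. If you see corruption (samples lost, appearing twice, or NaN), your memory ordering is wrong. To catch data races, enable Thread Sanitizer in the Xcode scheme's Diagnostics tab or pass -sanitize=thread to swiftc; TSan runs on macOS and the simulator, not on iOS devices. Also run the plain stress test on Apple Silicon hardware, because ARM's relaxed memory model exposes ordering bugs that x86 can hide.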

Production Metrics

In a shipping app handling 50,000+ daily audio sessions, track:

Underrun rate: percentage of callbacks that read fewer samples than needed
Latency percentiles: p50, p95, p99 of buffer occupancy
Atomic contention: SPSC loads and stores never retry, so any spin or retry loop around the indices points to a logic bug

Correlate underruns with device model (A12 vs A17), thermal state, and background audio mode. On older devices, CPU throttling at 80°C can push decoder latency from 15ms to 60ms, draining the buffer.
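A minimal sketch of how the first two metrics might be computed from recorded values, off the render thread. Both helper names are hypothetical, and the percentile uses the nearest-rank method, which is one of several common definitions:

```swift
// Hypothetical helper: nearest-rank percentile over recorded buffer occupancy samples.
func percentile(_ values: [Int], _ p: Double) -> Int? {
    guard !values.isEmpty, (0...100).contains(p) else { return nil }
    let sorted = values.sorted()
    let rank = Int((p / 100.0 * Double(sorted.count)).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

// Hypothetical helper: share of callbacks that received fewer samples than requested.
func underrunRate(underruns: Int, callbacks: Int) -> Double {
    return callbacks == 0 ? 0 : Double(underruns) / Double(callbacks)
}

// Illustrative occupancy readings (in samples), not real measurements.
let occupancy = [1800, 1200, 1500, 300, 1700, 1600, 1400, 900, 1900, 1300]
let p50 = percentile(occupancy, 50)                     // 1400
let p95 = percentile(occupancy, 95)                     // 1900
let rate = underrunRate(underruns: 3, callbacks: 1000)  // 0.003
```

Sorting allocates, so the render thread should only record raw occupancy values (for example into a second ring buffer) and leave the aggregation to a background queue.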

When Not to Use Ring Buffers

If your producer and consumer run at vastly different rates (e.g., network stream at variable bitrate, render callback at fixed 48kHz), a ring buffer alone won't smooth jitter. Pair it with a resampler (AVAudioConverter, or an interpolation routine such as vDSP's vDSP_vgenp) or an adaptive playout buffer that stretches/compresses time. For multi-producer scenarios (mixing multiple audio sources), you need per-source buffers plus a lock-free mixer, a level of complexity that often justifies AVAudioEngine's higher-level graph API.

Lock-free ring buffers shine in the narrow but critical case of SPSC audio I/O where latency and determinism matter more than flexibility. Implementing one correctly requires understanding memory models, cache effects, and real-time constraints, but the payoff is audio that stays glitch-free under load, as long as the producer keeps the buffer fed.