Lock-Free Audio Queues: Real-Time DSP Threading

The Priority Inversion Problem in Audio Threads

Real-time audio processing on mobile devices operates under brutal constraints: a 512-sample buffer at 48kHz gives you 10.67 milliseconds to acquire input, process DSP, and write output before the next callback. Miss that deadline and users hear crackling, dropouts, or worse—complete audio engine stalls.
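The deadline arithmetic is worth making explicit, since every later budget in this article derives from it. A minimal sketch follows; the helper names buffer_period_ms and callback_rate_hz are illustrative, not part of any audio API.

```cpp
#include <cassert>
#include <cstddef>

// Duration of one audio buffer in milliseconds: frames / sample_rate * 1000.
inline double buffer_period_ms(std::size_t frames, double sample_rate_hz) {
    assert(frames > 0 && sample_rate_hz > 0.0);
    return static_cast<double>(frames) / sample_rate_hz * 1000.0;
}

// How many times per second the render callback fires for that buffer size.
inline double callback_rate_hz(std::size_t frames, double sample_rate_hz) {
    assert(frames > 0 && sample_rate_hz > 0.0);
    return sample_rate_hz / static_cast<double>(frames);
}
```

For 512 frames at 48 kHz this gives a period of about 10.67 ms and a callback rate of 93.75 Hz. Halving the buffer to 256 frames halves the deadline as well.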

The naive approach uses mutexes to synchronize between the audio render thread (time-critical, elevated priority) and worker threads handling model inference or UI updates. This creates priority inversion: when a low-priority thread holds a lock the audio thread needs, the audio thread can only proceed once the holder releases it, and the scheduler may be running other work ahead of that holder in the meantime. I've measured 40-80ms stalls in production hearing aid apps when a background network task held the mutex protecting the DSP parameter queue.

Lock-free data structures eliminate this class of failure entirely. By using atomic operations and careful memory ordering, we build queues that guarantee wait-free progress for the real-time thread regardless of what lower-priority code is doing.

Single-Producer Single-Consumer Ring Buffer

The foundational pattern is an SPSC ring buffer: one thread writes, one thread reads, no locks required. Here's the contract:

Fixed-size circular array (power-of-two capacity for fast modulo via bitmasking)
Two atomic indices: write head and read tail
Producer updates head after writing; consumer updates tail after reading
Empty when head equals tail; full when head is one slot behind tail

Critical insight: only the producer modifies the write index, only the consumer modifies the read index. This eliminates contention—each thread owns its index. The other thread reads the opposite index with acquire semantics to see updates.

In C++ with std::atomic, the producer push looks like:

size_t current_head = head_.load(std::memory_order_relaxed);
size_t next_head = (current_head + 1) & mask_;
if (next_head == tail_.load(std::memory_order_acquire)) return false; // full
buffer_[current_head] = item;
head_.store(next_head, std::memory_order_release);

The release store ensures prior writes to buffer_ are visible before the index update. The acquire load ensures we see the consumer's latest tail position. Relaxed load of our own head is safe—we're the only writer.
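Putting the producer side together with its mirror-image consumer, a minimal sketch of the whole queue might look like the following. The class name SpscQueue, the bool-returning pop, and the 64-byte alignment are assumptions made for illustration. Some Apple cores use 128-byte cache lines, so the alignment constant should be checked per target.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal SPSC ring buffer sketch. One slot stays empty to tell "full" from "empty".
template <typename T>
class SpscQueue {
public:
    // capacity must be a power of two so that (index & mask_) wraps correctly.
    // The vector allocates here, so construct the queue outside the audio thread.
    explicit SpscQueue(std::size_t capacity)
        : buffer_(capacity), mask_(capacity - 1) {
        assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
    }

    // Producer thread only. Returns false when the queue is full.
    bool push(const T& item) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & mask_;
        if (next == tail_.load(std::memory_order_acquire)) return false;
        buffer_[head] = item;                          // write the slot first
        head_.store(next, std::memory_order_release);  // then publish it
        return true;
    }

    // Consumer thread only. Returns false when the queue is empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = buffer_[tail];                                         // read the slot first
        tail_.store((tail + 1) & mask_, std::memory_order_release);  // then free it
        return true;
    }

private:
    std::vector<T> buffer_;
    const std::size_t mask_;
    // Assumed cache-line size: keeps the two indices on separate lines so the
    // producer and consumer do not invalidate each other's line on every update.
    alignas(64) std::atomic<std::size_t> head_{0};
    alignas(64) std::atomic<std::size_t> tail_{0};
};
```

Because one slot is sacrificed, a queue constructed with capacity 4 holds at most 3 items. Neither push nor pop loops, allocates, or waits, which is what makes both sides safe to call from a render callback.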

Memory Ordering and Cache Coherence

On ARM (all iOS devices, most Android), memory ordering is weaker than x86. Without explicit ordering, the CPU can reorder stores so that the index update becomes visible before the data it guards, which is catastrophic because the consumer then reads a slot that has not been written yet. The acquire-release pair creates a synchronization point: all writes before the release store are visible to any thread whose acquire load of that same atomic observes the stored value.
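The acquire-release handoff can be shown in isolation, without the ring buffer around it. This is a minimal sketch; Mailbox, publish, wait_and_read, and run_handoff are hypothetical names. The spin-wait exists only to keep the demo short, whereas an audio callback would poll once per buffer rather than spin.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// A plain (non-atomic) payload guarded by an atomic flag.
struct Mailbox {
    int payload = 0;
    std::atomic<bool> ready{false};
};

// Writer: fill the payload, then publish it with a release store.
inline void publish(Mailbox& box, int value) {
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Reader: once the acquire load observes true, the payload write is visible too.
inline int wait_and_read(Mailbox& box) {
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // demo only; never spin in a render callback
    }
    return box.payload;
}

// Runs one writer and one reader thread and returns what the reader saw.
inline int run_handoff(int value) {
    Mailbox box;
    int seen = -1;
    std::thread reader([&] { seen = wait_and_read(box); });
    std::thread writer([&] { publish(box, value); });
    writer.join();
    reader.join();
    return seen;
}
```

If both orderings were weakened to memory_order_relaxed, the language would no longer guarantee that the reader sees the payload, and the read of payload would be a data race.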

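In practice, on 64-bit ARM (including Apple Silicon), a memory_order_release store typically compiles to a single STLR (store-release) instruction; the separate DMB ISH barrier (data memory barrier, inner shareable domain) is what 32-bit ARMv7 code needs instead. Either way the cost is a handful of cycles, negligible compared to the roughly 30 million cycles a ~3 GHz core has available in a 10.67 ms audio buffer. Contrast with a mutex lock: 50-200 cycles in the uncontended case, unbounded if another thread holds it.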

For iOS audio, we use the Audio Unit render callback (executed on a real-time thread with priority 96). The lock-free queue sits between this callback and a lower-priority thread that updates DSP parameters—gain curves, compression ratios, EQ coefficients—based on user input or ML model output.

Handling Overruns and Underruns

A full queue (producer can't push) or empty queue (consumer can't pop) indicates backpressure. In audio, the policy differs by direction:

Parameter updates (low→high priority): Drop newest. If the queue is full, the UI thread discards its update rather than waiting. The audio thread drains the queue on each callback and applies the last value it finds, so users perceive a dropped update as a slightly delayed response, not an audio glitch.
Analysis results (high→low priority): Drop oldest. The audio thread must never block. If the consumer (UI or analytics) can't keep up, we overwrite stale data. Overwriting means the producer has to move the read position, which the plain SPSC contract above does not allow, so this direction needs a queue variant designed for overwrite. For a hearing aid app shipping real-time spectral features to a visualization, 30fps is plenty, so dropping frames is acceptable.

This asymmetry is key. The real-time thread is the bottleneck; everything else adapts to its cadence.
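The parameter-update direction can be sketched as follows: the UI side tries once and counts a drop on failure, and the audio side drains whatever is queued and keeps the last value. GainUpdate, post_update, and drain_latest are hypothetical names, and the compact queue is included only to keep the example self-contained. The drop-oldest direction is not shown because it needs a different queue design.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Compact SPSC ring buffer (one slot kept empty) so the example stands alone.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& item) {  // producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {  // consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = buffer_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    std::array<T, N> buffer_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

struct GainUpdate { float gain; };  // hypothetical parameter message

// UI thread: drop-newest policy. Never waits; just records that a drop happened.
template <std::size_t N>
bool post_update(SpscQueue<GainUpdate, N>& q, GainUpdate update, std::size_t& dropped) {
    if (q.push(update)) return true;
    ++dropped;
    return false;
}

// Audio thread: drain everything queued and keep only the most recent value.
// The loop runs at most N - 1 times per call, so its cost is bounded.
template <std::size_t N>
float drain_latest(SpscQueue<GainUpdate, N>& q, float current_gain) {
    GainUpdate update{};
    while (q.pop(update)) current_gain = update.gain;
    return current_gain;
}
```

Calling drain_latest at the top of each render callback means intermediate values are skipped on purpose: only the gain that was current when the callback started is applied to that buffer.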

Multi-Producer Scenarios

When multiple threads feed the audio pipeline—say, a network thread receiving remote audio and a local mic thread—we need MPSC (multi-producer single-consumer). The lock-free approach: each producer gets its own SPSC queue, and the consumer round-robins or priority-merges.
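A minimal sketch of the queue-per-producer approach with a round-robin consumer is shown here. RoundRobinMerger and queue_for are hypothetical names, and priority merging would replace the rotation with a fixed polling order.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Compact SPSC ring buffer (one slot kept empty) so the example stands alone.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& item) {  // producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) & (N - 1);
        if (next == tail_.load(std::memory_order_acquire)) return false;
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {  // consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = buffer_[tail];
        tail_.store((tail + 1) & (N - 1), std::memory_order_release);
        return true;
    }
private:
    std::array<T, N> buffer_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// One SPSC queue per producer; the single consumer rotates across them.
template <typename T, std::size_t N, std::size_t Producers>
class RoundRobinMerger {
public:
    // Each producer thread must use only its own queue.
    SpscQueue<T, N>& queue_for(std::size_t producer) {
        assert(producer < Producers);
        return queues_[producer];
    }

    // Consumer thread only. Polls each queue at most once, starting with the
    // one after the queue served last, so no producer can starve the others.
    bool pop(T& out) {
        for (std::size_t i = 0; i < Producers; ++i) {
            const std::size_t idx = (next_ + i) % Producers;
            if (queues_[idx].pop(out)) {
                next_ = (idx + 1) % Producers;
                return true;
            }
        }
        return false;
    }

private:
    std::array<SpscQueue<T, N>, Producers> queues_;
    std::size_t next_ = 0;  // touched by the consumer only, so no atomic needed
};
```

The appeal of this design is that every queue keeps the single-writer property, so no CAS loop is needed anywhere. The cost is memory proportional to the number of producers and no global ordering between items from different producers.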

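Alternatively, use a lock-free MPSC queue with CAS (compare-and-swap) on the head index. Producers compete via atomic compare-exchange to reserve a slot, write into it, and only then mark it readable:

// Assumes head_ and tail_ are monotonic counters (masked only when indexing) and that each
// slot carries an atomic ready flag which the consumer checks before reading and clears after.
size_t current_head = head_.load(std::memory_order_relaxed);
do {
  auto used = static_cast<std::ptrdiff_t>(current_head - tail_.load(std::memory_order_acquire));
  if (used >= static_cast<std::ptrdiff_t>(capacity_)) return false; // full
} while (!head_.compare_exchange_weak(current_head, current_head + 1, std::memory_order_relaxed));
buffer_[current_head & mask_].value = item;
buffer_[current_head & mask_].ready.store(true, std::memory_order_release); // publish the slot

The CAS loop retries if another producer modified head between the load and the exchange. Winning the CAS only reserves the slot: if the head index alone told the consumer what was readable, it could see head advance and read a slot that has not been written yet, which is why publication goes through the per-slot ready flag. Keeping the counters monotonic also prevents a stalled producer from succeeding with a stale fullness check after the indices have wrapped. On 64-bit ARM the compare-exchange compiles to an exclusive load/store loop (LDAXR/STLXR or their relaxed variants) or, on cores with the ARMv8.1 LSE atomics, a single CAS instruction, costing 10-20 cycles per attempt. With low contention (2-3 producers), success rate is >95% on first try.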
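For reference, a complete bounded MPSC queue can be sketched as below. This is one possible design, not necessarily the one used in production: it assumes monotonic 64-bit style counters and a per-slot ready flag, and the names MpscQueue and Slot are illustrative.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

template <typename T, std::size_t N>
class MpscQueue {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    // Any producer thread. Returns false when the queue is full.
    bool push(const T& item) {
        std::size_t head = head_.load(std::memory_order_relaxed);
        do {
            // Counters only ever increase; a stale head yields a negative
            // difference, which falls through to a CAS that fails and reloads.
            const auto used = static_cast<std::ptrdiff_t>(
                head - tail_.load(std::memory_order_acquire));
            if (used >= static_cast<std::ptrdiff_t>(N)) return false;
        } while (!head_.compare_exchange_weak(head, head + 1,
                                              std::memory_order_relaxed));
        Slot& slot = slots_[head & (N - 1)];
        slot.value = item;                                  // fill the reserved slot
        slot.ready.store(true, std::memory_order_release);  // then publish it
        return true;
    }

    // Single consumer thread only. Returns false when nothing is readable yet.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        Slot& slot = slots_[tail & (N - 1)];
        if (!slot.ready.load(std::memory_order_acquire)) return false;
        out = slot.value;
        slot.ready.store(false, std::memory_order_relaxed);
        tail_.store(tail + 1, std::memory_order_release);   // frees the slot for reuse
        return true;
    }

private:
    struct Slot {
        T value{};
        std::atomic<bool> ready{false};
    };
    std::array<Slot, N> slots_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

One property to be aware of: if a producer reserves a slot and is preempted before setting ready, pop returns false even when later slots are already filled. The consumer never waits, which is what matters for an audio thread, but items behind the stalled producer are delayed until it resumes.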

Practical Implementation in Swift and Kotlin

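Swift has no built-in equivalent of std::atomic in older toolchains. The swift-atomics package provides the Atomics module (ManagedAtomic, UnsafeAtomic), and Swift 6 adds an Atomic type in the standard library's Synchronization module. Where neither is an option, os_unfair_lock is a reasonable fallback for the parameter queue: it is not a spinlock (it replaced the deprecated OSSpinLock), waiters sleep in the kernel, and the lock records its owner so the kernel can boost a low-priority holder and limit inversion. It can still block, though, so reserve lock-free for the critical audio path using UnsafeAtomic.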

On Android the audio callback normally lives in native code: AAudio and OpenSL ES are C APIs, so the queue can be written in C++ with std::atomic and exposed to Kotlin through JNI. Pure Kotlin code on the JVM can use java.util.concurrent.atomic.AtomicInteger for indices. The Android audio path similarly runs on a high-priority thread; the same SPSC pattern applies.

In production, I've shipped this pattern in HearingAid Pro, where DSP parameters update 60 times per second from a SwiftUI interface while the audio callback runs at 93.75Hz (512 samples at 48kHz). Zero audio glitches over 18 months in production, even under thermal throttling or background app refresh.

Benchmarking Lock-Free vs Mutex

On iPhone 14 Pro (A16), I measured parameter queue latency:

pthread_mutex: 180ns median, 12µs p99 (priority inversion spikes)
os_unfair_lock: 45ns median, 320ns p99
Lock-free SPSC: 18ns median, 28ns p99

Against pthread_mutex, the lock-free version is 10× faster in the median case and roughly 430× faster at p99; against os_unfair_lock the gaps are 2.5× and about 11×. More importantly, the tail latency is bounded: there are no pathological cases where a low-priority thread stalls the audio thread.
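The measurement harness is not described here, so the following is only a sketch of one way such numbers could be collected: time each operation with std::chrono::steady_clock and report nearest-rank percentiles. The names measure_ns and percentile are hypothetical. For operations in the tens of nanoseconds, the clock calls themselves cost a comparable amount, so a real harness would time batches or subtract the measured clock overhead.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile, p in (0, 100], computed on a sorted copy of the samples.
inline std::uint64_t percentile(std::vector<std::uint64_t> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 100.0);
    std::sort(samples.begin(), samples.end());
    const auto rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(samples.size()) / 100.0));
    return samples[rank - 1];
}

// Calls op() `iterations` times and returns each call's duration in nanoseconds.
template <typename Op>
std::vector<std::uint64_t> measure_ns(Op&& op, std::size_t iterations) {
    std::vector<std::uint64_t> samples;
    samples.reserve(iterations);  // allocate up front, not inside the timed loop
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto start = std::chrono::steady_clock::now();
        op();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count()));
    }
    return samples;
}
```

A typical use would wrap one push/pop pair in the lambda passed to measure_ns, then report percentile(samples, 50.0) and percentile(samples, 99.0) while a competing low-priority thread is active.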

When Not to Use Lock-Free

Lock-free isn't free. The code is harder to reason about, harder to debug (race conditions manifest as subtle corruption), and cache-line bouncing between cores can hurt throughput. Use it when:

One thread is real-time and cannot tolerate blocking
Contention is low (SPSC or 2-3 producers)
Data structures are simple (queues, stacks, not trees)

For bulk parameter updates (loading a new preset with 50 coefficients), a mutex is fine—do it outside the audio callback. The lock-free queue is for hot-path, per-buffer updates: gain adjustments, adaptive filtering, dynamic range compression.

Conclusion

Lock-free audio queues are essential infrastructure for real-time mobile DSP. By eliminating priority inversion and bounding worst-case latency, they enable responsive audio applications (hearing aids, voice processing, music synthesis) that keep meeting a roughly 10 ms callback deadline even under system load. The cost is careful engineering: memory ordering, alignment, and testing under adversarial conditions. But for applications where audio glitches are unacceptable, the tradeoff is clear.