Backpressure Semaphores: LLM Streaming Memory

The Problem: Unbounded Token Queues

On-device LLM inference on mobile presents a deceptive memory trap. A typical streaming implementation spawns a background thread for token generation and pushes each decoded token into a queue consumed by the UI thread. When the UI thread blocks—say, during a 120ms layout recalculation or a user scroll gesture—tokens accumulate. A 7B parameter model generating 18 tokens/sec can queue 2,160 tokens in two minutes of intermittent UI jank. At 768-dimensional embeddings (float16), that's 3.3 MB of queued state, plus KV cache growth, plus any pending frame buffers. On a device with 3 GB available to the app, this pattern has caused OOM crashes in production.

The core issue is producer-consumer rate mismatch without flow control. The inference engine produces tokens at hardware speed; the UI consumes them at human perception speed (roughly 4-6 tokens/sec for comfortable reading). Traditional unbounded queues—Channel in Kotlin, AsyncStream in Swift, or even a simple List—offer no mechanism to slow the producer when the consumer lags.

Semaphore-Based Backpressure

A counting semaphore with a fixed permit count acts as a flow-control valve. Before enqueueing each token, the producer acquires a permit; after the consumer dequeues and renders, it releases one. When permits are exhausted, the producer blocks until the consumer catches up. This creates natural backpressure: token generation pauses when the queue reaches capacity, preventing runaway memory growth.

Implementation in Swift for an on-device llama.cpp wrapper:

actor TokenStream {
  private let semaphore: DispatchSemaphore
  private let queue: CircularBuffer<Token>
  private let capacity: Int

  init(capacity: Int = 32) {
    self.capacity = capacity
    self.semaphore = DispatchSemaphore(value: capacity)
    self.queue = CircularBuffer(capacity: capacity)
  }

  func produce(_ token: Token) async {
    semaphore.wait()  // blocks if queue full
    queue.append(token)
  }

  func consume() async -> Token? {
    guard let token = queue.popFirst() else { return nil }
    semaphore.signal()  // release permit
    return token
  }
}

The capacity parameter is the critical tuning knob. Too small (8-16 tokens) and you add latency: the producer stalls even during brief UI hiccups. Too large (256+) and you're back to unbounded behavior. Field testing across HearingAid Pro and OfflineAI converged on 32-48 tokens as the sweet spot for 7B models on iPhone 12 and newer. This provides roughly 2-3 seconds of buffering at typical generation rates, enough to absorb transient UI pauses without visible stuttering.

Circular Buffer Integration

Pairing the semaphore with a fixed-size circular buffer eliminates allocation churn. A naive approach using a Swift Array or Kotlin MutableList triggers reallocation and copy on growth. A pre-allocated ring buffer with head/tail pointers avoids this:

struct CircularBuffer<T> {
  private var storage: [T?]
  private var head = 0
  private var tail = 0
  private let capacity: Int

  init(capacity: Int) {
    self.capacity = capacity
    self.storage = Array(repeating: nil, count: capacity)
  }

  mutating func append(_ element: T) {
    storage[tail] = element
    tail = (tail + 1) % capacity
  }

  mutating func popFirst() -> T? {
    guard head != tail else { return nil }
    let element = storage[head]
    storage[head] = nil
    head = (head + 1) % capacity
    return element
  }
}

This structure provides O(1) enqueue and dequeue with zero allocations after initialization. Memory Profiler runs on iPhone 13 Pro showed allocation frequency dropped from 140 allocations/sec (List-backed queue) to 0 after switching to circular buffer for token streaming in OfflineAI. Peak memory usage fell 18% during 3-minute chat sessions.

Timeout and Cancellation

Pure blocking semaphores introduce a new failure mode: if the consumer thread crashes or the user navigates away, the producer deadlocks waiting for permits that will never be released. Adding timeout to the wait operation provides an escape hatch:

func produce(_ token: Token) async throws {
  let acquired = semaphore.wait(timeout: .now() + 5.0)
  guard acquired == .success else {
    throw StreamError.backpressureTimeout
  }
  queue.append(token)
}

When a timeout fires, the producer can either discard the token (acceptable for non-critical streaming) or escalate to error handling. In KidzCare's speech therapy module, timeouts trigger a UI alert and graceful shutdown of the inference session rather than leaving the model thread spinning.

Task cancellation requires explicit cleanup. Swift's Task.isCancelled or Kotlin's isActive checks should be polled in the generation loop:

while !Task.isCancelled {
  let token = llama_decode_next(ctx)
  try await stream.produce(token)
}

On cancellation, drain any remaining permits to unblock waiting producers:

func cancel() {
  for _ in 0..<capacity {
    semaphore.signal()
  }
}

Metrics and Observability

Backpressure events are invisible to users but critical for performance tuning. Instrument the semaphore wrapper to track wait time and timeout frequency:

actor TokenStreamMetrics {
  var totalWaits: Int = 0
  var totalWaitTime: TimeInterval = 0
  var timeouts: Int = 0

  func recordWait(duration: TimeInterval, timedOut: Bool) {
    totalWaits += 1
    totalWaitTime += duration
    if timedOut { timeouts += 1 }
  }

  var averageWaitMs: Double {
    totalWaits > 0 ? (totalWaitTime / Double(totalWaits)) * 1000 : 0
  }
}

In production telemetry from OfflineAI, 95th percentile wait time stayed below 12ms with 32-token capacity on Snapdragon 8 Gen 2 devices. Timeout rate was 0.03% across 140,000 inference sessions. Devices with less than 4 GB RAM showed 3× higher timeout rates, indicating the need for adaptive capacity based on available memory.

Adaptive Capacity Tuning

Static capacity works for controlled environments but fails under variable load. A dynamic approach adjusts capacity based on memory pressure and generation rate:

func adjustCapacity() {
  let memoryPressure = ProcessInfo.processInfo.thermalState
  let generationRate = recentTokensPerSecond()

  let newCapacity: Int
  switch (memoryPressure, generationRate) {
  case (.nominal, _ where generationRate > 15):
    newCapacity = 48
  case (.fair, _), (.nominal, _):
    newCapacity = 32
  case (.serious, _), (.critical, _):
    newCapacity = 16
  }

  if newCapacity != currentCapacity {
    resizeSemaphore(to: newCapacity)
  }
}

Resizing a live semaphore requires draining the old instance and initializing a new one, which introduces a brief stall. In practice, checking memory pressure every 10 seconds and adjusting capacity in 16-token increments kept disruption below user perception thresholds.

Comparison to Reactive Streams

Reactive frameworks like Combine (Swift) or Kotlin Flow offer built-in backpressure via buffer operators and onBackpressureDrop strategies. However, they add runtime overhead—Combine's Publisher machinery allocates 8-12 objects per emitted value. For LLM streaming where tokens fire 15-20 times per second, this overhead is measurable. Profiling showed Combine-based streaming consumed 14% more CPU than the raw semaphore approach during sustained generation on iPhone 11.

The semaphore pattern also composes cleanly with existing C/C++ inference engines. Libraries like llama.cpp and ONNX Runtime expose synchronous decode functions; wrapping them in async Swift or Kotlin tasks with semaphore gating requires minimal integration code. Reactive streams often demand callback-based or coroutine-aware APIs that don't map naturally to C FFI boundaries.

Production Lessons

Deploying backpressure semaphores in HearingAid Pro's real-time transcription pipeline revealed two non-obvious failure modes. First, semaphore starvation during app backgrounding: iOS suspends threads aggressively, and a suspended consumer never releases permits. The fix was to flush the queue and release all permits in applicationWillResignActive. Second, priority inversion when the UI thread (high QoS) waits on a semaphore held by the inference thread (default QoS). Elevating the inference thread to userInitiated QoS eliminated 200ms stalls during rapid user interaction.

Memory savings were quantifiable: median session memory for 10-minute conversations dropped from 487 MB to 312 MB after semaphore gating was deployed. OOM crash rate fell from 0.8% to 0.09% on devices with 3 GB RAM. The approach has since been adopted in GlucoScan AI for PPG signal streaming and in SafeChat's WebRTC packet buffering layer, demonstrating applicability beyond LLM inference.