Cancellable Task Graphs in Mobile AI Pipelines

When a user swipes away from a screen running on-device inference—object detection, LLM completion, or audio transcription—the naive implementation leaks GPU memory, burns battery, and blocks the next request. Mobile AI pipelines are rarely linear: a single user action triggers preprocessing, model inference, postprocessing, and often secondary models in sequence or parallel. Without explicit cancellation propagation through this task graph, you end up with zombie work consuming resources long after the UI moved on.

This article dissects the architecture of cancellable task graphs in mobile AI, drawing from production experience shipping models in Swift, Kotlin, and Dart. We'll cover structured concurrency primitives, dependency-aware cancellation, resource cleanup guarantees, and the tradeoffs between cooperative vs preemptive cancellation in GPU-bound workloads.

Why Linear Cancellation Fails

A typical vision pipeline might look like this: decode image → resize → normalize → run CoreML model → NMS postprocessing → draw bounding boxes. If the user navigates away after resize completes, you want to cancel the CoreML invocation and everything downstream. But iOS's Task.cancel() or Android's Job.cancel() only set a flag—the model still runs unless you explicitly check Task.isCancelled at strategic points.

Worse, many inference APIs are synchronous blocking calls. CoreML's prediction(from:) doesn't yield control until the forward pass completes, even if the parent task is cancelled. Metal Performance Shaders graph execution is similar: once you call encode(to:), the GPU command buffer is queued and will execute unless you explicitly signal the command queue. In practice, this means a 300ms model invocation runs to completion even when the result is already irrelevant.

The Resource Leak Problem

On-device models hold significant resources: a quantized Stable Diffusion model might allocate 2GB of Metal buffers, a speech recognition model keeps 512MB of audio ringbuffer in shared memory, an LLM holds KV cache tensors across requests. If cancellation doesn't propagate, these buffers stay pinned until the task finishes naturally. Run three abandoned inferences in parallel and you've OOM-killed your app.

Shipping an offline LLM chat app, we observed that 40% of user-initiated completions were abandoned within the first two tokens. Without proper cancellation, the app would continue generating 50+ tokens, consuming 800ms of ANE time and preventing the next request from starting. Users experienced this as "the app ignores my new question."

Structured Concurrency as Task Graph Foundation

Structured concurrency—available in Swift via async/await, Kotlin Coroutines, and Dart isolates—provides parent-child task relationships that form a natural DAG. When a parent task cancels, all children cancel automatically. This handles the easy case: a single linear pipeline where each step is an async function.

func runPipeline(image: UIImage) async throws -> [Detection] {
  let resized = try await resize(image)
  try Task.checkCancellation()
  let normalized = try await normalize(resized)
  try Task.checkCancellation()
  let output = try await model.prediction(from: normalized)
  try Task.checkCancellation()
  return try await postprocess(output)
}

This works if each step is independently cancellable. But real pipelines have branches: run face detection and object detection in parallel, then merge results. Or run a cheap classifier first, then conditionally invoke an expensive model. These require explicit graph management.

Parallel Task Groups with Selective Cancellation

Swift's TaskGroup and Kotlin's coroutineScope let you spawn multiple children and wait for all or any to complete. For parallel model execution, you want to cancel the group if any single model fails or if the parent cancels, but allow successful branches to complete if they finish before cancellation propagates.

func detectObjects(image: UIImage) async throws -> CombinedResult {
  try await withThrowingTaskGroup(of: PartialResult.self) { group in
    group.addTask { try await faceDetector.run(image) }
    group.addTask { try await objectDetector.run(image) }
    
    var faces: [Face] = []
    var objects: [Object] = []
    
    for try await result in group {
      switch result {
      case .faces(let f): faces = f
      case .objects(let o): objects = o
      }
    }
    
    return CombinedResult(faces: faces, objects: objects)
  }
}

If the parent task cancels mid-execution, withThrowingTaskGroup cancels both child tasks immediately. But if faceDetector is already in a synchronous CoreML call, it won't stop until that call returns. This is where cooperative cancellation shows its limits.

Preemptive Cancellation for GPU Workloads

For long-running GPU work, cooperative cancellation isn't enough. You need to cancel the GPU command buffer itself. On iOS, Metal command buffers support addCompletedHandler and explicit cancellation via command queue management. The pattern: submit work with a cancellation token, and if the token fires before the GPU finishes, submit a no-op command buffer to flush the queue and immediately mark the original work as cancelled.

class CancellableInference {
  private var commandBuffer: MTLCommandBuffer?
  
  func run(input: MTLTexture) async throws -> MTLTexture {
    let buffer = queue.makeCommandBuffer()!
    commandBuffer = buffer
    
    // Encode model inference
    encoder.encode(to: buffer)
    
    return try await withTaskCancellationHandler {
      try await withCheckedThrowingContinuation { continuation in
        buffer.addCompletedHandler { cb in
          if cb.status == .cancelled {
            continuation.resume(throwing: CancellationError())
          } else {
            continuation.resume(returning: outputTexture)
          }
        }
        buffer.commit()
      }
    } onCancel: {
      // Signal GPU to abandon work
      commandBuffer?.cancel()
    }
  }
}

This approach reduced tail latency in a real-time object detection app by 60%. When users rapidly swiped through camera frames, abandoned inferences would cancel within 5ms instead of blocking for 180ms. The GPU queue stayed responsive, and the next frame's inference started immediately.

Resource Cleanup Guarantees

Cancellation must trigger explicit cleanup. In Swift, defer blocks run even on cancellation, making them ideal for releasing buffers. In Kotlin, finally blocks serve the same purpose. For shared GPU resources like Metal heaps or ONNX Runtime sessions, use reference counting with automatic cleanup on task exit.

func processAudio(buffer: AVAudioPCMBuffer) async throws -> Transcript {
  let session = try await sessionPool.acquire()
  defer { sessionPool.release(session) }
  
  let preprocessed = try await preprocess(buffer)
  try Task.checkCancellation()
  
  let result = try await session.run(input: preprocessed)
  return try await decode(result)
}

In a speech recognition app processing 30-second audio clips, this pattern ensured that ONNX Runtime sessions—each holding 400MB of weight tensors—were always returned to the pool within 10ms of cancellation, even if the task was cancelled mid-inference.

Conditional Execution and Early Exit

Many AI pipelines have a "cheap classifier first" pattern: run a 5ms MobileNet to decide if a 200ms ResNet is needed. If the cheap model's confidence is high, skip the expensive one. Structured concurrency makes this natural with early returns.

func classifyImage(image: UIImage) async throws -> Classification {
  let quick = try await mobileNet.run(image)
  try Task.checkCancellation()
  
  if quick.confidence > 0.95 {
    return quick.topClass
  }
  
  let detailed = try await resNet.run(image)
  return detailed.topClass
}

This saved 85% of inference time in a production e-commerce image tagging pipeline. Most product photos are unambiguous; only 12% required the expensive model. By checking cancellation between stages, we ensured that if the user navigated away during the cheap inference, the expensive one never started.

Cross-Platform Patterns

Flutter's isolate model requires explicit message passing, so cancellation tokens must be sent as messages. A typical pattern: spawn an isolate for inference, send a cancellation port along with the input, and listen on that port for abort signals.

class InferenceIsolate {
  Future run(Input input, CancellationToken token) async {
    final receivePort = ReceivePort();
    final isolate = await Isolate.spawn(_inferenceWorker, [
      input,
      receivePort.sendPort,
      token.port,
    ]);
    
    token.onCancel = () {
      isolate.kill(priority: Isolate.immediate);
    };
    
    return await receivePort.first as Result;
  }
}

Killing the isolate is preemptive but crude—in-flight GPU work continues until the next kernel launch. For finer control, poll the cancellation port between model layers if the framework supports layer-by-layer execution (TensorFlow Lite delegates do; ONNX Runtime less so).

Observability and Debugging

Cancelled tasks are invisible in crash logs, making them hard to debug. Instrument cancellation points with structured logging: record task ID, cancellation reason (user navigation, timeout, error), and resource state at cancellation time. In production, we found that 15% of cancellations happened during GPU command buffer encoding—a sign that the previous inference hadn't cleaned up properly.

Use Xcode's Instruments "Points of Interest" or Android Profiler custom events to visualize task lifetimes and cancellation propagation. A flame graph showing when child tasks actually stop relative to parent cancellation reveals where cooperative cancellation is too slow.

Tradeoffs and Gotchas

Preemptive GPU cancellation isn't free: Metal command buffer cancellation adds 2-5ms of synchronization overhead. For inferences under 20ms, this overhead dominates, so cooperative cancellation is better. For 100ms+ models, preemptive wins decisively.

Shared resource pools (ONNX sessions, Metal heaps) must be thread-safe and cancellation-aware. A naive pool might hand out a session that's mid-cancellation, leading to corrupted state. Use atomic reference counts and mark sessions as "cancelling" to prevent reuse until cleanup completes.

Finally, cancellation is not rollback. If your pipeline writes to a database or uploads partial results, cancellation won't undo those side effects. Design pipelines to be idempotent or use transactional patterns where atomicity matters.

Conclusion

Cancellable task graphs transform mobile AI from a resource leak minefield into a responsive, efficient system. Structured concurrency provides the foundation, preemptive GPU cancellation handles the hard cases, and explicit resource cleanup guarantees prevent leaks. In production, these patterns reduced inference-related memory pressure by 70% and improved UI responsiveness by eliminating blocked queues. The cost: 200 lines of cancellation infrastructure and careful testing of edge cases—a small price for apps that respect user intent and device resources.