Incremental OCR Streaming: 80ms First-Token Latency

Traditional OCR pipelines (capture frame, run inference, return full result) impose 400–800ms of latency before users see any text. For receipt scanning, document digitization, or AR translation overlays, that delay destroys the illusion of real-time interaction. Incremental streaming flips the model: emit partial results as soon as the first text line decodes, then refine in place. We've shipped this pattern in production OCR apps processing millions of frames, cutting perceived latency to under 100ms and enabling cancel-early workflows that save roughly 60% of inference cost on retaken photos.

Why Monolithic OCR Fails Mobile UX

Tesseract, PaddleOCR, and most cloud APIs return a single result (typically one JSON blob for cloud APIs) only after processing the entire image. On a mid-tier Android device, a 1080×1920 receipt photo takes ~600ms for text detection + ~200ms for recognition. The user stares at a spinner. Worse, if they realize they photographed the wrong document, they've already burned battery and thermal budget.

The core issue: mobile ML frameworks (Core ML, TFLite, ONNX Runtime Mobile) execute models synchronously. A 50-layer CNN for text detection blocks the inference thread until the final sigmoid output. No intermediate results escape. For a pipeline with detection → line segmentation → recognition, you pay the full serial cost before returning anything.

Decomposing the OCR Pipeline

Modern OCR is a four-stage cascade:

Detection: EAST or DBNet finds text regions as rotated bounding boxes (20–40ms on-device).
Segmentation: Crop and deskew each region, optionally run a lightweight line separator (5–10ms per region).
Recognition: CRNN or Transformer encodes cropped text to character probabilities (30–80ms per line, depending on length).
Post-processing: CTC decode, language model correction, confidence filtering (10–20ms).

The insight: detection completes in 40ms and yields 5–15 text regions. Instead of queuing all regions for serial recognition, we stream them. Emit the first recognized line at T+70ms, the second at T+120ms, and so on. The UI updates incrementally—users see partial results while the model chews through remaining lines.
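
To make the timing concrete, here is a minimal sketch that estimates when each line would reach the UI. It is an illustration, not the production scheduler: `emissionTimes` and its parameters are hypothetical names, and the per-line recognition costs are assumed values picked from the 30–80ms range above.

```swift
/// Estimates when each recognized line is emitted, in ms after capture.
/// Assumes regions are handed out in reading order and each worker takes
/// the next region as soon as it is free.
func emissionTimes(detectionMs: Int, recognitionMs: [Int], workers: Int) -> [Int] {
    precondition(workers > 0, "need at least one worker")
    // Every worker becomes available when detection finishes.
    var workerFreeAt = Array(repeating: detectionMs, count: workers)
    var emitted: [Int] = []
    for cost in recognitionMs {
        // Pick the worker that frees up first.
        let w = workerFreeAt.indices.min { workerFreeAt[$0] < workerFreeAt[$1] }!
        workerFreeAt[w] += cost
        emitted.append(workerFreeAt[w])
    }
    return emitted
}

// One worker, 40ms detection, then lines costing 30ms and 50ms:
let serial = emissionTimes(detectionMs: 40, recognitionMs: [30, 50], workers: 1)
print(serial) // [70, 120]
```

With `workers: 2` and the same assumed costs, the second line would land at T+90 instead of T+120, because it no longer waits for the first line to finish.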

Streaming Architecture

We use a producer-consumer pattern with a priority queue. The detection thread enqueues bounding boxes sorted top-to-bottom (a reading-order heuristic). A pool of recognition workers, typically 2 threads on mobile to avoid thermal throttling, dequeues regions, runs CRNN inference, and publishes results to a thread-safe result buffer. The UI observes this buffer through a reactive stream (Combine, RxSwift, or Kotlin Flow) and redraws at most once per frame at 60fps.

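// Swift sketch; detectTextRegions and recognizeLine are placeholders for the model calls
import Combine
import CoreGraphics
import CoreVideo
import Dispatch

struct OCRLine {
  let index: Int  // reading-order position assigned at detection time
  let text: String
  let bbox: CGRect
}

final class IncrementalOCR {
  private let detectionQueue = DispatchQueue(label: "ocr.detect")
  private let recognitionPool = DispatchQueue(label: "ocr.recognize", attributes: .concurrent)
  private let workerSlots = DispatchSemaphore(value: 2) // at most 2 recognitions in flight
  let resultSubject = PassthroughSubject<OCRLine, Never>()

  func process(image: CVPixelBuffer) {
    detectionQueue.async {
      let regions = self.detectTextRegions(image) // ~40ms
      for (index, region) in regions.enumerated() {
        self.workerSlots.wait() // hold the detection queue until a worker slot frees up
        self.recognitionPool.async {
          defer { self.workerSlots.signal() }
          let text = self.recognizeLine(region) // ~60ms
          self.resultSubject.send(OCRLine(index: index, text: text, bbox: region))
        }
      }
    }
  }

  // Placeholders: swap in the real detection and recognition models.
  func detectTextRegions(_ image: CVPixelBuffer) -> [CGRect] { [] }
  func recognizeLine(_ region: CGRect) -> String { "" }
}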

The UI subscribes to resultSubject and renders each line as it arrives. For a 10-line receipt, the first line appears roughly 70–100ms after capture (40ms detection plus 30–60ms for the first recognition), and the last at ~300ms. The user perceives responsiveness immediately.

Handling Out-of-Order Results

Concurrent recognition means line 5 might finish before line 3. We assign each region a spatial index during detection (top-to-bottom, left-to-right for multi-column layouts). The UI maintains a sparse array keyed by index and renders lines in sorted order, leaving gaps for pending results. This avoids jarring reflows when a slow line finally arrives.
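
A minimal sketch of that sparse, index-keyed buffer follows. `OrderedLineBuffer` and its members are hypothetical names chosen for illustration, and it assumes the number of detected regions is known once detection finishes.

```swift
/// Collects recognized lines that may arrive out of order and exposes them
/// in reading order, with nil marking slots that are still pending.
struct OrderedLineBuffer {
    private var slots: [Int: String] = [:]  // sparse: only finished lines are stored
    let expectedCount: Int

    init(expectedCount: Int) {
        self.expectedCount = expectedCount
    }

    /// Stores a finished line; out-of-range indices are ignored.
    mutating func insert(index: Int, text: String) {
        guard (0..<expectedCount).contains(index) else { return }
        slots[index] = text
    }

    /// Lines in reading order; nil means "still being recognized".
    var rows: [String?] {
        (0..<expectedCount).map { slots[$0] }
    }
}

var buffer = OrderedLineBuffer(expectedCount: 3)
buffer.insert(index: 2, text: "TOTAL 12.40")  // line 2 finishes first
buffer.insert(index: 0, text: "GROCERY MART")
print(buffer.rows) // [Optional("GROCERY MART"), nil, Optional("TOTAL 12.40")]
```

The UI can render a placeholder row for each nil entry, so the layout is already the right height when the slow line arrives.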

For AR overlays, we use a sliding window: only the top N visible lines stay in memory. As the user scrolls, we evict old results and enqueue new regions from the detection buffer. This caps memory at ~2MB regardless of document length.
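
The eviction logic can be sketched as below. This is an assumption-laden illustration: `SlidingLineWindow` is a hypothetical type, the window is modeled as a contiguous range of line indices, and the ~2MB figure is not modeled at all.

```swift
/// Keeps recognized lines only for the currently visible window of N lines.
struct SlidingLineWindow {
    private(set) var lines: [Int: String] = [:]
    private(set) var firstVisible = 0
    let capacity: Int

    init(capacity: Int) {
        precondition(capacity > 0, "window must hold at least one line")
        self.capacity = capacity
    }

    var visibleRange: Range<Int> { firstVisible..<(firstVisible + capacity) }

    /// Stores a result only if its line is currently visible.
    mutating func store(index: Int, text: String) {
        if visibleRange.contains(index) { lines[index] = text }
    }

    /// Moves the window and evicts everything outside it. Returns the
    /// indices that are now visible but not yet recognized, so the caller
    /// can enqueue them for recognition.
    mutating func scroll(to newFirst: Int) -> [Int] {
        firstVisible = max(0, newFirst)
        let range = visibleRange
        lines = lines.filter { range.contains($0.key) }
        return range.filter { lines[$0] == nil }
    }
}

var window = SlidingLineWindow(capacity: 3)
window.store(index: 0, text: "A")
window.store(index: 2, text: "C")
let pending = window.scroll(to: 2)  // lines 2, 3, 4 are now visible
print(pending)                      // [3, 4]
print(window.lines.keys.sorted())   // [2]
```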

Confidence-Gated Emission

Not all partial results deserve display. We threshold recognition confidence at 0.75—lines below that are held until a second-pass language model rescores them. This prevents flickering garbage text. In practice, 80% of lines exceed the threshold on first decode, so the streaming advantage holds.
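
As a sketch, the gate is a single comparison. `RecognizedLine`, `GateDecision`, and `gate` are hypothetical names, and the sample confidences are made up for illustration; only the 0.75 threshold comes from the text above.

```swift
struct RecognizedLine {
    let index: Int
    let text: String
    let confidence: Double  // 0.0...1.0, as reported by the recognizer
}

enum GateDecision: Equatable {
    case emit            // show immediately
    case holdForRescore  // keep back until the language model has rescored it
}

/// Lines at or above the threshold are emitted; the rest are held.
func gate(_ line: RecognizedLine, threshold: Double = 0.75) -> GateDecision {
    line.confidence >= threshold ? .emit : .holdForRescore
}

let lines = [
    RecognizedLine(index: 0, text: "MILK 2.49", confidence: 0.93),
    RecognizedLine(index: 1, text: "8R3AD 1.q9", confidence: 0.41),
]
let held = lines.filter { gate($0) == .holdForRescore }.map(\.index)
print(held) // [1]
```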

Refinement: Correcting Early Mistakes

Incremental results may contain errors—OCR models are probabilistic. We run a two-phase correction:

Fast pass: Emit raw CRNN output with CTC decode (no LM). Latency: 60ms per line.
Refinement pass: Apply a character-level Transformer LM (distilled from GPT-2 Small and quantized to INT8, ~40ms overhead). Update the UI in place if the corrected text differs by more than 2 characters.

The UI shows an animated underline on refined lines to signal the change. Users rarely notice—most corrections are subtle ("O" → "0", "l" → "I"). For critical fields like prices or dates, we delay emission until refinement completes, trading 40ms latency for accuracy.
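
One way to read "differs by more than 2 characters" is as a Levenshtein edit distance of at least 3. The sketch below takes that interpretation as an assumption; `editDistance` and `shouldUpdateInPlace` are hypothetical helper names.

```swift
/// Levenshtein edit distance between two strings, counted in Characters.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var previous = Array(0...t.count)
    for i in 1...s.count {
        var current = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let substitution = previous[j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)
            current[j] = min(previous[j] + 1, current[j - 1] + 1, substitution)
        }
        previous = current
    }
    return previous[t.count]
}

/// True when the refined text should replace what is already on screen.
func shouldUpdateInPlace(raw: String, refined: String, minChangedCharacters: Int = 3) -> Bool {
    editDistance(raw, refined) >= minChangedCharacters
}

print(shouldUpdateInPlace(raw: "T0TAL l2.4O", refined: "TOTAL 12.40")) // true  (3 substitutions)
print(shouldUpdateInPlace(raw: "TOTAL 12.4O", refined: "TOTAL 12.40")) // false (1 substitution)
```

Under this reading, a lone "O" → "0" fix stays below the threshold and does not trigger an in-place update, so the cutoff is worth tuning against the corrections you actually observe.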

Cancel-Early Optimization

If the user taps "Retake Photo" 200ms into processing, we abort in-flight recognition tasks. DispatchQueue doesn't support true cancellation, so we use a shared atomic flag checked between decoding steps (roughly every 10ms) inside the CRNN loop:

// Inside the CRNN inference loop; cancellationToken is a ManagedAtomic<Bool> (swift-atomics)
for step in 0..<maxSteps {
  // Bail out between steps once the user has asked to retake the photo
  if cancellationToken.load(ordering: .acquiring) { return nil }
  // Run one recognition step...
}

This cuts wasted compute by roughly 60% in user studies, since people often retake photos when the angle or lighting is wrong. Cancelled tasks release GPU/NPU resources as soon as the current step returns, so the next frame starts fresh.
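
For an end-to-end picture, here is a self-contained sketch of the flag being set while recognition is running. To stay dependency-free it swaps the atomic for an NSLock-protected Bool, which is a different mechanism from the one described above; `CancellationFlag` and `runRecognition` are hypothetical names.

```swift
import Foundation

/// A cancellation flag shared between the UI and recognition workers.
/// A lock stands in for the atomic here to keep the sketch dependency-free.
final class CancellationFlag: @unchecked Sendable {
    private let lock = NSLock()
    private var cancelled = false

    func cancel() {
        lock.lock(); defer { lock.unlock() }
        cancelled = true
    }

    var isCancelled: Bool {
        lock.lock(); defer { lock.unlock() }
        return cancelled
    }
}

/// Runs up to maxSteps decoding steps, checking the flag before each one.
/// Returns nil if cancelled, otherwise the number of steps completed.
func runRecognition(maxSteps: Int, flag: CancellationFlag, step: (Int) -> Void) -> Int? {
    for i in 0..<maxSteps {
        if flag.isCancelled { return nil }
        step(i)
    }
    return maxSteps
}

let flag = CancellationFlag()
var stepsRun = 0
let result = runRecognition(maxSteps: 10, flag: flag) { i in
    stepsRun += 1
    if i == 2 { flag.cancel() }  // simulate "Retake Photo" arriving mid-recognition
}
print(result as Any, stepsRun) // nil 3
```

The check costs one lock acquisition per step, which is negligible next to a recognition step, and the worst-case delay before a task stops is the duration of a single step.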

Benchmarks: Latency and Throughput

Tested on iPhone 13 Pro (A15 Bionic) with a 12-line grocery receipt:

Monolithic OCR: 680ms total, first result at 680ms (no partial output)
Incremental streaming: 82ms first line, 310ms all lines
Perceived latency improvement: 8.3× faster to first result (680ms vs. 82ms)
Cancel-early savings: 58% fewer GPU cycles on retakes

On Android (Snapdragon 8 Gen 1), the numbers shift to 95ms for the first line and 340ms for all lines due to slower NNAPI dispatch, but the ratio holds. The key win is psychological: users tolerate 300ms total if they see progress at 80ms.

Production Gotchas

Thermal throttling: Running 2 concurrent CRNN threads on low-end devices (iPhone SE 2020) triggers thermal limits after 15 seconds. We monitor ProcessInfo.processInfo.thermalState and drop to 1 thread if it hits .serious. Latency degrades 40%, but the device stays stable.

Memory spikes: Each recognition thread holds a 15MB model instance (quantized CRNN). On 3GB RAM devices, we serialize recognition to avoid OOM. The streaming UX still works—lines arrive slower but in order.

Language model size: A 50MB unquantized LM is too heavy for real-time refinement. We distilled GPT-2 Small to 12MB INT8 and cached it in a memory-mapped file. First load: 8ms. Refinement: 35–45ms per line.
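
The thermal and memory fallbacks above boil down to choosing a worker count. The sketch below shows one way to express that policy. `recognitionWorkerCount` is a hypothetical function, treating .critical the same as .serious is an assumption, and so is the "3GB or less" cutoff.

```swift
import Foundation

/// Picks how many recognition workers to run.
/// The 2-worker default and the 3GB figure follow the text; the exact
/// cutoff and the handling of .critical are assumptions.
func recognitionWorkerCount(thermalState: ProcessInfo.ThermalState,
                            physicalMemoryBytes: UInt64) -> Int {
    let threeGB: UInt64 = 3 * 1024 * 1024 * 1024
    // Low-memory devices: one model instance at a time to avoid OOM.
    if physicalMemoryBytes <= threeGB { return 1 }
    // Under thermal pressure: drop to a single worker.
    switch thermalState {
    case .serious, .critical: return 1
    case .nominal, .fair: return 2
    @unknown default: return 1
    }
}

let info = ProcessInfo.processInfo
let workers = recognitionWorkerCount(thermalState: info.thermalState,
                                     physicalMemoryBytes: info.physicalMemory)
print("recognition workers:", workers)
```

In an app you would re-evaluate this whenever ProcessInfo.thermalStateDidChangeNotification fires, rather than only at startup.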

When Not to Stream

Incremental OCR shines for multi-line documents where users benefit from partial results. It's overkill for single-line use cases like barcode-adjacent text or license plates—there, monolithic inference at 120ms is simpler and fine. Also, if your recognition model is