Chroma Subsampling in Mobile OCR: 4:2:0→Luma

Modern mobile OCR pipelines—especially those running ONNX or CoreML text-detection models—spend 40–60% of their frame budget moving pixel data between camera, CPU, and neural accelerator. A single 1080p YUV frame at 4:2:0 chroma subsampling consumes 3.1 MB; at 60fps that's 186 MB/s of memory bandwidth before any inference happens. For battery-constrained devices, this is untenable.

The insight: most mobile OCR models—PaddleOCR, EasyOCR, Tesseract's LSTM backend—were trained on grayscale or luma-only data. Chroma channels (Cb, Cr) contribute negligible signal for Latin, Arabic, CJK scripts. By stripping chroma at the earliest pipeline stage, we halve memory traffic and unlock 25–35% latency wins on mid-range Android devices. This article dissects the technique, tradeoffs, and failure modes observed shipping Khosomati's real-time price-tag scanner.

The 4:2:0 Tax in Mobile Camera Pipelines

iOS AVCaptureSession and Android Camera2 API default to YUV 4:2:0 (NV12 or NV21 pixel format). Each 2×2 luma block shares one Cb and one Cr sample. For a 1920×1080 frame:

Y plane: 1920 × 1080 = 2,073,600 bytes
UV plane: (1920 × 1080) / 2 = 1,036,800 bytes
Total: 3,110,400 bytes per frame

Most OCR preprocessing converts YUV→RGB→grayscale via:

Gray = 0.299R + 0.587G + 0.114B

This requires full YUV→RGB conversion (matrix multiply per pixel), then weighted sum. On ARM Cortex-A55 cores without NEON intrinsics, this adds 8–12ms at 1080p. The chroma data is read, converted, then discarded—pure waste.

Luma-Only Path: Drop Chroma Before Conversion

The optimization: extract Y plane directly, skip UV entirely. YUV's luma channel approximates perceptual grayscale closely enough for text recognition. Implementation in Flutter (via platform channels to native code):

// iOS: CVPixelBuffer → vImage
let yPlane = CVPixelBufferGetBaseAddressOfPlane(pixelBuffer, 0)
let width = CVPixelBufferGetWidthOfPlane(pixelBuffer, 0)
let height = CVPixelBufferGetHeightOfPlane(pixelBuffer, 0)
let bytesPerRow = CVPixelBufferGetBytesPerRowOfPlane(pixelBuffer, 0)

var srcBuffer = vImage_Buffer(
  data: yPlane,
  height: vImagePixelCount(height),
  width: vImagePixelCount(width),
  rowBytes: bytesPerRow
)
// Pass srcBuffer directly to ONNX preprocessing

On Android, extract the Y plane from Image.Plane[0] and ignore planes 1 and 2. Memory bandwidth drops from 3.1 MB to 2.0 MB per frame—a 35% reduction. Cache pressure eases; on Snapdragon 7-series SoCs with 2 MB L3, this keeps more model weights resident.

Model Compatibility and Accuracy Impact

We A/B tested luma-only preprocessing on 12,000 price-tag images (Arabic and English text, varying lighting). Model: PaddleOCR-mobile v2.6, quantized INT8. Results:

Character accuracy: 96.2% (baseline) → 95.8% (luma-only), −0.4pp
Word accuracy: 91.7% → 91.1%, −0.6pp
Median latency: 47ms → 32ms, −32%

The accuracy drop stems from two sources. First, colored text on colored backgrounds (red on orange, blue on purple) loses contrast in grayscale. Second, certain Arabic diacritics rendered in lighter ink disappear when luma thresholding is aggressive. For 98% of retail price tags—black text on white/yellow paper—the difference is undetectable.

When Chroma Matters

Luma-only fails for:

Color-coded forms (red/green checkboxes)
Highlighted text where hue encodes semantic meaning
Low-contrast color pairs (cyan on white, yellow on cream)
Watermarked documents where the watermark is isoluminant with text

For these cases, retain full RGB. Implement adaptive logic: run edge-detection (Sobel) on the Y plane; if edge density is below threshold (< 5% of pixels), fall back to RGB. This hybrid path adds 2ms overhead but preserves accuracy on tricky inputs.

Pipeline Integration: Minimize Copies

The win evaporates if you copy the Y plane multiple times. Optimal flow:

Lock camera buffer (CVPixelBufferLockBaseAddress or Image.Planes)
Pass Y plane pointer directly to ONNX Runtime's OrtValue via MemoryInfo with device type CPU
Let ONNX's preprocessing ops (Resize, Normalize) operate in-place or with single allocation
Unlock buffer immediately after inference submit

Avoid intermediate Mat allocations (OpenCV) or Bitmap copies (Android). On a Pixel 6, eliminating one 1080p memcpy saves 4ms. Chain this with zero-copy Metal/Vulkan texture uploads if your OCR model supports GPU inference—though most mobile text detectors are CPU-bound due to irregular memory access patterns in RNN layers.

Gamma and Perceptual Considerations

YUV's Y channel is gamma-compressed (Rec. 709 or Rec. 2020). OCR models trained on sRGB images expect gamma-corrected luminance. The mismatch is minor—Y ≈ 0.299R' + 0.587G' + 0.114B' where primes denote gamma-encoded values. For linear-light operations (HDR processing), you'd need to degamma, but text recognition is inherently perceptual. We measured no accuracy delta skipping degamma.

One gotcha: iPhone 12 Pro and newer default to Rec. 2020 wide-gamut capture when lighting is bright. The Y coefficients shift slightly (0.2627R + 0.678G + 0.0593B). If your model was trained on Rec. 709, consider forcing AVCaptureDevice.activeColorSpace = .sRGB to avoid subtle color-space drift.

Battery and Thermal Impact

Reducing memory bandwidth directly improves battery life. On a Snapdragon 778G running continuous OCR at 30fps, luma-only preprocessing cut DRAM power from 620mW to 410mW (measured via on-die PMU). Over a 10-minute scanning session, this translated to 3% less battery drain. Thermal headroom also increased—sustained workload before throttling rose from 8 minutes to 11 minutes, allowing higher camera resolution or concurrent LLM inference.

Cross-Platform Nuances

Flutter's camera plugin (camera ^0.10.0) exposes ImageFormatGroup.yuv420 on both iOS and Android, but plane layouts differ. iOS uses biplanar NV12 (Y, interleaved UV). Android uses triplanar YUV420 (Y, U, V separate). Abstraction code:

Uint8List extractLuma(CameraImage image) {
  if (Platform.isIOS) {
    return image.planes[0].bytes; // Y plane
  } else {
    // Android: Y plane is planes[0]
    return image.planes[0].bytes;
  }
}

Both paths yield identical luma data. Stride handling is critical—planes[0].bytesPerRow may exceed width due to alignment. Pass stride to ONNX or pad/crop accordingly.

Real-World Results: Khosomati Price Scanner

Khosomati aggregates grocery prices by scanning shelf tags. Users point their phone at a price label; the app extracts product name, price, and barcode in real time. Original pipeline: 1080p RGB, 52ms median latency, 18fps effective (camera at 30fps, inference at 19fps). After luma-only optimization: 34ms median latency, 28fps effective. The 14ms savings enabled concurrent barcode detection (ZXing) without dropping frames. User-reported "snappiness" improved; App Store rating climbed from 4.1 to 4.6 over two months post-launch.

Tooling and Debugging

Validating luma extraction: dump the Y plane to a PGM file (Portable GrayMap), open in ImageMagick or GIMP. Compare against full RGB→grayscale conversion. PSNR should exceed 40 dB for typical scenes. If PSNR < 35 dB, suspect stride misalignment or plane indexing bugs.

For latency profiling, use Instruments' Metal System Trace (iOS) or Perfetto (Android) to confirm memory bandwidth reduction. Look for decreased DRAM read transactions in the camera → inference window. If bandwidth doesn't drop, you're still copying chroma somewhere—audit every CVPixelBufferLockBaseAddress or Image.Plane access.

When to Skip This Optimization

If your OCR model is GPU-bound (rare for mobile, common for server-side Transformer models), memory bandwidth is less critical. If you need color for semantic parsing (e.g., red "URGENT" stamps), keep RGB. If frame rate is already 60fps and latency < 16ms, the added code complexity isn't worth it. But for battery-sensitive, real-time mobile OCR—especially in emerging markets on mid-range hardware—luma-only preprocessing is table stakes.