Morphological Dilation in Mobile OCR: Edge Repair

Mobile OCR pipelines face a brutal reality: real-world documents arrive with inconsistent lighting, compression artifacts, motion blur, and lens distortion. A pristine lab dataset yields 98% character accuracy; production traffic from a budget Android phone under fluorescent office lighting drops to 84%. The gap isn't model quality; it's preprocessing. The specific culprit is broken character edges: binarization thresholds slice through stroke pixels, leaving disconnected segments that confuse even transformer-based recognizers.

Morphological dilation, a foundational computer vision operation, repairs these fractures by expanding foreground regions. In Khosomati, a price-aggregation app processing 40,000 grocery receipts daily, adding a tuned dilation pass (a single 5×5 pass) lifted real-world accuracy from 86.2% to 94.1% on mixed Arabic and English text. The cost: 18 milliseconds per frame on a Snapdragon 778G. This article dissects when dilation helps, how to parameterize it for mobile constraints, and where it fails.

Why Binarization Breaks Characters

Thresholding converts grayscale images to binary by choosing a cutoff: a single global one (Otsu) or a per-region or per-pixel one (the adaptive methods of Niblack and Sauvola). When a character stroke sits near the threshold, noise or compression can fragment it. A lowercase 'e' becomes two arcs. A '5' loses its horizontal bar. These aren't OCR model failures; the input geometry is genuinely ambiguous.
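To make "near the threshold" concrete, here is a minimal sketch of Sauvola's local threshold, T = m · (1 + k · (s / R − 1)), for one window of pixels, where m and s are the window's mean and standard deviation. The values k = 0.2 and R = 128 are commonly quoted defaults, and the window contents are invented for illustration; none of them come from the apps discussed here.

```swift
// Sauvola's local threshold for one window of grayscale values (0...255).
// k and r are commonly quoted defaults, assumed here for illustration.
func sauvolaThreshold(window: [Double], k: Double = 0.2, r: Double = 128) -> Double {
    precondition(!window.isEmpty, "window must contain at least one pixel")
    let n = Double(window.count)
    let mean = window.reduce(0, +) / n
    let variance = window.map { ($0 - mean) * ($0 - mean) }.reduce(0, +) / n
    let stdDev = variance.squareRoot()
    return mean * (1 + k * (stdDev / r - 1))
}

// Hypothetical window: a faint stroke (about 150) on a light background (about 200).
let window: [Double] = [200, 200, 200, 150, 150, 150, 200, 200, 200]
let threshold = sauvolaThreshold(window: window)

// Pixels at or below the threshold count as ink.
func isInk(_ value: Double) -> Bool { return value <= threshold }

print("threshold:", threshold)
print("stroke pixel at 150 is ink:", isInk(150))
print("same pixel with +6 noise is ink:", isInk(156))
```

With these numbers the threshold lands between 153 and 154, so a few gray levels of sensor noise or JPEG ringing are enough to flip a stroke pixel to background.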

Traditional solutions include Gaussian blurring before binarization or multi-scale thresholding. Blurring costs 8–15ms on mobile and risks merging adjacent characters in dense text. Multi-scale methods triple compute. Morphological operations, by contrast, work directly on binary masks with integer arithmetic, leveraging CPU SIMD or GPU compute shaders for 3–5× speedup over naive implementations.

Dilation Mechanics: Structuring Elements

Dilation expands every foreground pixel by a structuring element—a small kernel defining the expansion shape. A 3×3 square kernel adds one pixel in all eight directions. A 5×5 cross kernel extends horizontally and vertically but not diagonally. The choice matters:

Square kernels thicken strokes uniformly, reconnecting diagonal breaks in scripts like Devanagari or Arabic where diacritics hover near baselines.
Cross kernels preserve aspect ratio better, avoiding the 'bloated' look in Latin sans-serif fonts where horizontal and vertical strokes differ in weight.
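Elliptical kernels approximate isotropic expansion. The kernel is still a binary mask, so no floating-point math is involved, but unlike a square it cannot be split into separate row and column passes, which makes larger sizes noticeably slower on mobile.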
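The difference between square and cross kernels is easiest to see on a tiny mask. The following is a plain-Swift sketch, not an optimized implementation: the Mask and Kernel types and every function name are hypothetical, and the mask is a nested array rather than a real image buffer. It stamps the kernel onto every ink pixel, which is equivalent to dilation for symmetric kernels like these.

```swift
typealias Mask = [[UInt8]]                  // mask[y][x]; 1 = ink, 0 = background
typealias Kernel = [(dx: Int, dy: Int)]     // offsets from the centre pixel

// Builds a mask from rows of "#" (ink) and "." (background).
func makeMask(_ rows: [String]) -> Mask {
    return rows.map { row in row.map { ch in ch == "#" ? UInt8(1) : UInt8(0) } }
}

// Every offset in a size × size block.
func squareKernel(size: Int) -> Kernel {
    let r = size / 2
    var offsets: Kernel = []
    for dy in -r...r {
        for dx in -r...r { offsets.append((dx: dx, dy: dy)) }
    }
    return offsets
}

// Only the centre row and centre column of the block.
func crossKernel(size: Int) -> Kernel {
    let r = size / 2
    var offsets: Kernel = []
    for d in -r...r {
        offsets.append((dx: d, dy: 0))
        if d != 0 { offsets.append((dx: 0, dy: d)) }
    }
    return offsets
}

// Binary dilation: stamp the kernel onto the output at every ink pixel.
func dilate(_ mask: Mask, with kernel: Kernel) -> Mask {
    let height = mask.count, width = mask.first?.count ?? 0
    var out = mask
    for y in 0..<height {
        for x in 0..<width where mask[y][x] == 1 {
            for offset in kernel {
                let nx = x + offset.dx, ny = y + offset.dy
                if nx >= 0, nx < width, ny >= 0, ny < height { out[ny][nx] = 1 }
            }
        }
    }
    return out
}

// Two ink pixels on a diagonal with a one-pixel gap between them at (2, 2).
let broken = makeMask([
    ".....",
    ".#...",
    ".....",
    "...#.",
    ".....",
])
let withSquare = dilate(broken, with: squareKernel(size: 3))
let withCross = dilate(broken, with: crossKernel(size: 3))
print("square fills the gap pixel:", withSquare[2][2] == 1)
print("cross fills the gap pixel:", withCross[2][2] == 1)
```

The square kernel fills the diagonal gap pixel while the cross kernel leaves it empty, which is why diagonal breaks favor square kernels.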

In practice, a 3×3 square kernel handles 80% of cases. For receipts with 8pt font at 150 DPI capture resolution, a single dilation pass reconnects most breaks without merging adjacent characters. At 200+ DPI, a 5×5 kernel becomes necessary, but latency jumps to 18ms on mid-tier devices.

Implementation: SIMD and GPU Paths

OpenCV's cv::dilate uses NEON intrinsics on ARM, processing 16 pixels per instruction. A naive loop over a 1080×1920 binary image takes 35ms; the SIMD path drops to 9ms. For Flutter apps, calling into native OpenCV via FFI introduces 2–3ms marshalling overhead. Platform channels add another 1ms. Total: roughly 12–13ms, which fits inside the 16.7ms frame budget of a 60fps UI, provided OCR runs off the main thread.

Metal (on iOS) and Vulkan (on Android) compute shaders can parallelize dilation across GPU cores; the Metal path hits 4ms on an A15 Bionic. The trade-off: 8–12ms GPU initialization and memory transfer on the first frame, plus 2–4ms per subsequent frame for texture uploads. For batch processing (scanning a multi-page document), GPU wins. For single-shot capture (photographing a business card), CPU SIMD avoids the cold-start penalty.

Code Sketch: CPU Dilation in Swift

import Accelerate

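// Dilates an 8-bit single-channel (Planar8) image with a 3×3 kernel.
// The caller owns the returned buffer and must call free() on it when done.
func dilate(_ input: vImage_Buffer, kernel: [UInt8]) throws -> vImage_Buffer {
    precondition(kernel.count == 9, "expected a 3×3 kernel in row-major order")
    var source = input  // vImage takes buffers by pointer, so copy the struct into a var
    var output = try vImage_Buffer(width: Int(input.width), height: Int(input.height), bitsPerPixel: 8)
    let error = vImageDilate_Planar8(&source, &output, 0, 0, kernel, 3, 3, vImage_Flags(kvImageNoFlags))
    precondition(error == kvImageNoError, "vImageDilate_Planar8 failed with code \(error)")
    return output
}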

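Accelerate's vImageDilate_Planar8 is vectorized for the host CPU and handles edge cases and buffer alignment. A 3×3 kernel is nine UInt8 values. Note that vImage treats the kernel as weights subtracted from the source pixels before taking the maximum, not as a 0/1 mask, so a flat square structuring element is nine 0s rather than nine 1s. For that uniform case, vImageMax_Planar8 is the faster specialized call.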

When Dilation Fails: Merged Characters

Over-dilation collapses inter-character gaps. A 7×7 kernel on 10pt text at 150 DPI merges 'rn' into 'm', 'cl' into 'd'. The OCR model sees valid geometry but wrong semantics. Precision drops even as individual character strokes improve.

The fix: conditional dilation based on connected component analysis. Before dilating, segment the binary mask into blobs. If a blob's bounding box is smaller than expected character size (e.g., width < 8px for 10pt text), apply dilation. Larger blobs skip it. This adds 6ms for component labeling but prevents false merges. In KidzCare, a speech therapy app analyzing handwritten worksheets, conditional dilation reduced 'rn'/'m' confusion from 12% to 2% without sacrificing broken-stroke repair.
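The gating logic described above can be sketched in plain Swift. This is an illustration under stated assumptions, not KidzCare's implementation: the Mask type, the Component struct, and the function names are hypothetical, components are found with a simple 8-connected flood fill, and only bounding-box width is checked, with the 8px figure taken from the example in the text.

```swift
typealias Mask = [[UInt8]]   // mask[y][x]; 1 = ink, 0 = background

// Builds a mask from rows of "#" (ink) and "." (background).
func makeMask(_ rows: [String]) -> Mask {
    return rows.map { row in row.map { ch in ch == "#" ? UInt8(1) : UInt8(0) } }
}

struct Component {
    var pixels: [(x: Int, y: Int)] = []
    var minX = Int.max
    var maxX = Int.min
    var width: Int { return maxX - minX + 1 }   // bounding-box width
}

// 8-connected component labeling via an explicit flood-fill stack.
func connectedComponents(_ mask: Mask) -> [Component] {
    let h = mask.count, w = mask.first?.count ?? 0
    var seen = [[Bool]](repeating: [Bool](repeating: false, count: w), count: h)
    var result: [Component] = []
    for y0 in 0..<h {
        for x0 in 0..<w where mask[y0][x0] == 1 && !seen[y0][x0] {
            var comp = Component()
            var stack = [(x: x0, y: y0)]
            seen[y0][x0] = true
            while let p = stack.popLast() {
                comp.pixels.append(p)
                comp.minX = min(comp.minX, p.x)
                comp.maxX = max(comp.maxX, p.x)
                for dy in -1...1 {
                    for dx in -1...1 {
                        let nx = p.x + dx, ny = p.y + dy
                        if nx >= 0, nx < w, ny >= 0, ny < h, mask[ny][nx] == 1, !seen[ny][nx] {
                            seen[ny][nx] = true
                            stack.append((x: nx, y: ny))
                        }
                    }
                }
            }
            result.append(comp)
        }
    }
    return result
}

// Applies a 3×3 square dilation only to blobs narrower than minCharWidth.
func conditionalDilate(_ mask: Mask, minCharWidth: Int) -> Mask {
    let h = mask.count, w = mask.first?.count ?? 0
    var out = mask
    for comp in connectedComponents(mask) where comp.width < minCharWidth {
        for p in comp.pixels {
            for dy in -1...1 {
                for dx in -1...1 {
                    let nx = p.x + dx, ny = p.y + dy
                    if nx >= 0, nx < w, ny >= 0, ny < h { out[ny][nx] = 1 }
                }
            }
        }
    }
    return out
}

// Two 2px fragments of one broken stroke, then two intact 8px characters
// separated by a 1px gap that must survive.
let input = makeMask(["##.##...########.########"])
let repaired = conditionalDilate(input, minCharWidth: 8)
print("blobs before:", connectedComponents(input).count)
print("blobs after:", connectedComponents(repaired).count)
```

In this toy row the fragments fuse into one blob while the gap between the two full-width characters is left alone. A small fragment sitting right next to a full character can still grow into it, so the gate reduces false merges rather than eliminating them.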

Morphological Closing: Dilation + Erosion

Closing (dilation followed by erosion with the same kernel) fills gaps narrower than the kernel, such as pinholes and hairline cracks inside a stroke, while restoring the original stroke width. Erosion shrinks the dilated mask, but the small gaps stay filled because dilation bridged them first. Genuine counters, like the enclosed space of an 'o' or 'e', survive as long as they are wider than the kernel. This costs 2× the latency (24ms total) but handles a wider range of defects.
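A minimal plain-Swift sketch of closing with a 3×3 square kernel follows. The Mask type and function names are hypothetical. One labeled assumption: pixels beyond the image border count as background during dilation and as ink during erosion, a common convention that keeps closing from eating ink at the edges.

```swift
typealias Mask = [[UInt8]]   // mask[y][x]; 1 = ink, 0 = background

// Builds a mask from rows of "#" (ink) and "." (background).
func makeMask(_ rows: [String]) -> Mask {
    return rows.map { row in row.map { ch in ch == "#" ? UInt8(1) : UInt8(0) } }
}

// Combines each pixel with its 3×3 neighbourhood.
// `outside` is the value assumed for neighbours beyond the image border.
func morph3x3(_ mask: Mask, outside: UInt8, combine: (UInt8, UInt8) -> UInt8) -> Mask {
    let h = mask.count, w = mask.first?.count ?? 0
    var out = mask
    for y in 0..<h {
        for x in 0..<w {
            var acc = mask[y][x]
            for dy in -1...1 {
                for dx in -1...1 {
                    let nx = x + dx, ny = y + dy
                    let inside = nx >= 0 && nx < w && ny >= 0 && ny < h
                    acc = combine(acc, inside ? mask[ny][nx] : outside)
                }
            }
            out[y][x] = acc
        }
    }
    return out
}

// Dilation is a neighbourhood maximum, erosion a neighbourhood minimum.
func dilate3x3(_ mask: Mask) -> Mask { return morph3x3(mask, outside: 0, combine: max) }
func erode3x3(_ mask: Mask) -> Mask { return morph3x3(mask, outside: 1, combine: min) }
func close3x3(_ mask: Mask) -> Mask { return erode3x3(dilate3x3(mask)) }

// A 3px-thick horizontal stroke with a 1px crack through it.
let cracked = makeMask([
    "...........",
    "...........",
    "..###.###..",
    "..###.###..",
    "..###.###..",
    "...........",
    "...........",
])
let dilatedOnly = dilate3x3(cracked)
let closed = close3x3(cracked)

func inkCount(_ mask: Mask) -> Int { return mask.joined().reduce(0) { $0 + Int($1) } }
print("ink pixels: original", inkCount(cracked),
      "dilated", inkCount(dilatedOnly),
      "closed", inkCount(closed))
```

Dilation alone bridges the crack but leaves the stroke five pixels thick; closing bridges it and returns the stroke to its original three.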

For receipts with crumpled paper or inkjet smudging, closing outperforms dilation alone by 3–5 percentage points. For clean documents, the gains don't justify the latency. GlucoScan AI, which processes PPG waveforms but also scans glucose meter screens via OCR, uses closing only when initial OCR confidence falls below 0.7, a heuristic that keeps P95 latency under 18ms.
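A confidence gate of this kind might look like the sketch below. Everything here is hypothetical: OCRResult, recognizeWithFallback, and the two closure parameters stand in for whatever recognizer and closing routine an app actually uses, and keeping the higher-confidence result is a design choice of this sketch, not something the source describes.

```swift
typealias Mask = [[UInt8]]   // mask[y][x]; 1 = ink, 0 = background

struct OCRResult {
    let text: String
    let confidence: Double   // 0.0 ... 1.0
}

// Runs OCR once; only if confidence is below the floor does it pay for
// closing plus a second OCR pass, keeping whichever result scores higher.
func recognizeWithFallback(_ mask: Mask,
                           confidenceFloor: Double = 0.7,
                           recognize: (Mask) -> OCRResult,
                           close: (Mask) -> Mask) -> OCRResult {
    let first = recognize(mask)
    guard first.confidence < confidenceFloor else { return first }  // fast path
    let second = recognize(close(mask))
    return second.confidence > first.confidence ? second : first
}
```

The point of the gate is that clean frames pay for one OCR pass and no morphology at all; only frames below the floor pay for the closing pass and a second recognition.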

Parameter Tuning: Kernel Size and Iterations

Kernel size and iteration count are the two knobs. A single pass with a 5×5 kernel equals roughly two passes with a 3×3 kernel in terms of expansion distance, but the former is faster (one memory sweep vs. two). However, multiple small passes can adapt mid-stream: after the first dilation, re-run connected component analysis and skip the second pass for already-connected regions.

Empirical tuning on 5,000 annotated receipt images in Khosomati revealed:

3×3 kernel, 1 iteration: 86.2% → 91.4% accuracy, 12ms
3×3 kernel, 2 iterations: 86.2% → 93.1%, 22ms
5×5 kernel, 1 iteration: 86.2% → 94.1%, 18ms
Morphological closing (3×3): 86.2% → 93.8%, 24ms

The 5×5 single-pass configuration won on the accuracy-latency Pareto frontier for that dataset. Different fonts, DPI, and lighting conditions shift the optimum.
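For readers who want to repeat this comparison on their own measurements, here is a small sketch that filters a set of configurations down to the non-dominated ones. The Config struct and function names are hypothetical; the four data points are the figures listed above.

```swift
struct Config {
    let name: String
    let accuracy: Double    // percent
    let latencyMs: Double
}

let configs = [
    Config(name: "3×3, 1 iteration", accuracy: 91.4, latencyMs: 12),
    Config(name: "3×3, 2 iterations", accuracy: 93.1, latencyMs: 22),
    Config(name: "5×5, 1 iteration", accuracy: 94.1, latencyMs: 18),
    Config(name: "closing (3×3)", accuracy: 93.8, latencyMs: 24),
]

// a dominates b if it is no worse on both axes and strictly better on one.
func dominates(_ a: Config, _ b: Config) -> Bool {
    return a.accuracy >= b.accuracy && a.latencyMs <= b.latencyMs
        && (a.accuracy > b.accuracy || a.latencyMs < b.latencyMs)
}

// Keeps every configuration that no other configuration dominates.
func paretoFrontier(_ all: [Config]) -> [Config] {
    return all.filter { candidate in
        !all.contains(where: { other in dominates(other, candidate) })
    }
}

let frontier = paretoFrontier(configs)
for config in frontier {
    print("\(config.name): \(config.accuracy)% at \(config.latencyMs) ms")
}
```

Under this definition two configurations survive: the 3×3 single pass (cheapest) and the 5×5 single pass (most accurate). The two-pass and closing variants are both slower and less accurate than the 5×5 pass, so the remaining decision is whether 2.7 points of accuracy are worth 6ms.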

Integration with Modern OCR Pipelines

Modern neural OCR models such as TrOCR (transformer-based) and PaddleOCR learn robust features from large datasets, reducing reliance on perfect preprocessing. But on-device inference constraints force quantized or distilled models, which lose some robustness. Morphological preprocessing compensates, especially for underrepresented scripts or edge cases outside the training distribution.

In practice, dilation sits between binarization and the OCR model. The pipeline:

Capture frame (Camera2 API, AVFoundation)
Perspective correction (homography)
Grayscale conversion
Adaptive thresholding (Sauvola)
Morphological dilation (3×3 or 5×5)
ONNX Runtime inference (TrOCR or PaddleOCR)

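Total latency on a Pixel 6: 45ms (8ms capture, 6ms perspective, 2ms grayscale, 9ms threshold, 12ms dilation, 8ms inference). A 60fps frame lasts 16.7ms, so the pipeline spans nearly three frames and has to run off the main thread. The UI keeps rendering at full rate while recognition results arrive at up to about 22 per second.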

Alternatives and Trade-offs

Bilateral filtering preserves edges while smoothing noise, but costs 40–60ms on mobile. Deep learning denoisers (trained CNNs) require separate model inference, adding 15–25ms and memory overhead. Morphological operations are deterministic, interpretable, and cheap, making them a pragmatic first line of defense.

For applications where latency trumps accuracy—live camera preview with on-screen OCR overlay—skip dilation and accept lower accuracy. For batch processing (scanning a stack of documents), run dilation on all frames. The right choice depends on whether the user waits for results or expects real-time feedback.

Practical Recommendations

Start with a 3×3 square kernel and single dilation pass. Profile on representative devices (not flagship phones). If accuracy gains exceed 5 percentage points and P95 latency stays under 20ms, ship it. If merging occurs, switch to conditional dilation with connected component gating. For multi-script apps (Arabic + English, Hindi + numerals), test kernel shapes: cross kernels often work better for mixed directionality.

Morphological dilation won't rescue a fundamentally bad OCR model or compensate for 480p capture resolution. But for the 10–15 percentage-point accuracy gap between lab and field, it's a high-leverage, low-cost intervention that ships in kilobytes of code and runs in under 20 milliseconds.