Real-time optical character recognition on mobile devices faces a fundamental tension: users expect instant feedback as they scan documents, yet traditional OCR pipelines process entire images in batch mode. For applications like receipt scanners or document digitizers, this creates jarring UX where text appears in chunks seconds after capture. The solution lies in sliding window decoders—a streaming architecture that delivers character-level results at 60fps while maintaining accuracy competitive with batch processing.
The Batch OCR Bottleneck
Standard mobile OCR follows a three-stage pipeline: image preprocessing (binarization, deskewing), text detection (locating bounding boxes), and recognition (decoding characters). On a mid-range Android device, processing a 4000×3000 receipt image through a CRNN-CTC model takes 800-1200ms. The recognition stage dominates: a bidirectional LSTM feeding a CTC output layer must see the entire feature sequence before emitting any output, because its backward pass starts from the last timestep.
This latency compounds in video scenarios. Scanning a long receipt at 30fps delivers a new frame every 33ms, far faster than the 800-1200ms the model needs per frame, forcing frame drops or queue buildup. Users see frozen previews or stale results. Batching multiple frames improves throughput but worsens latency, the classic real-time systems tradeoff.
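To make the backlog concrete, here is a back-of-the-envelope sketch using the figures quoted above. The function name, the single-worker model, and the unbounded queue are assumptions for illustration, not a description of any particular pipeline.

```python
def backlog_after(seconds: float, fps: float, process_ms: float) -> int:
    """Frames still waiting after `seconds`, assuming one worker and no frame drops."""
    arrived = int(seconds * fps)                    # frames delivered by the camera
    processed = int(seconds * 1000 // process_ms)   # frames the model has finished
    return max(0, arrived - processed)


# Ten seconds of scanning at 30fps with a 1000ms model
print(backlog_after(10, 30, 1000))
```

Under these assumptions the queue grows by roughly 29 frames every second, which is why a batch pipeline has to drop frames rather than queue them.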
Sliding Window Architecture
A sliding window decoder treats the input as a temporal stream rather than a spatial image. Instead of waiting for the full frame, we:
- Divide the input into overlapping horizontal strips (typically 256×64 pixels)
- Process strips sequentially through the recognition model
- Merge predictions using a beam search decoder with overlap resolution
- Emit characters as soon as confidence exceeds threshold
The key insight: most text in documents flows horizontally. A strip containing 8-12 characters provides sufficient context for accurate decoding without requiring the full page. By overlapping strips by 25-30%, we ensure no character spans a boundary without context.
Feature Extraction Pipeline
The CNN backbone runs once per frame, extracting a feature map at 1/4 resolution. For a 1920×1080 input, this yields a 480×270 feature tensor. We then slice this tensor into 64-row strips with a 16-row overlap:
    strip_height = 64
    stride = 48  # 16-row overlap (25% of strip_height)
    # +1 so a strip ending exactly on the last row is not skipped
    for y in range(0, feature_height - strip_height + 1, stride):
        strip = features[y:y + strip_height, :]  # full-width horizontal strip
        decode_async(strip)  # hand the strip to a decode queue
Each strip enters a separate decode queue. On devices with multiple performance cores, we can process 2-3 strips concurrently. The CNN forward pass (150ms) amortizes across all strips, while decode latency drops to 40-60ms per strip.
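The slicing and concurrent decode described above can be sketched with a small worker pool. This is a minimal illustration, not the production scheduler: `slice_strips`, `decode_strip`, and `decode_frame` are hypothetical names, the feature map is modeled as a plain list of rows, and the recognizer is replaced by a stub.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_strips(features, strip_height=64, stride=48):
    """Return (row_offset, strip) pairs, top to bottom, for a list-of-rows feature map."""
    last_start = len(features) - strip_height
    return [(y, features[y:y + strip_height])
            for y in range(0, last_start + 1, stride)]


def decode_strip(strip):
    """Hypothetical stand-in for the recognizer; it only reports the strip height."""
    return len(strip)


def decode_frame(features, workers=3):
    """Decode all strips of one frame on a small pool, preserving strip order."""
    strips = slice_strips(features)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(decode_strip, [s for _, s in strips]))
    return [(y, result) for (y, _), result in zip(strips, results)]


features = [[0.0] * 4 for _ in range(270)]   # 270 rows, as in the 1080p example
print([y for y, _ in decode_frame(features)])
```

For a 270-row map this yields five strips starting at rows 0, 48, 96, 144, and 192, so the last 14 rows are not covered by a full strip; a real pipeline would need padding or a final partial strip. Threads only help here if the real decoder releases the interpreter lock, as native inference runtimes typically do.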
CTC Beam Search with Overlap Merging
Connectionist Temporal Classification (CTC) emits a probability distribution over characters plus a blank token at each timestep. Standard beam search maintains top-K hypotheses, pruning low-probability paths. With sliding windows, we extend this to handle strip boundaries:
- Prefix carryover: When starting strip N+1, initialize beam with the top-3 suffixes from strip N
- Confidence gating: Only emit characters when the cumulative log-probability exceeds -2.5 (tuned empirically)
- Overlap voting: Characters appearing in the overlap region of adjacent strips must agree within edit distance 1
This overlap voting is critical. Without it, boundary artifacts create doubled characters or dropped letters. We maintain a 16-character circular buffer per strip pair, comparing predictions in the overlap zone. If disagreement occurs, we defer emission until the next strip provides a tiebreaker.
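A minimal sketch of the overlap check, under two simplifying assumptions that are not from the original pipeline: the overlap is measured in characters rather than pixels, and when the two strips agree closely enough the left strip's copy of the overlap is kept. `edit_distance` and `merge_overlap` are illustrative names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def merge_overlap(left: str, right: str, overlap: int, max_dist: int = 1):
    """Join two strip transcripts that share `overlap` characters.

    Returns the merged text, or None when the overlap regions disagree by
    more than `max_dist` edits, signalling the caller to defer emission.
    """
    if overlap <= 0:
        return left + right
    tail, head = left[-overlap:], right[:overlap]
    if edit_distance(tail, head) > max_dist:
        return None
    return left + right[overlap:]


print(merge_overlap("TOTAL 12", "L 12.50", 4))   # overlap regions agree
print(merge_overlap("TOTAL 12", "XY Z.50", 4))   # disagreement: deferred
```

Returning None instead of guessing mirrors the deferral described above: the caller holds the characters until the next strip provides a tiebreaker.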
Handling Variable-Width Fonts
Monospace text simplifies strip alignment—each character occupies fixed horizontal space. Proportional fonts break this assumption. A narrow 'i' and wide 'W' require different context windows. We address this with adaptive stride:
    char_widths = estimate_widths(strip_features)
    avg_width = median(char_widths)
    stride = max(32, int(strip_height * 0.75 - avg_width * 2))
By estimating character widths from the feature map (via a lightweight regressor head), we adjust stride dynamically. With this formula the stride shrinks as the median character width grows: regions with wide glyphs get smaller strides (more overlap), while regions of narrow glyphs allow larger jumps. The floor of 32 keeps latency bounded while maintaining accuracy.
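The stride formula can be wrapped in a small function to see how it behaves. This sketch takes the estimated widths as an input, since the regressor head itself is not specified here; the fallback for an empty estimate is an added assumption.

```python
from statistics import median


def adaptive_stride(char_widths, strip_height=64, min_stride=32):
    """Stride from the median estimated character width (in feature-map units)."""
    if not char_widths:
        # Assumption: with no width estimate, fall back to the fixed 75% stride.
        return int(strip_height * 0.75)
    avg_width = median(char_widths)
    return max(min_stride, int(strip_height * 0.75 - avg_width * 2))


print(adaptive_stride([4, 5, 6]))      # median width 5
print(adaptive_stride([10, 12, 14]))   # median width 12, clamped at the floor
```

With a 64-row strip the result always lies between 32 and 48, so the overlap varies between 25% and 50%.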
Temporal Consistency Across Frames
Video OCR introduces another dimension: consecutive frames show the same text with slight camera motion. Naive per-frame decoding produces flickering results as predictions vary frame-to-frame. We solve this with a temporal filter:
- Maintain a sliding window of the last 5 frames' predictions
- Align predictions using Needleman-Wunsch sequence alignment
- Emit only characters that agree at the same aligned position in at least 3 of the 5 frames
- Use optical flow to track regions across frames, avoiding re-decode of static areas
The optical flow optimization is substantial. If a text region moves less than 4 pixels between frames, we reuse its previous decode result, updating only the bounding box. On a typical receipt scan where the user holds the phone steady, 60-70% of strips are cache hits, dropping effective latency to 15-20ms per frame.
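The alignment and voting steps of the temporal filter can be sketched as follows. This is a simplified illustration rather than the shipped filter: it aligns every frame to the newest transcript instead of building a full multiple alignment, so characters missing from the newest frame are never emitted, and the scoring values (match +1, mismatch -1, gap -1) are assumptions.

```python
from collections import Counter


def align_to_reference(ref, other, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment.

    Returns a list with one entry per character of `ref`: the character of
    `other` aligned to it, or None where `ref` is aligned to a gap.
    """
    n, m = len(ref), len(other)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == other[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, preferring diagonal moves.
    aligned = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + (match if ref[i - 1] == other[j - 1] else mismatch)
        if score[i][j] == diag:
            aligned[i - 1] = other[j - 1]
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1   # ref character aligned to a gap
        else:
            j -= 1   # extra character in `other`, skipped
    return aligned


def stable_text(frames, min_votes=3):
    """Majority vote per position across recent frame transcripts (newest last)."""
    ref = frames[-1]
    columns = [align_to_reference(ref, f) for f in frames]
    out = []
    for pos in range(len(ref)):
        votes = Counter(col[pos] for col in columns if col[pos] is not None)
        if votes:
            ch, count = votes.most_common(1)[0]
            if count >= min_votes:
                out.append(ch)
    return "".join(out)


frames = ["MILK 2.49", "MILK 2.49", "M1LK 2.49", "MILK 2.49", "MILK 2.A9"]
print(stable_text(frames))
```

In this example the single-frame misreads ("1" for "I", "A" for "4") are outvoted by the other frames, which is the flicker suppression the filter is meant to provide.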
Memory and Power Constraints
Streaming architectures risk unbounded memory growth. Each strip's beam search maintains 10-20 hypotheses with associated character sequences. At 30fps with 20 strips per frame, that is up to 30 × 20 × 20 = 12,000 new hypotheses per second if nothing is released. We cap this with:
- Hypothesis pooling: Allocate a fixed 2MB buffer for all beams, reusing memory as strips complete
- Aggressive pruning: Drop hypotheses with log-probability below -8.0 (vs. -12.0 in batch mode)
- Quantized features: Store strip features as int8 rather than float32, reducing cache footprint by 4×
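Two of these caps are easy to illustrate. The sketch below shows threshold-plus-top-K pruning and a symmetric int8 quantizer; both are minimal examples under stated assumptions. The `(log_prob, text)` tuple layout, the function names, and the per-tensor symmetric scaling are illustrative choices, not details confirmed by the pipeline described here.

```python
import heapq
from array import array


def prune(hypotheses, min_logp=-8.0, max_beams=20):
    """Keep at most `max_beams` (log_prob, text) pairs scoring >= min_logp, best first."""
    alive = [h for h in hypotheses if h[0] >= min_logp]
    return heapq.nlargest(max_beams, alive, key=lambda h: h[0])


def quantize_int8(values):
    """Symmetric linear quantization of floats to int8; returns (int8 array, scale)."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127 if peak > 0 else 1.0
    return array("b", (round(v / scale) for v in values)), scale


beams = [(-1.0, "TOTAL"), (-9.0, "T0TAL"), (-3.0, "TOTAI")]
print(prune(beams))               # the -9.0 hypothesis falls below the threshold

q, scale = quantize_int8([0.5, -1.0, 0.25])
print(q.itemsize, array("f").itemsize)   # bytes per element: int8 vs. float32
```

Dequantizing is `value * scale`, and with round-to-nearest the error per element is at most half a quantization step.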
Power consumption also matters. Running the CNN at 60fps drains the battery quickly. We use adaptive frame rates: 60fps during active scanning (detected via gyroscope motion), dropping to 15fps when stable. Combined with DVFS (dynamic voltage-frequency scaling) hints to the OS, this keeps the thermal envelope under 2W on typical devices.
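A minimal sketch of the frame-rate switch, assuming the gyroscope reports angular speed in rad/s. The two thresholds and the hysteresis band between them are assumptions added here to avoid rapid toggling near a single cutoff; the class name is hypothetical.

```python
class FrameRateGovernor:
    """Switch between an active and an idle frame rate based on gyroscope motion."""

    def __init__(self, start=0.20, stop=0.10, active_fps=60, idle_fps=15):
        self.start = start            # rad/s at or above which scanning counts as active
        self.stop = stop              # rad/s at or below which the phone counts as stable
        self.active_fps = active_fps
        self.idle_fps = idle_fps
        self.fps = idle_fps

    def update(self, angular_speed: float) -> int:
        """Feed one gyroscope reading and return the frame rate to request."""
        if angular_speed >= self.start:
            self.fps = self.active_fps
        elif angular_speed <= self.stop:
            self.fps = self.idle_fps
        # Readings between stop and start keep the current rate (hysteresis).
        return self.fps


governor = FrameRateGovernor()
print([governor.update(w) for w in (0.05, 0.25, 0.15, 0.05)])
```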
Accuracy vs. Latency Tradeoffs
Streaming introduces accuracy degradation—smaller context windows provide less information for disambiguation. In production testing across 50,000 receipts, we measured:
- Batch mode: 94.2% character accuracy, 1100ms latency
- Sliding window (256px): 91.8% accuracy, 180ms latency
- Sliding window (384px): 93.5% accuracy, 280ms latency
The accuracy gap narrows with larger windows, but latency increases roughly in proportion to window width. For most applications, 91-92% character accuracy suffices: users tolerate minor errors if results appear instantly. The UX win from real-time feedback outweighs a loss of 2-3 percentage points in accuracy.
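The tradeoff is easier to compare as accuracy points lost versus speedup gained. The short calculation below uses only the three measurements listed above; the dictionary keys and function name are illustrative.

```python
# (character accuracy in %, latency in ms), taken from the measurements above
configs = {
    "batch": (94.2, 1100),
    "window_256": (91.8, 180),
    "window_384": (93.5, 280),
}


def versus_batch(name):
    """Return (accuracy points lost, speedup factor) relative to batch mode."""
    base_acc, base_ms = configs["batch"]
    acc, ms = configs[name]
    return round(base_acc - acc, 1), round(base_ms / ms, 1)


for name in ("window_256", "window_384"):
    lost, speedup = versus_batch(name)
    print(f"{name}: {lost} points lost, {speedup}x faster")
```

The 256px window gives up 2.4 points for about a 6.1× speedup, while the 384px window gives up 0.7 points for about 3.9×.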
Production Lessons
Shipping this in Khosomati, a price-comparison app scanning grocery receipts, revealed edge cases:
- Curved receipts: Thermal paper curls, creating perspective distortion. We added a lightweight mesh warp stage (10ms) before strip extraction.
- Multi-column layouts: Some receipts use two-column formatting. Our horizontal strips failed here—we detect columns via vertical projection and process each independently.
- Low-light noise: Dark environments amplify camera noise, confusing the feature extractor. A 3×3 bilateral filter (8ms) in preprocessing improved accuracy 4% in