OCR Price Extraction at Scale: Architecture

The Real-World OCR Problem

Most OCR tutorials stop at "we got text from an image." Production systems need structured data: extract the price "12.50" from a cluttered receipt, link it to "Organic Tomatoes," handle mixed Arabic-English layouts, and do it fast enough for mobile UX. When building Khosomati—a price aggregation app that scans supermarket receipts—we learned that the hard problems start after basic text recognition.

The pipeline processes 200,000+ images monthly across Jordan, Palestine, and the UAE. Users photograph receipts in poor lighting, on crumpled paper, and with mixed RTL/LTR text. The system must extract item names, prices, quantities, and totals with enough accuracy to power price comparison without manual correction.

Architecture: Three-Stage Pipeline

We decompose OCR into three distinct stages, each optimized independently:

Stage 1: Image Preprocessing on Device

Before any ML model runs, client-side preprocessing reduces noise and normalizes input. The Flutter app applies:

Perspective correction using OpenCV's findContours and getPerspectiveTransform. Receipts are often photographed at angles; we detect the largest quadrilateral and warp to a rectangle.
Adaptive thresholding with Gaussian blur (kernel size 5×5) converts color images to binary, handling uneven lighting across the receipt.
Deskewing via Hough line detection. Receipts tilted more than 2° are rotated to align text baselines horizontally.
Resolution normalization to 1200 DPI equivalent. Downsampling images from high-resolution phone cameras (48MP+) to a consistent target improves model consistency and reduces upload size by 60%.

This stage runs in 180–240ms on mid-range Android devices using Flutter's image package bindings to native OpenCV. Preprocessing failures (for example, no receipt boundary found) trigger a retake prompt rather than sending garbage to the server.
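The geometry around perspective correction and the retake decision can be sketched in plain Python. This is a minimal illustration, not the app's code: it assumes the contour step has already produced four corner points, and the function names and the min_fraction cutoff are hypothetical. The ordered corners and target size are what would be handed to OpenCV's getPerspectiveTransform.

```python
import math

def order_corners(quad):
    """Return corners as (top-left, top-right, bottom-right, bottom-left)."""
    # Top-left has the smallest x + y, bottom-right the largest.
    tl = min(quad, key=lambda p: p[0] + p[1])
    br = max(quad, key=lambda p: p[0] + p[1])
    # Top-right has the largest x - y, bottom-left the smallest.
    tr = max(quad, key=lambda p: p[0] - p[1])
    bl = min(quad, key=lambda p: p[0] - p[1])
    return tl, tr, br, bl

def target_size(quad):
    """Width and height of the rectangle the quadrilateral is warped to."""
    tl, tr, br, bl = order_corners(quad)
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)

def quad_area(quad):
    """Area of the ordered quadrilateral via the shoelace formula."""
    pts = order_corners(quad)
    total = 0.0
    for i in range(4):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % 4]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def needs_retake(quad, image_w, image_h, min_fraction=0.2):
    """Ask for a retake when no boundary was found or it is implausibly small.

    min_fraction is an assumed cutoff, not a value from the article.
    """
    if quad is None:
        return True
    return quad_area(quad) < min_fraction * image_w * image_h
```

Ordering the corners first matters because contour detection returns points in no guaranteed order, and a mismatched source/destination pairing would flip or twist the warped receipt.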

Stage 2: Text Detection and Recognition

We run two models in sequence rather than an end-to-end approach, trading latency for accuracy and debuggability.

Detection: A fine-tuned CRAFT (Character Region Awareness for Text) model locates text regions. CRAFT outputs character-level heatmaps, which we cluster into word bounding boxes. Fine-tuning on 12,000 Arabic supermarket receipts (collected via crowdsourcing) improved recall from 81% to 93% on curved or faded text compared to the pretrained model.

Recognition: Each detected bounding box is cropped and passed to a TrOCR transformer (Microsoft's encoder-decoder architecture). We serve two models: one for Arabic, one for English/numerals. Language detection is trivial (a Unicode range check), but selecting a model per region would be slower. Instead, we run both in parallel on the server (8-core Xeon, 32GB RAM) and merge results based on bounding box coordinates.
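One way the merge step could work is sketched below. The article only says results are merged by bounding box coordinates; keeping the higher-confidence hypothesis per box is an assumption, as are the Recognition type and the merge_by_bbox name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Recognition:
    text: str
    bbox: tuple          # (x, y, width, height) of the crop sent to both models
    confidence: float

def merge_by_bbox(arabic_results, latin_results):
    """Keep one hypothesis per bounding box: the more confident one (assumed rule)."""
    best = {}
    for rec in [*arabic_results, *latin_results]:
        current = best.get(rec.bbox)
        if current is None or rec.confidence > current.confidence:
            best[rec.bbox] = rec
    # Return regions top-to-bottom, then left-to-right.
    return sorted(best.values(), key=lambda r: (r.bbox[1], r.bbox[0]))
```

Because both models see the same crops, the bounding box works as a join key and no geometric overlap test is needed in this sketch.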

TrOCR achieves 96.2% character accuracy on English prices and 91.8% on Arabic item names in our test set (2,400 annotated receipts). The gap is due to Arabic's connected letterforms and diacritic ambiguity. We accept this; downstream parsing compensates.

Stage 3: Structured Extraction via Rule-Based Parser

Raw OCR output is a list of (text, bbox, confidence) tuples. Extracting "Tomatoes → 12.50 JOD" requires spatial reasoning and domain heuristics.

Our parser:

Clusters text into rows using y-coordinate proximity (within 8px). Receipts are line-oriented; this groups related items.
Identifies price candidates via regex: \d+[.,]\d{2} with currency symbols (JOD, ILS, AED, ₪, د.أ). Confidence threshold: 0.85. Lower-confidence numbers are flagged for manual review.
Associates items to prices by scanning left (for LTR) or right (for RTL) from each price. The nearest text cluster within 100px becomes the item name. If multiple prices appear in one row (e.g., quantity × unit price = total), we take the rightmost as the line total.
Extracts metadata: store name (top 15% of image), date (regex + dateutil parsing), total (largest price near "Total" or "المجموع" keyword).
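The row clustering and item-price association steps above can be sketched as follows. This is an illustration under stated assumptions rather than the production parser: the Token type and function names are hypothetical, currency symbols are assumed to arrive as separate tokens, and the 100px gap is measured from the edge of the row's price block.

```python
import re
from dataclasses import dataclass

# Price pattern quoted in the text; currency symbols are assumed to be separate tokens.
PRICE_RE = re.compile(r"\d+[.,]\d{2}")

@dataclass
class Token:
    text: str
    x: float            # left edge of the bounding box, in pixels
    y: float            # vertical center of the bounding box
    w: float            # bounding box width
    confidence: float

def cluster_rows(tokens, y_tol=8):
    """Group tokens whose vertical centers are within y_tol px of the previous token."""
    rows = []
    for tok in sorted(tokens, key=lambda t: t.y):
        if rows and abs(tok.y - rows[-1][-1].y) <= y_tol:
            rows[-1].append(tok)
        else:
            rows.append([tok])
    return [sorted(row, key=lambda t: t.x) for row in rows]

def extract_line(row, rtl=False, max_gap=100, min_conf=0.85):
    """Return the item name and line total for one row, or None if it has no price."""
    prices = [t for t in row if PRICE_RE.fullmatch(t.text)]
    if not prices:
        return None
    total = max(prices, key=lambda t: t.x)   # rightmost price is the line total
    names = [t for t in row if t not in prices]
    if rtl:
        # Item text sits to the right of the prices.
        edge = max(p.x + p.w for p in prices)
        gaps = [(t.x - edge, t) for t in names if t.x >= edge]
    else:
        # Item text sits to the left of the prices.
        edge = min(p.x for p in prices)
        gaps = [(edge - (t.x + t.w), t) for t in names if t.x + t.w <= edge]
    near = [(gap, t) for gap, t in gaps if gap <= max_gap]
    item = min(near, key=lambda pair: pair[0])[1].text if near else None
    return {"item": item, "price": total.text,
            "needs_review": total.confidence < min_conf}
```

Calling cluster_rows on the OCR tokens and then extract_line on each row yields one record per receipt line; rows without a price (store name, footer text) return None and can be handled by the metadata step.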

This rule-based approach is fragile but transparent. When it fails, logs show exactly which heuristic broke. An ML classifier for item-price association would be more robust but harder to debug and requires labeled training data we don't have at scale.

Handling Mixed RTL/LTR Layouts

Arabic receipts often embed English brand names ("Coca-Cola") and numeric prices in LTR. TrOCR processes text directionality correctly, but spatial parsing must account for reading direction.

We detect dominant receipt direction by counting Arabic vs. Latin characters. If Arabic > 40%, we reverse the left-to-right item-price association logic. Edge case: bilingual receipts (Arabic item name, English brand in parentheses). Here, bounding box width helps—wider boxes are likely the primary item name.
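A minimal sketch of the direction check follows. It assumes the 40% figure is the Arabic share of all Arabic plus Latin letters, with digits and punctuation ignored; the function names are hypothetical.

```python
def is_arabic_letter(ch):
    """True for letters in the Arabic and Arabic Supplement Unicode blocks."""
    in_block = "\u0600" <= ch <= "\u06FF" or "\u0750" <= ch <= "\u077F"
    return in_block and ch.isalpha()

def is_latin_letter(ch):
    """True for ASCII letters only (assumed sufficient for brand names)."""
    return ch.isascii() and ch.isalpha()

def dominant_direction(texts, arabic_threshold=0.40):
    """Return 'rtl' when Arabic letters exceed the threshold share, else 'ltr'."""
    arabic = sum(is_arabic_letter(c) for t in texts for c in t)
    latin = sum(is_latin_letter(c) for t in texts for c in t)
    total = arabic + latin
    if total == 0:
        return "ltr"   # only digits or symbols: fall back to LTR (assumption)
    return "rtl" if arabic / total > arabic_threshold else "ltr"
```

Counting letters only keeps prices and quantities, which are LTR on every receipt, from diluting the ratio.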

One subtle ambiguity: a comma can be a thousands separator (12,500) or, in some locales, a decimal separator (12,50). We resolve this by checking whether the last digit group has exactly two digits (decimal) or three or more (thousands). When confidence is below 0.9, a fallback applies: prefer the interpretation that makes the price plausible for the item category (a tomato for 1250 JOD is implausible).
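The separator rule and the plausibility fallback might look like this in code. It is a sketch under assumptions: the function names are hypothetical, max_plausible stands in for a per-category price ceiling the article does not specify, and the handling of a one-digit trailing group is a guess.

```python
def parse_amount(text):
    """Interpret commas using the two-digit (decimal) vs. three-plus (thousands) rule."""
    if "," not in text:
        return float(text)
    head, _, tail = text.rpartition(",")
    if len(tail) == 2:
        # Decimal comma, e.g. "12,50" or "1.250,75"; dots before it are thousands marks.
        return float(head.replace(",", "").replace(".", "") + "." + tail)
    if len(tail) >= 3:
        # Thousands comma, e.g. "12,500" or "1,250.75".
        return float(text.replace(",", ""))
    # A one-digit trailing group is not covered by the rule (assumed to be an error).
    raise ValueError(f"ambiguous amount: {text!r}")

def resolve_price(text, confidence, max_plausible, min_conf=0.9):
    """Apply the rule, then fall back to the plausible reading at low confidence."""
    value = parse_amount(text)
    single_comma = text.count(",") == 1 and "." not in text
    if confidence < min_conf and value > max_plausible and single_comma:
        alternative = float(text.replace(",", "."))
        if alternative <= max_plausible:
            return alternative
    return value
```

For instance, resolve_price("12,500", 0.8, max_plausible=50.0) reads the comma as a decimal separator because 12500 exceeds the assumed ceiling, while the same text at confidence 0.95 keeps the thousands reading.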

Performance and Cost

End-to-end latency from image upload to structured JSON:

P50: 1.8 seconds
P95: 3.4 seconds
P99: 5.1 seconds (usually complex receipts with 30+ items)
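For readers reproducing these figures from their own request logs, a nearest-rank percentile is one simple definition. The article does not say which method was used, so the choice below is an assumption.

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

With latencies logged in milliseconds, percentile(samples, 95) gives the P95 value directly; interpolating definitions give slightly different numbers on small samples.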

Server costs: $0.018 per receipt (AWS EC2 c5.2xlarge reserved instance, amortized). TrOCR inference is CPU-bound; we batch up to 16 regions per request to saturate cores. GPU instances (g4dn) were 3× faster but 5× more expensive—uneconomical for our margin.
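The CPU versus GPU trade-off reduces to simple arithmetic, sketched here. It assumes the 5× figure is the instance price ratio, the 3× figure is throughput, and both instance types are kept fully utilized; the function names are illustrative.

```python
def relative_cost_per_receipt(speedup, price_ratio):
    """Cost per receipt on the faster instance, relative to the baseline instance."""
    return price_ratio / speedup

def monthly_cost(receipts_per_month, cost_per_receipt):
    """Total monthly inference spend at a flat per-receipt cost."""
    return receipts_per_month * cost_per_receipt

# 3x faster at 5x the price: each receipt costs about 1.67x as much on GPU.
gpu_vs_cpu = relative_cost_per_receipt(speedup=3, price_ratio=5)

# 200,000 receipts at $0.018 each comes to roughly $3,600 per month on CPU.
cpu_monthly = monthly_cost(200_000, 0.018)
```

Under these assumptions the GPU only wins if latency, not cost per receipt, is the binding constraint.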

We cache preprocessed images in S3 ($0.023/GB/month) for 30 days to support user-initiated re-parsing without re-upload. This handles cases where the parser misidentified a price and the user corrects it; we reprocess the cached image with adjusted heuristics.

Accuracy in Production

We measure accuracy via user corrections. If a user edits an extracted price or item name, we log the diff. Current metrics (30-day rolling):

Price extraction: 94.1% correct (no edit needed)
Item name extraction: 87.3% correct
Total amount: 96.8% correct

Item names lag because of abbreviations ("Tom" vs. "Tomatoes") and OCR errors on low-contrast ink. We don't auto-correct these; users fix them in-app, and we use the corrections to build a domain-specific dictionary ("Tom" → "Tomatoes" for grocery context). After six months, this lifted item accuracy by 4.2 points.
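Building the dictionary from logged corrections could be as simple as the sketch below. The log shape (pairs of OCR text and user-corrected text) and the min_count guard against one-off edits are assumptions, not details from the article.

```python
from collections import Counter, defaultdict

def build_dictionary(corrections, min_count=3):
    """Map each OCR string to its most frequent user correction.

    corrections: iterable of (ocr_text, user_text) pairs from the edit log.
    min_count: assumed minimum number of agreeing edits before a mapping is kept.
    """
    counts = defaultdict(Counter)
    for ocr_text, user_text in corrections:
        if ocr_text != user_text:
            counts[ocr_text][user_text] += 1
    dictionary = {}
    for ocr_text, options in counts.items():
        best, n = options.most_common(1)[0]
        if n >= min_count:
            dictionary[ocr_text] = best
    return dictionary
```

Requiring several agreeing edits keeps a single user's idiosyncratic rename from propagating to everyone's receipts.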

Lessons and Trade-offs

Model selection: We evaluated Tesseract, EasyOCR, PaddleOCR, and TrOCR. Tesseract was fastest but worst on Arabic. PaddleOCR was a close second to TrOCR but lacked transformer-based context ("l" vs. "1" disambiguation). TrOCR's bidirectional attention resolved ambiguities better, which we judged worth the 40% latency penalty.

On-device vs. server: Running TrOCR on-device (ONNX Runtime Mobile, quantized INT8) was feasible on flagship phones (iPhone 13+, Pixel 6+) with ~4s latency. But 60% of our users have mid-range devices where latency ballooned to 12s+ and drained battery. Server-side inference is more equitable and lets us iterate models without app updates.

Rule-based parsing: An end-to-end model (LayoutLM, Donut) could learn spatial relationships, but training requires thousands of fully annotated receipts (bounding boxes + labels). We didn't have that budget. Rules are brittle but cheap to iterate. When a new receipt format breaks parsing, we add a heuristic in 20 minutes. An ML approach would need retraining.

Error handling: We show users a preview of extracted data before saving. Confidence scores below 0.85 are highlighted in yellow. Users correct ~12% of receipts, mostly item names. This human-in-the-loop pattern is essential; fully automated extraction would erode trust.
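The preview highlighting rule is a plain threshold check, sketched here. The field layout (name mapped to a value and confidence pair) and the function name are hypothetical.

```python
def fields_to_highlight(fields, threshold=0.85):
    """Return the names of extracted fields whose confidence is below the threshold.

    fields: dict mapping field name to a (value, confidence) pair (assumed shape).
    """
    return [name for name, (_value, conf) in fields.items() if conf < threshold]
```

A field exactly at 0.85 is not highlighted in this sketch, matching the "below 0.85" wording.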

Future Directions

Next steps: fine-tune a LayoutLM variant on our now-large corpus of corrected receipts (80,000+ with user edits). Initial experiments show 3-point accuracy gains on item-price association. We're also exploring on-device inference for the detection stage only (CRAFT is lightweight) to reduce server load, keeping TrOCR server-side.

For real-time pricing apps, OCR is a means, not an end. The architecture must balance accuracy, cost, latency, and debuggability. In production, hybrid approaches—mixing classical CV, transformers, and rules—often outperform pure ML when data and compute budgets are constrained.