You snap a photo of a receipt. Three seconds later, an app has somehow extracted the merchant name, line items, total, tax, and category. Most product marketing waves vaguely at "AI" and moves on. The actual pipeline is more interesting — and more limited — than the marketing makes it sound.
Here is what is happening between "you press the button" and "the receipt appears in your dashboard."
Step 1 — Image preparation
Before anything else, the app processes the image. Modern phone cameras produce images of 12 megapixels or more, and the files weigh several megabytes. Sending that to a model would be slow and expensive, and the extra resolution does not help: the model can read a receipt fine at a much lower resolution.
Most apps resize the image to around 1000 pixels on the long edge, strip EXIF metadata (which includes GPS coordinates — important for privacy), and re-encode it as a smaller JPEG. The result is small enough to send fast and clear enough for the model to read.
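The resize step comes down to a little arithmetic. Here is a minimal sketch, assuming a 1000-pixel target on the long edge. The function name target_size is hypothetical, and the actual resampling, EXIF stripping, and JPEG re-encoding would be done by an image library, which is not shown here.

```python
def target_size(width: int, height: int, long_edge: int = 1000) -> tuple[int, int]:
    """Return (width, height) scaled so the longer side is at most long_edge.

    Images that are already small enough are returned unchanged, and the
    aspect ratio is preserved.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= long_edge:
        return width, height
    scale = long_edge / longest
    # Round to whole pixels, never dropping below 1.
    return max(1, round(width * scale)), max(1, round(height * scale))


# A 12-megapixel portrait photo (3024 x 4032) shrinks to 750 x 1000.
print(target_size(3024, 4032))
```

Shrinking 3024 x 4032 to 750 x 1000 cuts the pixel count by a factor of about 16, which is where most of the upload and inference savings come from.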
This step is also where some apps quietly fail at privacy. Stripping EXIF should be the default, not a paid feature. If you are using a receipt scanner that does not strip location metadata, every receipt you photographed with location tagging turned on carries the exact GPS coordinates of where you took the photo.
Step 2 — The vision model
The processed image goes to a vision-capable language model — these days, that is usually Claude, GPT, or Gemini in their multimodal versions. The model receives the image plus a structured prompt that tells it what to extract and how to format the output.
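To make "the image plus a structured prompt" concrete, here is a rough sketch of how the two might be bundled. Everything in it is an assumption for illustration: the field list, the prompt wording, and the payload keys are hypothetical and do not match any particular vendor's API, each of which has its own request schema.

```python
import base64

# Hypothetical field list, mirroring the fields this article talks about.
FIELDS = ("merchant", "date", "total", "tax", "category", "items", "confidence")


def build_prompt(fields: tuple[str, ...] = FIELDS) -> str:
    """Build the extraction instructions that travel with the image."""
    return (
        "Extract the following fields from the attached receipt image and "
        "reply with a single JSON object and nothing else. "
        f"Fields: {', '.join(fields)}. "
        "Use null for any field you cannot read."
    )


def build_request(jpeg_bytes: bytes) -> dict:
    """Bundle image and prompt into a generic payload (not a real vendor schema)."""
    return {
        "image_base64": base64.b64encode(jpeg_bytes).decode("ascii"),
        "media_type": "image/jpeg",
        "prompt": build_prompt(),
    }


# Placeholder bytes stand in for the prepared JPEG from Step 1.
request = build_request(b"placeholder, not a real JPEG")
print(request["prompt"])
```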
This is the part of the pipeline that has improved dramatically over the last two years. As recently as 2023, this would have been an OCR engine — optical character recognition — that read the text off the receipt, followed by a separate rule-based system that tried to figure out which bits were merchant, line items, and total. OCR worked, but rigidly: change the receipt format and the rules broke.
Modern vision-language models do both jobs at once. They read the image (the OCR part) and infer the structure (the merchant / items / total / tax / category part) using semantic understanding rather than positional rules. The result is that they generalize to receipt formats they have never seen before, in languages they barely know, with merchants whose name layouts vary wildly.
Step 3 — Structured extraction (JSON output)
The model is asked to return its extraction as JSON — a specific shape with named fields. That structure matters for two reasons.
First: it forces the model to commit to specific values. "What is the merchant name?" gets answered as a string, not as a paragraph of hedging. Second: it makes the downstream code trivial — the app just parses the JSON and saves it to the database.
{
  "merchant": "Whole Foods Market",
  "date": "2026-05-15",
  "total": 14.93,
  "tax": 1.16,
  "category": "Groceries",
  "items": [
    {"name": "Organic Bananas", "price": 3.29},
    {"name": "Sourdough Loaf", "price": 5.99},
    {"name": "Cold Brew 32oz", "price": 4.49}
  ],
  "confidence": 0.94
}
The confidence field is the model's own self-rating of how sure it is about the extraction. That number is the gate for the next step.
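"Just parses the JSON" is nearly literal, though in practice a little validation helps before anything is saved. The sketch below assumes the field names from the example above. The Receipt class, the parse_receipt function, and the sample reply are hypothetical and only illustrate the idea.

```python
import json
from dataclasses import dataclass


@dataclass
class Receipt:
    merchant: str
    date: str
    total: float
    tax: float
    category: str
    items: list
    confidence: float


REQUIRED = ("merchant", "date", "total", "tax", "category", "items", "confidence")


def parse_receipt(raw: str) -> Receipt:
    """Parse the model's JSON reply, raising ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [name for name in REQUIRED if name not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    confidence = data["confidence"]
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be a number between 0 and 1")
    return Receipt(**{name: data[name] for name in REQUIRED})


# Made-up reply for illustration only.
SAMPLE = (
    '{"merchant": "Corner Cafe", "date": "2026-05-15", "total": 4.87, '
    '"tax": 0.38, "category": "Dining", '
    '"items": [{"name": "Cold Brew 32oz", "price": 4.49}], "confidence": 0.91}'
)
receipt = parse_receipt(SAMPLE)
print(receipt.merchant, receipt.total, receipt.confidence)
```

Rejecting a malformed reply early means a bad extraction surfaces as an error the app can handle, rather than as a half-filled row in the database.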
Step 4 — Confidence threshold and re-run
Most well-designed receipt scanners do not trust a single model run. They set a confidence threshold — typically around 0.7 — and if the first extraction comes back below it, they re-run the request against a more capable (and more expensive) model.
The reason is economics. The smaller, faster, cheaper model handles roughly 90% of receipts perfectly well. The harder 10% or so (wrinkled paper, unusual layouts, multilingual receipts) gets handed to the more powerful model, which costs 5–10x as much per request. You pay the higher price only on receipts that need it.
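The economics can be worked through with the figures above. This is a back-of-the-envelope sketch, assuming exactly 10% of receipts fall below the threshold and that those receipts pay for both calls (the cheap attempt plus the re-run). The function name expected_cost and the unit price of 1.0 are hypothetical.

```python
def expected_cost(cheap: float, expensive: float, fallback_rate: float) -> float:
    """Average cost per receipt when every receipt hits the cheap model first
    and only the fallback fraction is re-run on the expensive model."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must be between 0 and 1")
    return cheap + fallback_rate * expensive


cheap = 1.0  # one arbitrary cost unit per cheap-model request
for multiplier in (5, 10):
    expensive = cheap * multiplier
    tiered = expected_cost(cheap, expensive, fallback_rate=0.10)
    print(f"{multiplier}x: tiered {tiered:.2f} vs always-expensive {expensive:.2f}")
```

Under these assumptions the tiered setup averages 1.5 to 2 units per receipt, against 5 to 10 units if every receipt went straight to the larger model.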
SnapLedge's default behavior follows this pattern: most receipts go through Claude Haiku for speed and cost, but anything that comes back below the 0.7 threshold gets re-run through Claude Sonnet. The user sees only the result — but the system has quietly done the right thing under the hood.
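The threshold-and-re-run pattern is small enough to sketch in a few lines. This is an illustration of the general idea, not SnapLedge's actual code: extract_with_fallback is a hypothetical name, and the two extractors are passed in as plain functions so the sketch does not depend on any real model API.

```python
from typing import Callable

# An extractor takes image bytes and returns the parsed extraction as a dict.
Extractor = Callable[[bytes], dict]


def extract_with_fallback(
    image: bytes,
    fast: Extractor,
    strong: Extractor,
    threshold: float = 0.7,
) -> dict:
    """Try the cheap model first; re-run on the stronger one if confidence is low."""
    first = fast(image)
    # A missing confidence is treated as 0.0, so it always triggers the re-run.
    if first.get("confidence", 0.0) >= threshold:
        return {**first, "model": "fast"}
    second = strong(image)
    return {**second, "model": "strong"}


# Stub extractors standing in for real model calls.
def stub_fast(image: bytes) -> dict:
    return {"total": 14.0, "confidence": 0.55}


def stub_strong(image: bytes) -> dict:
    return {"total": 14.93, "confidence": 0.92}


result = extract_with_fallback(b"", stub_fast, stub_strong)
print(result["model"], result["confidence"])
```

Recording which model produced the final answer, as the extra "model" key does here, makes it easy to measure how often the fallback actually fires.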
Where AI receipt scanning fails
The pipeline is good, not magic. Specific failure modes show up reliably:
- Handwritten amounts. The model can read printed totals reliably; handwritten ones (waiter writing the tip on the credit card slip) often get misread.
- Faded thermal paper. Receipts from thermal printers fade over months, especially in hot environments. Past a certain fade level, the model is reading guesses.
- Multilingual receipts with non-Latin scripts. The model handles English, most European languages, and major Asian languages reasonably well. Mixed-script receipts (Chinese item names, Latin merchant name, Arabic numerals) can confuse the field extraction.
- Unusual layouts. A receipt formatted as a single dense paragraph instead of a structured list (some restaurants do this) makes line-item extraction unreliable.
- Reflective or partial photos. Glare from a phone flash, or a photo that cuts off half the receipt, will produce confident-looking output that is partly wrong.
None of these is a fatal flaw. They are predictable failure modes. The right response is design-level: surface the model's confidence number, give users a one-tap way to correct extractions, and let the system learn from corrections.
When to trust the extraction
A reasonable default: trust the model when confidence is high (above 0.85) AND the receipt is the kind of receipt the model handles well (printed, English, standard layout). Verify by sampling — check one in every ten extractions for the first month — until you have built your own intuition for where the system is reliable and where it is not.
For the receipts in the failure modes above: do not blindly trust. The model will return an answer; it will sound confident; it may be partly wrong. Glance at the totals, glance at the merchant. Two seconds of verification beats six months of subtly wrong data.
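The trust rule above can be written down as a small decision function. This is a sketch under assumptions: the name needs_review and its parameters are hypothetical, the 0.85 cutoff and one-in-ten sampling come from the defaults suggested above, and how an app would know that a receipt is printed, English, or standard-layout is left open.

```python
def needs_review(
    confidence: float,
    printed: bool,
    english: bool,
    standard_layout: bool,
    index: int,
    min_confidence: float = 0.85,
    sample_every: int = 10,
) -> bool:
    """Decide whether a person should glance at this extraction."""
    easy_receipt = printed and english and standard_layout
    # Anything outside the well-handled cases is always checked.
    if confidence <= min_confidence or not easy_receipt:
        return True
    # Trusted receipts are still spot-checked: every tenth one by position.
    return index % sample_every == 0


# A clean printed receipt at position 3 is trusted; a handwritten one is not.
print(needs_review(0.94, True, True, True, index=3))
print(needs_review(0.94, False, True, True, index=3))
```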
The bigger picture
AI receipt scanning is not a magic box; it is a pipeline with measurable steps and predictable failure modes. The best receipt-scanning apps will not market themselves as having "solved" the problem — because they have not — but they will be specific about where the system works, where it fails, and how they catch the failures.
SnapLedge's approach is to be explicit about all of this. The confidence threshold is configurable; the fallback to the more capable model is automatic; and the user always sees the extracted data before it is final — meaning the worst case is that you correct a wrong field in two seconds, not that you find out about a wrong field six months later.