Hi HN,
TLDR: Cheap (and sometimes old) models perform on par with, or better than, flagship models on standard OCR tasks, at a fraction of the cost. This conclusion comes from a benchmark we ran across 18 models and 7,560 LLM calls. The leaderboard and benchmark repo are completely open source.
Too many teams are either stuck in legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.
So we investigated the topic and open-sourced everything, including a free tool to check your own documents.
We ran 18 models from OpenAI, Anthropic, Google, and Mistral on 42 real-world documents (invoices, receipts, bills of lading, transport orders). Each model ran 10 times per document to measure reliability, not just one-shot accuracy; 7,560 API calls total.
The finding: for standard document extraction, mid-tier and older models match or beat state-of-the-art, at a fraction of the cost. In some cases the cost difference is multiple orders of magnitude for equivalent accuracy.
We also track pass^n (how reliability degrades over repeated runs, see tau-bench), cost-per-success (not just cost-per-token), and critical field accuracy. Full methodology and dataset are open source.
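For anyone curious what pass^n means in practice, here is a minimal sketch of both metrics, using the same unbiased combinatorial estimator tau-bench describes (helper names are ours, not from the benchmark repo):

```python
from math import comb

def pass_pow_n(successes: int, trials: int, n: int) -> float:
    """Estimate pass^n: the probability that n independent runs ALL
    succeed, given `successes` correct extractions out of `trials`
    observed runs (unbiased combinatorial estimator, as in tau-bench)."""
    if trials < n:
        raise ValueError("need at least n trials to estimate pass^n")
    return comb(successes, n) / comb(trials, n)

def cost_per_success(total_cost: float, successes: int) -> float:
    """Dollars spent per successful extraction, not per token."""
    return float("inf") if successes == 0 else total_cost / successes

# Example: a model that gets a document right 8 times out of 10 looks
# 80% reliable one-shot, but pass^5 = C(8,5)/C(10,5) = 56/252 ~ 0.22 --
# only a ~22% chance that five repeated runs all succeed.
```

This is why a cheaper model with 10/10 consistency can beat a flagship that scores 9/10: pass^n decays fast with any instability, and cost-per-success amplifies the gap further.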
Leaderboard: <https://www.arbitrhq.ai/leaderboards/>
Dataset + framework (GitHub): <https://github.com/ArbitrHq/ocr-mini-bench>
Or test your own documents for free: <https://app.arbitrhq.ai/benchmark-free>
Built by two founders in Antwerp. Very curious whether others have reached similar conclusions, or whether you've seen specific edge cases where the flagships still justify their price tag.