Hi HN - this is a benchmark I developed that tests various models against large samples of text, asking them to find and fix a variety of errors. Its purpose is to evaluate how good models are at proofreading (a common use case for LLMs) and how efficient they are along various axes.
I've been working on this to inform my own decisions about which models to use in my agentic word processor, but I think it's also just useful data.
I just ran GPT 5.5 and it broke Gemini's previous high score of 92.5%!
The code and run artifacts are available on GitHub: https://github.com/reviseio/errata-bench