That doesn't seem right and seems to miss GLM 5.1 and Kimi 2.6. Not to mention there is the whole argument of cost/value that Chinese OSS models have vs GPT/Claude.
No problem: they're always at most one theft away from you! 8-)
Chinese models feel strong in Japan, on kanji. But outside language tasks? Maybe Sonnet 4.5 level at most.
Do benchmarks reflect that gap in English-language regions?
Not everybody needs cutting-edge performance. Cost per token is turning out to be more important.
The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the following table:
Notably, two of the benchmarks with the biggest capability gap are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench). I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT5.5 on held-out tasks", but a less charitable reading would be "the US government reports that US models do best on benchmarks that only the US government can run". Agent benchmarking is fraught with peril [2], and a less-than-impartial benchmarker (one who disproportionately overlooks bugs/issues when evaluating certain models) can absolutely tilt the scales, so I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard
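As a toy illustration of the "tilted scales" point (all numbers hypothetical, not drawn from the CAISI report): even a modest grading leniency applied to one model's failures shifts its headline score by several points, with no change in true capability.

```python
# Toy simulation (hypothetical numbers): how a grader that waves through
# borderline failures for one model can tilt an agent benchmark.
import random

random.seed(0)

def run_benchmark(true_pass_rate, leniency, n_tasks=500):
    """Fraction of tasks scored as passing when the grader also
    credits a `leniency` fraction of genuine failures."""
    passes = 0
    for _ in range(n_tasks):
        succeeded = random.random() < true_pass_rate
        # A biased grader counts some real failures as passes.
        if succeeded or random.random() < leniency:
            passes += 1
    return passes / n_tasks

# Two models with identical true capability (60%)...
fair = run_benchmark(0.60, leniency=0.0)
favored = run_benchmark(0.60, leniency=0.15)
print(f"strict grading: {fair:.2f}, lenient grading: {favored:.2f}")
```

With a 15% chance of a failure being scored as a pass, the favored model's reported score rises by roughly six points, which is larger than the margins separating frontier models on many leaderboards.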
Chinese models are cheaper and likely to remain so due to lower energy costs.
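To make the cost-per-token argument concrete, here is a back-of-the-envelope sketch (all prices and volumes are made-up assumptions, not real quotes for any model):

```python
# Back-of-the-envelope cost comparison. All prices and token volumes
# below are hypothetical, chosen only to show why per-token cost can
# dominate the decision for high-volume workloads.
def monthly_cost(price_in_per_mtok, price_out_per_mtok,
                 in_tokens_m, out_tokens_m):
    """USD per month: per-million-token prices times volume in millions."""
    return price_in_per_mtok * in_tokens_m + price_out_per_mtok * out_tokens_m

# Assumed workload: 2,000M input / 500M output tokens per month.
frontier = monthly_cost(3.00, 15.00, 2000, 500)  # assumed frontier pricing
budget   = monthly_cost(0.30, 1.20,  2000, 500)  # assumed cheap OSS hosting
print(f"frontier: ${frontier:,.0f}/mo, budget: ${budget:,.0f}/mo")
```

Under these assumed numbers the gap is more than 10x per month, so a workload that doesn't need cutting-edge performance has a strong incentive to take the cheaper model even if it benchmarks a few points lower.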