That doesn't seem right and seems to miss GLM 5.1 and Kimi 2.6. Not to mention there is the whole argument of cost/value that Chinese OSS models have vs GPT/Claude.
No problem: they're always at most one theft away from you! 8-)
Chinese models feel strong in Japan, on kanji. But outside language tasks? Maybe Sonnet 4.5 level at most.
Do benchmarks reflect that gap in English-language regions?
Not everybody needs cutting-edge performance. Cost per token is turning out to be more important.
The source here is "CAISI Evaluation of DeepSeek V4 Pro" [1]; the US NIST ran their own benchmarks (including several internal ones) and reported the following table:
Notably, two of the benchmarks with the biggest capability gap are CAISI-internal/private ones (CTF-Archive-Diamond, PortBench). I read this as "DeepSeek is well-tuned for public benchmarks, and less generally intelligent than GPT5.5 on held-out tasks", but a less charitable reading would be "the US government reports that US models do best on benchmarks that only the US government can run". Agent benchmarking is fraught with peril [2], and a less-than-impartial benchmarker (one who disproportionately overlooks bugs/issues when evaluating certain models) can absolutely tilt the scales, so I would not be surprised if a PRC-led benchmarking of frontier models came to the opposite conclusion.

[1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...

[2] https://epoch.ai/gradient-updates/why-benchmarking-is-hard
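As a toy illustration of the "tilted scales" point (all numbers hypothetical, not drawn from the CAISI report): even a modest grading leniency applied to one model's failures shifts its headline score by several points, with no change in true capability.

```python
# Toy simulation (hypothetical numbers): how a grader that waves through
# borderline failures for one model can tilt an agent benchmark.
import random

random.seed(0)

def run_benchmark(true_pass_rate, leniency, n_tasks=500):
    """Fraction of tasks scored as passing when the grader also
    credits a `leniency` fraction of genuine failures."""
    passes = 0
    for _ in range(n_tasks):
        succeeded = random.random() < true_pass_rate
        # A biased grader counts some real failures as passes.
        if succeeded or random.random() < leniency:
            passes += 1
    return passes / n_tasks

# Two models with identical true capability (60%)...
fair = run_benchmark(0.60, leniency=0.0)
favored = run_benchmark(0.60, leniency=0.15)
print(f"strict grading: {fair:.2f}, lenient grading: {favored:.2f}")
```

With a 15% chance of a failure being scored as a pass, the favored model's reported score rises by roughly six points, which is larger than the margins separating frontier models on many leaderboards.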
Chinese models are cheaper and likely to remain so due to lower energy costs.
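To make the cost-per-token argument concrete, here is a back-of-the-envelope sketch (all prices and volumes are made-up assumptions, not real quotes for any model):

```python
# Back-of-the-envelope cost comparison. All prices and token volumes
# below are hypothetical, chosen only to show why per-token cost can
# dominate the decision for high-volume workloads.
def monthly_cost(price_in_per_mtok, price_out_per_mtok,
                 in_tokens_m, out_tokens_m):
    """USD per month: per-million-token prices times volume in millions."""
    return price_in_per_mtok * in_tokens_m + price_out_per_mtok * out_tokens_m

# Assumed workload: 2,000M input / 500M output tokens per month.
frontier = monthly_cost(3.00, 15.00, 2000, 500)  # assumed frontier pricing
budget   = monthly_cost(0.30, 1.20,  2000, 500)  # assumed cheap OSS hosting
print(f"frontier: ${frontier:,.0f}/mo, budget: ${budget:,.0f}/mo")
```

Under these assumed numbers the gap is more than 10x per month, so a workload that doesn't need cutting-edge performance has a strong incentive to take the cheaper model even if it benchmarks a few points lower.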