We finally have an answer to the debate over whether LLMs generalize to new math problems or have merely memorized the answers.
We evaluated them on the AIME 2025 I competition from *yesterday* and the results are good!
Big update to our MathArena USAMO evaluation: Gemini 2.5 Pro, which was released *the same day* as our benchmark, is the first model to achieve a non-trivial number of points (24.4%). The speed of progress is really mind-blowing.
Can LLMs actually solve hard math problems? Given the strong performance at AIME, we now go to the next tier: our MathArena team has conducted a detailed evaluation using the recent 2025 USA Math Olympiad. The results are… bad: all models scored less than 5%!
After many requests, we’ve evaluated Grok 3 on the USAMO 2025. The results are in: Grok 3 is tied with DeepSeek-R1 for second place, earning 4.76% of the total points!
🐋 Another extremely impressive release by the @deepseek_ai team. The new DeepSeek-Prover-V2 is the best formal theorem-proving model, significantly outperforming all other closed and open-source models. Method: SFT for cold start, followed by RL.
o3-mini is a really impressive model, solving 78% of the problems (pass@1 evaluated with 4 repetitions) at very low cost. DeepSeek-R1, the leading open-source reasoning model, achieves 65%, and its distilled variants also do well.
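For readers unfamiliar with the metric, here is a minimal sketch of how "pass@1 evaluated with 4 repetitions" can be computed, assuming it means the fraction of correct answers per problem, averaged over all problems. The function name `pass_at_1` and the sample data are hypothetical and are not taken from the MathArena code or results.

```python
def pass_at_1(results):
    """Estimate pass@1 from repeated runs.

    results maps a problem id to a list of booleans, one per repetition
    (True if that run produced the correct final answer).
    """
    # Per-problem success rate: correct runs divided by total runs.
    per_problem = [sum(runs) / len(runs) for runs in results.values()]
    # Average the per-problem rates over all problems.
    return sum(per_problem) / len(per_problem)


# Hypothetical data: 3 problems, 4 repetitions each (not real results).
example = {
    "p1": [True, True, True, True],
    "p2": [True, False, True, False],
    "p3": [False, False, False, False],
}

score = pass_at_1(example)
print(f"pass@1 = {score:.1%}")
```

Averaging over several repetitions reduces the run-to-run noise that a single sample per problem would have on a small problem set.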
More broadly, we are excited to launch matharena.ai as a platform to evaluate LLMs on the latest math competitions and olympiads! This is the only way to ensure problems are not in the training set, and that we can truly measure generalization.
The Grok 3 Mini model from @xai is the latest addition to our MathArena leaderboard. It takes 3rd place overall, and the most impressive thing about it is its extremely low cost per solved problem.
Congrats to @GoogleDeepMind on an impressive USAMO score! Exciting to see our MathArena benchmarks being adopted by frontier labs for evaluating mathematical reasoning.
2.5 Pro Deep Think gets an impressive score on 2025 USAMO, currently one of the hardest math benchmarks.
It also leads on LiveCodeBench, a difficult benchmark for competition-level coding, and scores 84.0% on MMMU, which tests multimodal reasoning. #GoogleIO
Latest MathArena update: Qwen3-235B-A22B from @Alibaba_Qwen is the best open-source model on MathArena as of right now, improving over DeepSeek-R1 by 14% in the overall table.
We will evaluate AIME 2025 II next week, and many competitions after that, so stay tuned! Great work by the entire team at @ETH_en and @INSAITinstitute: @JasperDeko56807, @ni_jovanovic, Ivo Petrov, @mvechev