Mislav Balunović (@mbalunovic) / X

Mislav Balunović

@mbalunovic

Research Scientist @GoogleDeepMind - Gemini Thinking

London

Joined June 2010

Pinned
Mislav Balunović
@mbalunovic
Feb 7, 2025
We finally have an answer to the debate over whether LLMs generalize to new math problems or merely memorize the answers. We evaluated them on the AIME 2025 I competition from *yesterday* and the results are good!
353K
Mislav Balunović
@mbalunovic
Apr 2, 2025
Big update to our MathArena USAMO evaluation: Gemini 2.5 Pro, which was released *the same day* as our benchmark, is the first model to achieve a non-trivial number of points (24.4%). The speed of progress is really mind-blowing.
304K
Mislav Balunović
@mbalunovic
Apr 17, 2025
And we have our first fully green row on MathArena - o4-mini-high completely solves AIME 2025 II, marking the benchmark officially saturated!
49K
Mislav Balunović
@mbalunovic
Mar 25, 2025
Can LLMs actually solve hard math problems? Given the strong performance at AIME, we now go to the next tier: our MathArena team has conducted a detailed evaluation using the recent 2025 USA Math Olympiad. The results are… bad: all models scored less than 5%!
96K
Mislav Balunović
@mbalunovic
Apr 5, 2025
After many requests, we’ve evaluated Grok 3 on the USAMO 2025. The results are in: Grok 3 is tied with DeepSeek-R1 for second place, earning 4.76% of the total points!
169K
Mislav Balunović
@mbalunovic
Apr 30, 2025
🐋 Another extremely impressive release by the @deepseek_ai team. The new DeepSeek-Prover-V2 is the best formal theorem-proving model, significantly outperforming all other closed and open-source models. Method: SFT for cold start, followed by RL
16K
Mislav Balunović
@mbalunovic
Apr 21, 2025
The evaluation of the latest @OpenAI models on the proof-based USAMO test is completed - the models take 2nd and 3rd place on the leaderboard
21K
Mislav Balunović
@mbalunovic
Feb 7, 2025
Replying to @mbalunovic
o3-mini is a really impressive model, solving 78% of the problems (pass@1 evaluated with 4 repetitions) at very low cost. DeepSeek-R1, the leading open source reasoning model, achieves 65%, and its distilled variants also do well.
17K
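A minimal sketch of how a "pass@1 evaluated with 4 repetitions" score could be computed, assuming it means the fraction of correct attempts per problem, averaged over all problems. The function name `pass_at_1` and the sample outcomes are hypothetical and are not taken from MathArena's code or results.

```python
from statistics import mean


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average per-problem success rate over repeated independent attempts.

    `results` maps a problem id to one boolean per repetition
    (True if that attempt produced the correct final answer).
    """
    if not results:
        raise ValueError("no problems to score")
    # Per-problem accuracy first, then the mean across problems.
    return mean(sum(attempts) / len(attempts) for attempts in results.values())


# Hypothetical outcomes: 3 problems, 4 repetitions each (illustrative only).
example = {
    "p1": [True, True, True, True],      # solved in every run
    "p2": [True, False, True, False],    # solved in half of the runs
    "p3": [False, False, False, False],  # never solved
}

print(f"pass@1 = {pass_at_1(example):.1%}")  # prints "pass@1 = 50.0%"
```

Averaging over several repetitions reduces the run-to-run variance of a single sampled attempt, which matters on a small problem set such as a single competition.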
Mislav Balunović
@mbalunovic
Feb 7, 2025
Replying to @mbalunovic
More broadly, we are excited to launch matharena.ai as a platform to evaluate LLMs on the latest math competitions and olympiads! This is the only way to ensure problems are not in the training set, and that we can truly measure generalization.
16K
Mislav Balunović
@mbalunovic
Apr 14, 2025
The Grok 3 Mini model from @xai is the latest addition to our MathArena leaderboard - it takes 3rd place overall and the most impressive thing about it is its extremely low cost per solved problem
17K
Mislav Balunović
@mbalunovic
May 20, 2025
Congrats to @GoogleDeepMind on an impressive USAMO score! Exciting to see our MathArena benchmarks being adopted by frontier labs for evaluating mathematical reasoning.
Google DeepMind
@GoogleDeepMind
May 20, 2025
Replying to @GoogleDeepMind
2.5 Pro Deep Think gets an impressive score on 2025 USAMO, currently one of the hardest math benchmarks. It also leads on LiveCodeBench, a difficult benchmark for competition-level coding, and scores 84.0% on MMMU, which tests multimodal reasoning. #GoogleIO
5.9K
Mislav Balunović
@mbalunovic
Apr 2, 2025
Replying to @gabrielalon_ai
I think it's basically impossible here as the model was released the same day as our benchmark
5.8K
Mislav Balunović
@mbalunovic
May 1, 2025
Latest MathArena update: Qwen3-235B-A22B from @Alibaba_Qwen is the best open source model on MathArena as of right now, improving over DeepSeek-R1 by 14% in the overall table
5K
Mislav Balunović
@mbalunovic
Feb 7, 2025
Replying to @mbalunovic
We will evaluate AIME 2025 II next week, and many competitions after that so stay tuned! Great work by the entire team at @ETH_en and @INSAITinstitute: @JasperDeko56807, @ni_jovanovic, Ivo Petrov, @mvechev
14K