Wow! 😮 claude-3.5 is an extremely impressive model overall! It achieves the top score in **every category**, and substantially improves in reasoning! See for yourself with our interactive leaderboard: livebench.ai
Nearly all questions in LiveBench are brand new, so there is no contamination, even for claude-3.5. It performs especially well on house_traversal, a spatial reasoning task which is brand new!
🚨🚨Early findings for o1-preview and o1-mini!🚨🚨
(1) The o1 family is unbelievably strong at hard reasoning problems! o1 perfectly solves a reasoning task that my collaborators and I designed for LLMs to achieve <60% performance, just 3 months ago 🤯🤯 (1 / ?)
I am thrilled to join @Caltech as a postdoc working with @AnimaAnandkumar on AutoML for Science! 🔬
Thank you to my amazing colleagues at Abacus.AI for four amazing years, and I can't wait to follow the future accomplishments at Abacus!
"A note on paper length. Expecting more text in this paper? Wondering if it’s a workshop paper we hastily submitted to ICLR? No. This paper presents a simple idea, one where we genuinely believe that a short paper presentation is more effective."
🚨Llama 3.1 405B eval just dropped🚨
🥇 in instruction following
🥈 in reasoning
On par with GPT-4o in math and coding
It’s a great day for the open-source community!!
Full evals on the challenging, contamination-free benchmark ➡️ livebench.ai
Many of the benchmarks Anthropic reported are nearly saturated, with models achieving 88-96% performance. LiveBench is not saturated, so it shows the true improvement of claude-3.5! Stay tuned for next month when we release harder tasks!
🔗: livebench.ai
#ICML2022 in Baltimore, Maryland is the first in-person general ML conference since NeurIPS 2019. July 17 to 23. Save the date!
And stay a few extra days in Baltimore to check out automl.cc!
GPT-4 was truly a team effort from our entire company, but the overall leadership and technical vision of Jakub Pachocki for the pretraining effort was remarkable, and we wouldn’t be here without it.
Here are the LiveBench scores for chatgpt-4o-latest!
It ties gpt-4o-2024-05-13, yet gpt-4o-2024-08-06 is still the best GPT model according to livebench.ai!