ZeroBench includes 100 manually-curated multi-step visual reasoning questions
Questions are curated adversarially
They span both natural and synthetic images
2/6
We evaluate 20 LMMs on our benchmark: every model scores 0% pass@1 (temperature=0) and 0% on 5/5 reliability
Several questions are tantalisingly close to current capabilities, with some models correctly answering them in a pass@5 setting
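For readers unfamiliar with these metrics, here is a minimal sketch of how they could be computed. It assumes pass@k means "at least one of k sampled answers is correct" and 5/5 reliability means "all 5 sampled answers are correct". The function names and the results dictionary are hypothetical illustrations, not ZeroBench's evaluation code or data. In the thread, pass@1 uses a single greedy decode (temperature=0); here k=1 simply takes the first sample.

```python
from typing import Dict, List


def pass_at_k(samples: Dict[str, List[bool]], k: int) -> float:
    """Fraction of questions with at least one correct answer in the first k samples."""
    if any(len(s) < k for s in samples.values()):
        raise ValueError("every question needs at least k samples")
    return sum(any(s[:k]) for s in samples.values()) / len(samples)


def k_of_k_reliability(samples: Dict[str, List[bool]], k: int) -> float:
    """Fraction of questions answered correctly in all of the first k samples."""
    if any(len(s) < k for s in samples.values()):
        raise ValueError("every question needs at least k samples")
    return sum(all(s[:k]) for s in samples.values()) / len(samples)


# Hypothetical per-question correctness flags for 5 samples each (made-up data).
results = {
    "q1": [False, False, True, False, False],
    "q2": [False, False, False, False, False],
    "q3": [True, True, True, True, True],
    "q4": [False, True, True, False, True],
}

print("pass@1:", pass_at_k(results, 1))
print("pass@5:", pass_at_k(results, 5))
print("5/5 reliability:", k_of_k_reliability(results, 5))
```

Under these assumed definitions, a question can count towards pass@5 while the model still scores zero on pass@1 and on 5/5 reliability, which is the pattern the thread describes.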
3/6
If you spot a new issue with any of the ZeroBench questions, please let us know here: docs.google.com/document/d/1qd…
(more details on our dataset refinement strategy to come shortly)
4/6
We need you, eagle-eyed folks of X!
Help us red team ZeroBench to find errors
To recognise effort, we will offer co-authorship to those who find new issues
Details below
1/5
🚨 ZeroBench: We created the HARDEST visual reasoning benchmark we could—then invited the AI community to red team our data 🔥
One month later, here's what happened... 🧵👇
🎉📢New Paper!
Introducing GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
grab-benchmark.github.io
The highest-performing model scores just 21.7%
A thread 🧵