Jonathan Roberts (@JRobertsAI) / X

Jonathan Roberts

177 posts

@JRobertsAI

PhD Student, Applied Machine Learning, University of Cambridge

Cambridge

jonathanroberts42.github.io

Joined December 2022

Pinned
Jonathan Roberts
@JRobertsAI
Feb 17, 2025
Is computer vision “solved”? Not yet. Current models score 0% on ZeroBench. 🧵 1/6
Jonathan Roberts
@JRobertsAI
Feb 17, 2025
Replying to @JRobertsAI
ZeroBench includes 100 manually-curated multi-step visual reasoning questions. Questions are curated adversarially. They span both natural and synthetic images. 2/6
Jonathan Roberts
@JRobertsAI
Feb 17, 2025
Replying to @JRobertsAI
We evaluate 20 LMMs on our benchmark and find that all models score 0% pass@1 (temperature=0) and 0% 5/5 reliability. Several questions are tantalisingly close to current capabilities, with some models correctly answering them in a pass@5 setting. 3/6
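A minimal sketch of how the three metrics named above could be computed from per-question correctness flags. It assumes pass@k means "at least one of k sampled answers is correct" and 5/5 reliability means "all five sampled answers are correct"; these definitions are inferred from the metric names, not quoted from the paper. The function names and the toy results are hypothetical. As a simplification, pass@1 here reads the first sample, whereas the post describes pass@1 as a separate temperature=0 run.

```python
from typing import Dict, List


def pass_at_k(samples: Dict[str, List[bool]], k: int) -> float:
    """Fraction of questions with at least one correct answer in the first k samples."""
    hits = sum(any(flags[:k]) for flags in samples.values())
    return hits / len(samples)


def k_of_k_reliability(samples: Dict[str, List[bool]], k: int) -> float:
    """Fraction of questions answered correctly in all of the first k samples."""
    hits = sum(len(flags) >= k and all(flags[:k]) for flags in samples.values())
    return hits / len(samples)


# Hypothetical toy data: question id -> correctness of five sampled answers.
toy_results = {
    "q1": [False, True, False, False, False],
    "q2": [True, True, True, True, True],
    "q3": [False, False, False, False, False],
    "q4": [True, False, True, True, True],
}

print(pass_at_k(toy_results, 1))           # 0.5
print(pass_at_k(toy_results, 5))           # 0.75
print(k_of_k_reliability(toy_results, 5))  # 0.25
```

Under these assumed definitions, 5/5 reliability can never exceed pass@5, which matches the pattern in the reported numbers.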
Jonathan Roberts
@JRobertsAI
Feb 17, 2025
Replying to @JRobertsAI
Project page: zerobench.github.io
Paper: arxiv.org/abs/2502.09696
Dataset: huggingface.co/datasets/jonat…
5/6
Jonathan Roberts
@JRobertsAI
Feb 17, 2025
Replying to @JRobertsAI
If you spot a new issue with any of the ZeroBench questions, please let us know here: docs.google.com/document/d/1qd… (more details on our dataset refinement strategy to come shortly). 4/6
docs.google.com
ZeroBench Refinement
The ZeroBench benchmark contains 100 manually-curated challenging visual reasoning questions. ZeroBench was constructed to be a difficult eval, largely beyond the capabilities of...
Jonathan Roberts
@JRobertsAI
Mar 18, 2025
🥳📢 GPT 4.5 is the new State of the Art on ZeroBench: 1% pass@1, 7% pass@5, 0% 5/5 reliability
Jonathan Roberts
@JRobertsAI
Feb 17, 2025
Replying to @JRobertsAI
This project was carried out with some great collaborators including @taesiri @ioanacroi @vladbogo @vishaal_urao @gyunginshin @anh_ng8 @kaihan_vis @SamuelAlbanie
Jonathan Roberts
@JRobertsAI
Apr 17, 2025
👏 Some recent ZeroBench pass@1 results:
o3: 3%
Gemini 2.5 Pro: 3%
o4-mini: 2%
Llama 4 Maverick: 0%
GPT-4.1: 0%
Jonathan Roberts
@JRobertsAI
Feb 17, 2025
We need you, eagle-eyed folks of X! Help us red team ZeroBench to find errors. To recognise effort, we will offer co-authorship to those who find new issues. Details below. 1/5
Jonathan Roberts
@JRobertsAI
Mar 28, 2025
🔥 Newly released Gemini 2.5 Pro is State of the Art on ZeroBench: 3% pass@1, 5% pass@5, 1% 5/5 reliability
Jonathan Roberts
@JRobertsAI
Mar 12, 2025
🚨 ZeroBench: We created the HARDEST visual reasoning benchmark we could—then invited the AI community to red team our data 🔥 One month later, here's what happened... 🧵👇
Jonathan Roberts
@JRobertsAI
Aug 22, 2024
🎉📢 New Paper! Introducing GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
grab-benchmark.github.io
The highest-performing model scores just 21.7%. A thread 🧵
Jonathan Roberts
@JRobertsAI
Dec 8, 2022
Although it is a language model, ChatGPT can be used for object recognition! #OpenAI #ChatGPT
Jonathan Roberts
@JRobertsAI
Mar 12, 2025
Replying to @JRobertsAI
Thanks to all those who contributed 🔥
Updated (v2):
Paper: arxiv.org/abs/2502.09696
Dataset: huggingface.co/datasets/jonat…