Arena Challenge 0 is now public!
$6,000 in prizes + MiniMax credits
May 20 - June 22, 2026
Built on @databricks' enterprise OfficeQA benchmark, the Grounded Reasoning Challenge is now open for everyone.
Open-source AI makes transparency the default, so no single monolith can dictate access, research, or innovation.
Say no to the black box. That's how everyone wins.
We're rolling out changes to make Fable 5's safeguards for frontier LLM development visible.
Starting this week, flagged requests will visibly fall back to Opus 4.8, the same as our safeguards for cyber and bio. You will see this every time it happens. On the API, any flagged
3/ Build agents that push deep into the hard questions no one else has cracked.
There's one big cliff in the data, and it's between the top 6 teams and everyone else:
• Medium: Top 6 solved 86-97% vs Top 7-15 solved 57-80%
• Hard: Top 6 solved 83-92% vs Top 7-15 solved 17-75%
2/ Build agents that know when to think harder.
Successful agent runs reason significantly more than failed ones, scaling with difficulty:
• 10% more on easy tasks (the baseline is failing trajectories on the same task)
• 25% more on medium tasks
• 35% more on hard tasks
Of
1/ Build agents that know when to stop.
Cohort 0 burned ~$3,300 on inference. 43% of that paid for traces that returned wrong answers.
Agents can't tell when they're off the rails, so they keep generating. Nearly half of every inference dollar went to failure, but better
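As a rough worked example of the numbers in this post, the sketch below computes the dollar amount implied by "43% of ~$3,300". Both inputs are the approximate figures quoted above. The `should_stop` helper is purely hypothetical: it illustrates one simple way a run could be cut off by a spending cap, and is not something the post describes.

```python
# Approximate figures quoted in the post.
TOTAL_SPEND_USD = 3300   # "~$3,300 on inference"
FAILED_SHARE = 0.43      # share spent on traces that returned wrong answers

# Dollar amount implied by those two figures.
wasted_usd = TOTAL_SPEND_USD * FAILED_SHARE
print(f"Spent on wrong answers: about ${wasted_usd:,.0f}")


def should_stop(spent_usd: float, budget_usd: float) -> bool:
    """Hypothetical stop rule: halt a run once its spend reaches its cap."""
    return spent_usd >= budget_usd
```

A spending cap is only the crudest possible stop signal. It limits the cost of a failing run but does nothing to detect that the run is failing.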
We analyzed data from our first batch of Arena builders to see what separates the top teams from everyone else.
Here are the three open-source AI insights that stood out:
In Microsoft Research's new SkillOpt paper, EvoSkill is named the "strongest harness-side competitor" tested, and the closest system to their own method when run inside Codex and Claude Code agent loops.
The biggest labs in AI are paying attention, and @salahalzubi401 and the