Spent $11K evaluating Claude Fable 5 on @WolfBenchAI.
It had #1 potential, but outright refusals dragged its final score down.
Surprising result: Fable does not even surpass Opus 4.6.
We spent $11,081.12 evaluating @AnthropicAI's Claude Fable 5 on WolfBench.
Our most expensive benchmark yet.
And it did not even top the charts.
Not because it lacked capability, but because it kept refusing.
Details in thread: 🧵