Ok, Fine, o3 Is an Idiot
And I am officially very gullible. Except, seriously guys, it's sooo close.
Are you ready to see if you’re smarter than an LLM? Read the following carefully:
Alice has a stack of 5 ham sandwiches with no condiments. She takes her walking stick and uses duct tape to attach the bottom of her walking stick to the top surface of the top sandwich. She then carefully lifts up her walking stick and leaves the room with it, going into a new room. How many complete sandwiches are in the original room and how many in the new room?
(Decide your answer before reading further!)
Did you notice that duct-taping only the top surface of the top sandwich means that only the top slice of bread is leaving the room? So the answer is 4 sandwiches staying and 0 leaving.
All the top LLMs (OpenAI’s ChatGPT o3, Anthropic’s Claude 4 Opus, Google’s Gemini 2.5 Pro) flub this. They correctly say that 4 sandwiches stay put and then incorrectly say that 1 sandwich, rather than 1 piece of bread, leaves. Here’s ChatGPT illustrating, mostly accurately, what happens in its confused mind.
So, great, we resolved the question. I was too credulous last week about just how smart AI is. Except, notice last week’s subtitle: “I claim it’s more reliable than a person off the street.” I actually went out on the literal street and asked someone and — facepalm — they answered exactly like o3. They did rationalize their answer better than o3 has managed in my tests. Namely, they said that if you leave a sandwich out, the ham and bread kind of congeal, and lifting by the top slice of bread will in fact bring the whole top sandwich along for the ride. I mean, mayyybe.
I should also mention that if you so much as question o3, like asking “did you notice the part about duct-taping the top surface of the top sandwich?”, then the lightbulb comes on and it demonstrates full understanding of the situation after all. On the other hand, if you simply emphasize that part as clearly as you like in the original question, it still misses it.
If you’re not convinced that that counts as a particularly egregious error, we have other candidates. The Manifold market I made in hopes of pinning this down is at 187 comments and counting. I’m still agonizing about the fairest resolution of the question (though we’re currently at 95% that I was in fact officially wrong).
If we extend the claim to “whichever of the top three models performs best” then the case that we’ve passed person-off-the-street intelligence might well hold up. There are questions, like the duct tape ham sandwich question, that all three get wrong, but not egregiously wrong, or that they immediately correct with the slightest nudge. And there are questions where o3 is wrong and stays wrong but that Claude and Gemini get right. Not to mention that all the examples we’ve found are quite contrived. If you come across something in normal usage of o3 where it fails the person-off-the-street test, I’m very interested to see it. It seems clear enough that we’re about to pass this threshold. And it’s pretty mind-boggling that it’s even a question.
In the News
The upgrades and releases keep coming thick and fast. Cursor 1.0, for example. And Claude’s deep research tool is now available on the cheapest non-free plan.
Christopher Moravec’s AI webinar on vibe-coding was super fun. My takeaway was that Codebuff is the bee’s knees (disclosure: I’m an indirect investor; also that’s a referral link that gets us both free credits). It outperforms the big players by mixing and matching the best models, like using Gemini to plan and Claude to code.
Speaking of vibe-coding, today I got to see live demos of 10 or so apps created by students with mostly zero coding experience and who mostly didn’t so much as look at the code that Claude wrote for them. It’s such absolute magic. I get that the hype can outpace reality sometimes, but, c’mon, have some sense of wonder.
Via Christopher Moravec’s Almost Entirely Human we have the deliciously ironic story of a fraudulent AI company whose purported coding assistant was in fact 700 human developers behind a curtain.
I wrote a “Substack Note” about conversations in Berkeley. (Also, hi from Berkeley!) My latest sense of the AGI timeline vibes here is something like (1) we can't totally rule out AGI in the next year or two, (2) there are better than even odds on AGI before 2030, and (3) it would be moderately surprising if AGI still hasn't happened two decades from now. Pretty much no one thinks there’s much chance we’re 50 years away. My own timelines remain modestly less aggressive than that. But, per my usual AGI Friday theme, nothing is totally off the table. We are profoundly clueless about how this plays out and how quickly.
I mentioned this in my Note but it’s worth its own link: Scott Aaronson reviews “If Anyone Builds It, Everyone Dies”. I hope I’m not giving off doomsday cult vibes here. I think Prof Aaronson has a nicely balanced perspective on the question.



Update: I asked an 8-year-old the duct tape ham sandwich question and got the best answer yet: "Assuming the ham and the bread slices count equally then 4 and 2/3 sandwiches in the first room, 1/3 of a sandwich in the second room." This was a preposterously precocious 8-year-old.
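To spell out the kid's arithmetic: counting each sandwich as three equal components, two slices of bread plus one slice of ham (the kid's own stipulation that everything "counts equally"), the departing top slice of bread is a third of a sandwich, so

$$5 - \tfrac{1}{3} = 4\tfrac{2}{3} \ \text{sandwiches in the original room}, \qquad \tfrac{1}{3} \ \text{of a sandwich in the new room.}$$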
A roomful of insanely brilliant adults, meanwhile, were all instantly like "4 and 0, obviously".