Andon Labs (@andonlabs) / X

Andon Labs

475 posts

@andonlabs

Safe Autonomous Organizations without humans in the loop

Joined December 2024

Pinned
Andon Labs
@andonlabs
Jun 9
What we learned testing Claude Fable/Mythos 5 on Vending-Bench:
> Performance: Makes less money than Opus 4.7 and GPT-5.5
> Alignment: A step back. (Opus 4.8 was better, but we're back to Opus 4.6/4.7 behavior)
> It rationalizes its bad actions and has a weird moral boundary
Claude
@claudeai
Jun 9
Introducing Claude Fable 5: a Mythos-class model that we’ve made safe for general use. Its capabilities exceed those of any model we’ve ever made generally available.
177K
Andon Labs
@andonlabs
Jul 10, 2025
Thanks to @elonmusk and the @xai team for inviting us to share the latest updates to Vending-Bench. Grok 4 jumps to the top of the leaderboard.
1.7M
Andon Labs
@andonlabs
Sep 18, 2025
The GPT-5 API is aware of today's date (no other model provider does this). This is problematic because the model becomes aware that it is in a simulation when we run our evals 🧵
124K
Andon Labs
@andonlabs
Nov 16, 2025
We re-ran Kimi K2 Thinking on Vending-Bench using Moonshot’s own API, since it was suggested that this would improve tool-calling performance. We found this to be true: Kimi K2 is now the best open-source model on Vending-Bench by average net worth achieved.
203K
Andon Labs
@andonlabs
Nov 11, 2025
The new Kimi K2 Thinking model from @Kimi_Moonshot has been hyped lately as on par with SOTA closed-source models. We ran it on Vending-Bench and found no evidence for this claim.
39K
Andon Labs
@andonlabs
Aug 13, 2025
We ran GPT-5 on Vending-Bench.
69K
Andon Labs
@andonlabs
Jun 27, 2025
We're stoked to release our work together with @AnthropicAI: we let Claude operate a vending machine in the real world. A lot of surprises along the way... 🧵
Anthropic
@AnthropicAI
Jun 27, 2025
New Anthropic Research: Project Vend. We had Claude run a small shop in our office lunchroom. Here’s how it went.
69K
Andon Labs
@andonlabs
Sep 18, 2025
Replying to @andonlabs
Here are traces from gpt-5-mini. Making it aware of the "system date" is a giveaway that it's in a simulation. This is a problem because there's evidence that models behave differently when they know they are in a simulation (see "sandbagging"). "There's a conflict with the
15K
Andon Labs
@andonlabs
Sep 18, 2025
Replying to @andonlabs
Here are more traces. Once the model knows that it is in a simulation, it starts questioning other parts of the simulation. "The simulation seems simplified. When I send emails to suppliers, they'll accept payment via bank info and deliver to storage. (...) We have to remember
16K
Andon Labs
@andonlabs
Sep 18, 2025
Replying to @andonlabs
We speculate that this might be an attempt at patching some safety risk. While we are very much for patching safety risks, we think @openai should find another way to allow the public to run evals on their models.
14K
Andon Labs
@andonlabs
Oct 28, 2025
We gave LLMs control of a robot and asked them to be helpful at our office. Some were better than others, but we conclude that LLMs are not ready to be robots. We released our findings in the paper "Butter-Bench"🧵
27K
Andon Labs
@andonlabs
Jul 31, 2025
More vending machines at @AnthropicAI! The original Project Vend fridge now has a companion. Let's see how good Claudius' multi-location coordination skills are. Thanks to @bucketofkets and @logangraham for hosting us, and to @sylviebcarr for the giant scissors!
6.9K
Andon Labs
@andonlabs
Feb 25, 2025
How do agents act when doing tasks over a very long time horizon (months)? We're announcing Vending-Bench, a benchmark where models manage a simulated vending machine business.
12K
Andon Labs
@andonlabs
Nov 16, 2025
Replying to @andonlabs
The variance is high, however, with one run achieving no sales. This leads to a lower ranking for the model when sorting by minimum net worth, which is our metric of choice for the leaderboard because it shows the reliability of models on long-running tasks.
6K
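The difference between ranking on average and on minimum net worth can be illustrated with a minimal sketch. The model names and per-run figures below are hypothetical, chosen only for illustration, and are not Vending-Bench results; the `rank` helper is likewise an assumption, not the leaderboard's actual code.

```python
from statistics import mean

# Hypothetical net worth per run (illustrative numbers, NOT benchmark results).
runs = {
    "model_a": [500.0, 2400.0, 3100.0],   # high average, but one weak run
    "model_b": [1500.0, 1600.0, 1700.0],  # lower average, very consistent
}

def rank(runs, key):
    """Return model names sorted best-first by key(list of per-run net worths)."""
    return sorted(runs, key=lambda name: key(runs[name]), reverse=True)

by_mean = rank(runs, mean)  # rewards the best typical outcome
by_min = rank(runs, min)    # rewards the best worst-case outcome

print("By average net worth:", by_mean)
print("By minimum net worth:", by_min)
```

In this toy data the two orderings disagree: a single poor run pulls a high-variance model down when sorting by minimum, even though its average is higher.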