My o1-based AI programming agent is now state of the art on SWE-Bench Verified! It resolves 64.6% of issues.
This is the first fully o1-driven agent we know of. And we learned a ton building it.
How it works:
• o1 with reasoning_mode high for all agent step and editing logic
• a GPT-4o-based memory component that compresses the agent’s step history
• a custom-built Python code editor toolset designed to use model context efficiently
• the ability to register
New result for my pure o1-based agent: 57.4% pass@1 on SWE-Bench Verified!
Avg cost: $7.5 per instance
Avg time: 13.5 minutes per instance
Pass@3 is 67.8%. Now I'm working on "test time compute scaling", i.e. combining or choosing the best trajectories, to push pass@1 closer to this mark.
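For anyone wondering how pass@1, pass@3, and "choosing the best trajectory" relate, here is a minimal Python sketch. It is an illustration under assumptions, not the agent's actual code: `pass_at_k`, `selected_pass_rate`, and the per-attempt scores are hypothetical names and inputs, and pass@k here is the plain "any of the first k attempts resolved it" rate rather than an unbiased estimator.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of instances where at least one of the first k attempts resolved the issue."""
    if not results:
        raise ValueError("results must contain at least one instance")
    return sum(any(attempts[:k]) for attempts in results) / len(results)


def selected_pass_rate(
    results: Sequence[Sequence[bool]],
    scores: Sequence[Sequence[float]],
) -> float:
    """Pass rate when exactly one attempt per instance is kept: the highest-scoring one.

    `scores` is a hypothetical per-attempt quality signal (e.g. from a ranking model).
    With a perfect scorer this reaches pass@k; with a random one it falls back to pass@1.
    """
    if not results:
        raise ValueError("results must contain at least one instance")
    picked = []
    for attempts, attempt_scores in zip(results, scores):
        # Index of the attempt the scorer likes best for this instance.
        best = max(range(len(attempts)), key=lambda i: attempt_scores[i])
        picked.append(attempts[best])
    return sum(picked) / len(picked)


if __name__ == "__main__":
    # Toy data: 4 instances, 3 attempts each (True = issue resolved).
    results = [
        [True, False, False],
        [False, False, True],
        [False, False, False],
        [False, True, True],
    ]
    # Made-up scores for the same attempts.
    scores = [
        [0.9, 0.1, 0.2],
        [0.3, 0.2, 0.8],
        [0.5, 0.4, 0.1],
        [0.6, 0.2, 0.7],
    ]
    print(pass_at_k(results, 1), pass_at_k(results, 3), selected_pass_rate(results, scores))
```

The gap between pass@1 and pass@3 is the headroom a trajectory selector can recover, which is why the selection step is worth working on.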
I'm very excited to announce Weave, our new tools to track and evaluate your LLM apps.
Use Weave to:
🍩log and version LLM interactions and surrounding data, from development to production
🍩experiment with prompting techniques, model changes, and parameters
🍩evaluate your
o1 is a different beast. It's better at doing exactly what you say. It's better at solving hard coding problems. And the advice others have given, to specify the outcome you want and give it room to operate, is spot on.
For readers: there were just more than 2,000 people in a Twitter space for an hour, many of them well-respected folks in the space, with @iruletheworldmo promising to speak. 🍓 did not speak. Conclusion: do not waste your time.
And I built a new TypeScript-based agent framework called phaseshift that's deeply integrated with Weave. I'm excited to polish it up and release it to the world!
Today we announced that we are being acquired by @CoreWeave, the AI Hyperscaler. 🪄🐝
We could not be prouder or more excited to join forces with this team.
Our CEO, @l2k, wrote a blog post with more details:
wandb.ai/wandb/wb-annou…