Abhay Singhal (@_AbhaySinghal) / X

Abhay Singhal

64 posts

@_AbhaySinghal

@FactoryAI | prev. @GoogleDeepMind, @StanfordAILab

Stanford, CA

Joined January 2021

Abhay Singhal
@_AbhaySinghal
Oct 7, 2025
Open-source models are production-ready to power agents. In Droid, GLM-4.6 achieves 43.5% on Terminal-Bench. It outperforms Sonnet 4 in Claude Code and approaches frontier performance. Sparse mixture-of-experts architectures make self-hosting practical: • GPT-OSS-120B: 38% on
Factory
@FactoryAI
Oct 7, 2025
Starting today, you can use any open-source model to power your Droids. Droids achieve the highest scores across all open-source models on Terminal-Bench. We find GLM-4.6 to be the most performant, remarkably achieving a score in Droid that beats Sonnet 4 in Claude Code.
54K
Abhay Singhal
@_AbhaySinghal
Oct 1, 2025
Droid + Sonnet 4.5 is essentially at parity with Droid + Opus 4.1, but 5x cheaper. @FactoryAI’s Droid, powered by the latest Sonnet model, has achieved a leading Terminal-Bench score of 57.5%.
88K
Abhay Singhal
@_AbhaySinghal
Sep 26, 2025
Huge effort from the team, proud to be a part of it 📈
Matan Grinberg
@matanSF
Sep 26, 2025
How did a team of 4 research engineers beat $100B labs like OpenAI and Anthropic in establishing the best coding agent? It starts with having a killer engineer on your team. In our case, we have @_AbhaySinghal 🧵
51K
Abhay Singhal
@_AbhaySinghal
Sep 26, 2025
Factory
@FactoryAI
Sep 25, 2025
Replying to @FactoryAI
Droid has reached #1 on Terminal-Bench, the most challenging general software development benchmark, outperforming popular tools like Claude Code and Codex CLI. Terminal-Bench goes beyond just coding and evaluates agents on a broader set of tasks to modernize legacy code, debug
28K
Abhay Singhal
@_AbhaySinghal
Nov 18, 2025
Excited to share Gemini 3 Pro in Droid! @GoogleDeepMind really cooked with this one. I've found it's great for end-to-end tasks as it manages and adheres to its plan and executes effectively even with long contexts. Dynamic reasoning gives a good balance between interactivity
Factory
@FactoryAI
Nov 18, 2025
Gemini 3.0, meet Droid.
9.2K
Abhay Singhal
@_AbhaySinghal
Nov 15, 2025
Replying to @rudzinskimaciej
In Droid Core, each GLM-4.6 token counts as 0.25 Factory tokens, so like-to-like, it's actually $500 for 2B tokens!
2.5K
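The arithmetic in the reply above can be checked with a short sketch. It assumes the $500 figure refers to 2B raw GLM-4.6 tokens and uses only the 0.25 multiplier stated in the post. The helper names `to_factory_tokens` and `implied_usd_per_million` are hypothetical and not part of any Factory API, and the per-million rates are derived from the post's numbers, not quoted prices.

```python
from fractions import Fraction

# Multiplier stated in the post: one GLM-4.6 token counts as 0.25 Factory tokens.
GLM_46_MULTIPLIER = Fraction(1, 4)


def to_factory_tokens(model_tokens, multiplier=GLM_46_MULTIPLIER):
    """Convert raw model tokens into billed Factory tokens (hypothetical helper)."""
    return model_tokens * multiplier


def implied_usd_per_million(total_usd, tokens):
    """Derive a per-million-token rate from a total cost and a token count."""
    return Fraction(total_usd) / (Fraction(tokens) / 1_000_000)


glm_tokens = 2_000_000_000  # 2B raw GLM-4.6 tokens, as in the post
factory_tokens = to_factory_tokens(glm_tokens)

print(factory_tokens)                                       # 500000000
print(float(implied_usd_per_million(500, glm_tokens)))      # 0.25
print(float(implied_usd_per_million(500, factory_tokens)))  # 1.0
```

Under these assumptions, 2B GLM-4.6 tokens are billed as 500M Factory tokens, which works out to $0.25 per million raw tokens, or $1 per million Factory tokens.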
Abhay Singhal
@_AbhaySinghal
Oct 1, 2025
Replying to @_AbhaySinghal
Driving this improvement, relative to Sonnet 4 and Opus 4.1, we see: • 70-80% reduction in file editing errors • 10-33% more tool calls • More frequent parallel tool calling • Deeper, more critical thinking • Improved environment, context, and state awareness
2.8K
Abhay Singhal
@_AbhaySinghal
Oct 1, 2025
Replying to @_AbhaySinghal
We’ve been using Sonnet 4.5 internally for the last week. It’s a strong daily driver, powering fast and accurate end-to-end task completion and interactive development. We’re excited to continue pushing performance across different workflows. How have you been using Sonnet 4.5?
2.4K
Abhay Singhal
@_AbhaySinghal
Sep 29, 2025
Replying to @matanSF
coming soon 📊
576
Abhay Singhal
@_AbhaySinghal
Oct 1, 2025
Replying to @EvanDeKim and @FactoryAI
We've found it to be very similar in practice! Although Opus does shine in some of the trickiest tasks, this model is snappier and very accurate despite being 1/5 the price
673
Abhay Singhal
@_AbhaySinghal
Oct 1, 2025
Replying to @Klaudioz and @FactoryAI
Reasoning was disabled for this eval. We've seen increased reasoning be especially effective with this model for more complex tasks!
693
Abhay Singhal
@_AbhaySinghal
Nov 15, 2025
Replying to @matanSF @lemon07r and @FactoryAI
Coming soon 📈
160
Abhay Singhal
@_AbhaySinghal
Oct 1, 2025
Replying to @moinulmoin and @FactoryAI
coming soon!
37
Abhay Singhal
@_AbhaySinghal
Oct 8, 2025
Replying to @m_talhaashraf
We've seen good results with Fireworks, Baseten, and DeepInfra
215