EvoClaw: Evaluating AI Agents on Continuous Software Evolution

NEW Claude Opus 4.7 (xhigh, 200K context) leads the overall leaderboard at 39.81%.
NEW GPT-5.5 (xhigh) takes the #2 official spot at 37.77%.
NEW Kimi K2.6 is the best open-source model at 34.69%, but uses the most turns; GLM-5.1 ranks #2 open-source at 28.77% with about half the turns.

Long-running agents build customized software (a “Claw”) to interact with their environments. To be practical for complex, real-world tasks, these agents must evolve this software fully autonomously in response to a continuous stream of end-user requirements. EvoClaw evaluates how well frontier LLM agents handle this continuous development, benchmarking them against real-world evolution trajectories drawn from open-source repositories.
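To make the setup concrete, here is a minimal sketch of the kind of evaluation loop described above: requirements arrive one at a time, the agent updates its codebase, and the fraction of resolved requirements is scored. This is purely illustrative; the names (`Requirement`, `evolve`, `run_stream`) and the trivial resolution check are assumptions, not EvoClaw's actual harness or scoring formula.

```python
# Illustrative sketch only -- NOT EvoClaw's real harness.
# Requirement, evolve, and run_stream are hypothetical names.
from dataclasses import dataclass


@dataclass
class Requirement:
    """One end-user request arriving in the evolution stream."""
    description: str


def evolve(codebase: dict, req: Requirement) -> dict:
    """Stand-in for the agent: returns an updated codebase.

    A real agent would edit files, run tests, and iterate over turns.
    """
    updated = dict(codebase)
    updated[req.description] = "implemented"
    return updated


def run_stream(requirements: list[Requirement]) -> float:
    """Feed requirements sequentially and score the resolve rate."""
    codebase: dict = {}
    resolved = 0
    for req in requirements:
        codebase = evolve(codebase, req)
        if req.description in codebase:  # stand-in resolution check
            resolved += 1
    return resolved / len(requirements)


score = run_stream([Requirement("add login"), Requirement("fix crash")])
print(score)  # → 1.0 with this toy always-succeeding agent
```

The key property the benchmark stresses is the *sequential* dependence: each requirement is applied to the codebase as left by all previous ones, so early mistakes compound across the stream.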

Overall Cost / Performance on EvoClaw

Leaderboard

Cell highlighting: Best · 2nd Best · Worst
| # | Model | Agent | Score (%) | Precision (%) | Recall (%) | Resolve (%) | Cost ($) | Out Tok. (K) | Time (h) | Turns |
|---|-------|-------|-----------|---------------|------------|-------------|----------|--------------|----------|-------|