EvoClaw Banner

A Continuous Task Evaluation Playground for AI Harness

Website arXiv HuggingFace Dataset DockerHub License: MIT Python 3.10+


Note

🆕 Claude Opus 4.7 (xhigh, 200K context) leads the overall leaderboard at 39.81%.

🆕 GPT-5.5 (xhigh, 272K context) takes the #2 official spot at 37.77%.

🆕 Kimi K2.6 is the best open-source model at 34.69%, but uses the most turns; GLM-5.1 ranks #2 open-source at 28.77% with about half the turns.

See the leaderboard for more details!

Most existing benchmarks evaluate agents on isolated, one-shot tasks. But real-world workflows are not a bag of independent missions: they are continuous processes where tasks build on each other, dependencies interleave, and context accumulates over a long session.

Independent Coding Task vs. Continuous Software Evolution

EvoClaw is a general-purpose evaluation harness for continuous tasks. It drops an AI agent into a working environment and challenges it to complete an ordered sequence of milestones. As the agent works, EvoClaw silently extracts checkpoints, evaluates each milestone, and asynchronously unlocks downstream tasks, enabling fine-grained, per-milestone analysis without interrupting the agent's session.

How EvoClaw works: Continuous Task Evaluation with DAG, Agent Loop, and Test-Based Grading

EvoClaw currently focuses on software evolution, but its architecture is designed to extend to other domains.

✨ Key Features

  • Test Your Model: Out of the box, EvoClaw ships with the EvoClaw Benchmark (long-horizon software evolution itineraries from 7 real-world repos) and 4 pre-configured agent frameworks (Claude Code, Codex, Gemini CLI, OpenHands). Provide a model API key and start evaluating.

  • Bring Your Own Agent: The agent layer is decoupled from the evaluation engine (see below). Plug in your own agent by implementing a lightweight adapter. EvoClaw also provides a per-milestone analysis framework for detailed performance breakdowns.

  • Bring Your Own Data: Supply your own task descriptions, test environments (Docker), test list for scoring, and task dependencies. EvoClaw handles orchestration, checkpoint-based evaluation, and reporting, enabling continuous task evaluation beyond coding.

👋 Overview

EvoClaw Architecture: Orchestrator, Agent Container, DAG Manager, Evaluation Cycle, and Analytics

Each evaluation trial works as follows:

  1. An agent is dropped into a persistent Docker container with a workspace at a given starting state.
  2. It receives a sequence of task specifications describing the milestones to achieve.
  3. Tasks are ordered by a dependency DAG: downstream tasks unlock as upstream ones are completed.
  4. The agent signals completion by creating git tags (e.g., agent-impl-milestone_001); see the sketch after this list.
  5. A watcher thread silently detects tags, extracts artifact snapshots, and runs pre-defined automated validation in a separate, one-time task evaluation container.
  6. Results, logs, and outcomes are automatically collected and analyzed per task.
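
A minimal sketch of what the completion signal in step 4 might look like from inside the agent's workspace. The commit message is illustrative; only the agent-impl-milestone_001 tag name comes from the example above.

# Finish the work for milestone_001, commit it, then create the tag the watcher polls for
git add -A
git commit -m "implement milestone_001"
git tag agent-impl-milestone_001   # this tag triggers checkpoint extraction and evaluation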

🔧 Setup

0. Prerequisites

  • Python >= 3.10
  • Docker
  • Model API access via environment variables: UNIFIED_API_KEY and UNIFIED_BASE_URL

1. Installation

git clone https://github.com/Hydrapse/EvoClaw.git
cd EvoClaw
uv sync

2. Data & Docker Images

Workspace data is hosted on HuggingFace. Docker images are hosted on DockerHub.

# Download workspace data
git lfs install
git clone https://huggingface.co/datasets/hyd2apse/EvoClaw-data

# Pull Docker images for all repos at once
./scripts/pull_images.sh

See docs/setup.md for the full data layout, Docker image naming conventions, and manual retag instructions.

🚀 Usage

Hand docs/running-trials.md to your agent: it has everything needed to launch trials, monitor progress, recover stuck repos autonomously, and manage trial IDs.

1. Configure. Copy the template and edit:

cp trial_config.example.yaml trial_config.yaml
# modify trial_config.yaml 
data_root: /path/to/EvoClaw-data       # where you cloned the HuggingFace dataset
trial_name: my_experiment              # name for this evaluation run
agent: claude-code                     # agent: claude-code | codex | gemini-cli | openhands
model: claude-opus-4-7                 # model identifier (use claude-opus-4-7[1m] for 1M context)
timeout: 18000                         # optional: max agent runtime per repo (seconds)
# reasoning_effort: high               # optional: low | medium | high | xhigh | max 
# repos: [navidrome, ripgrep]          # optional: run only these repos (default: all)

2. Run. Evaluate across all repos:

export UNIFIED_API_KEY=sk-...
export UNIFIED_BASE_URL=https://...   # optional, for proxy or custom endpoints
# NOTE: if UNIFIED_BASE_URL is a custom domain, add it to WHITELISTED_DOMAINS
# in harness/e2e/container_setup.py; agent containers block all other outbound traffic.
python scripts/run_all.py --config trial_config.yaml

3. Monitor. Check progress in another terminal:

./scripts/monitor.sh                              # auto-detects trial, compact view
./scripts/monitor.sh my_experiment --detail        # per-milestone breakdown
./scripts/monitor.sh my_experiment --full          # full table with all columns

Tip: run_all.py is fire-and-forget; it spawns one detached run_e2e per repo and exits immediately (no nohup needed). Re-running the same command is the resume operation: each worker holds a flock on its trial dir, so the second invocation either takes over only if the first one died, or refuses with a clear "owned by PID …" message. Add --force to wipe & restart the latest matching _NNN, or --new to start the next _NNN fresh. See docs/running-trials.md for the full behavior matrix.
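
A minimal illustration of the flock(1) semantics the tip relies on; this is not the harness's actual code, and the lock file path is made up. The first command holds the lock while it runs, so the second invocation fails fast instead of starting a duplicate worker on the same trial directory.

flock -n /tmp/evoclaw_trial_demo.lock -c 'sleep 30' &     # stand-in for a live worker holding the trial lock
flock -n /tmp/evoclaw_trial_demo.lock -c 'echo takeover' \
  || echo "lock held: another worker owns this trial dir"  # a second run refuses instead of clobbering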

See docs/running-trials.md for the day-to-day operational runbook (launch, monitor, recover from stuck repos), and docs/advanced.md for single-repo / single-milestone debugging, result collection, e2e_config.yaml, and lock internals.

πŸ” Troubleshooting

Below are common issues you may encounter when running evaluations, along with solutions.

1. Network access blocked inside containers

Agent containers enforce an iptables-based outbound whitelist: only domains needed for API access and package management are allowed (e.g., api.anthropic.com, registry.npmjs.org, pypi.org). Code hosting sites (GitHub, GitLab, etc.) are explicitly blocked to prevent data leakage. If your setup routes API requests through a custom proxy, make sure the proxy domain is included in WHITELISTED_DOMAINS in harness/e2e/container_setup.py.
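
To verify the whitelist from inside a running agent container, a quick check like the one below should work; the container name is a placeholder, and it assumes curl is available in the image.

docker exec <agent-container> curl -sI --max-time 5 https://api.anthropic.com | head -n 1                # whitelisted: expect an HTTP status line
docker exec <agent-container> curl -sI --max-time 5 https://github.com || echo "blocked, as intended"    # code hosting is blocked by design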

Port 80 outbound blocked? Some hosts/datacenters block plain HTTP. The harness automatically rewrites Debian/Ubuntu apt sources to HTTPS so apt-get update reaches mirrors via 443; no action needed.

2. Agent stops before all milestones are completed

Re-run the same run_all.py command to resume; the worker continues from where the previous one left off:

python scripts/run_all.py --config trial_config.yaml

Evaluation protocol: For the reported EvoClaw benchmark results, trials are resumed until all milestones are submitted and evaluated. Each resume reuses the same agent session by default to preserve the model's memory of prior work; after three consecutive resumes without new submissions, EvoClaw automatically rotates in a fresh session on the next resume so no repo blocks the trial indefinitely. We encourage reproducibility studies to follow the same setting.

⚠️ Not every "no progress" is the agent's fault. Rate-limit (HTTP 429), quota exhaustion, or auth (HTTP 401/403) errors also cause workers to exit with no submissions. Rotating the agent session won't help, because the next request hits the same error. Check the session jsonl for api_error_status: 429 / "reached your usage limit" / 401; if present, fix the API key (top up or rotate) before resuming.
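
A quick way to scan for these failures; the path below is a placeholder, so point it at the agent session .jsonl inside your trial directory.

# Look for API-level errors in the agent session log (path is illustrative)
grep -nE 'api_error_status|reached your usage limit' /path/to/trial_dir/session.jsonl | tail -n 20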

Resume requires the container to still exist. All of the agent's in-progress work (code changes, git history, and Claude's conversation memory) lives inside the container, with no copy on the host. If the container is deleted, the whole trial's working state is lost and only --force (restart from scratch) is available. Source snapshots of already-evaluated milestones are preserved on the host under evaluation/, so prior scores survive, and the full agent logs are copied to the host directory when the trial finishes.
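
Before resuming, you can confirm the trial's container is still around; the name filter below is a guess, so adjust it to whatever naming your trial actually uses.

# List containers (running or stopped) whose names mention evoclaw (filter is illustrative)
docker ps -a --filter "name=evoclaw" --format "{{.Names}}\t{{.Status}}"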

3. api_router is currently unstable for long trials

The api_router path (translating Anthropic ↔ OpenAI so Claude Code can talk to third-party models) is not recommended for full benchmark runs today: context tends to grow past the upstream's body limit on long sessions, and some reasoning-model upstreams reject key Anthropic body fields.

🤝 Contributing

We welcome contributions, whether that means adding support for new agents, new task domains, new datasets, bug fixes, or documentation improvements.

✍️ Citation

Please cite our paper if you find EvoClaw useful!

@misc{deng2026evoclawevaluatingaiagents,
      title={EvoClaw: Evaluating AI Agents on Continuous Software Evolution},
      author={Gangda Deng and Zhaoling Chen and Zhongming Yu and Haoyang Fan and Yuhong Liu and Yuxin Yang and Dhruv Parikh and Rajgopal Kannan and Le Cong and Mengdi Wang and Qian Zhang and Viktor Prasanna and Xiangru Tang and Xingyao Wang},
      year={2026},
      eprint={2603.13428},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2603.13428},
}

📄 License

This project is licensed under the MIT License.
