THU-Team-Eureka/EurekAgent

EurekAgent

Define the problem and metric. Get breakthrough results.

License: AGPL-3.0 · arXiv: 2606.13662 · Python · Claude Code · Docker · uv

Define your problem and evaluation criteria — EurekAgent coordinates off-the-shelf CLI agents to propose diverse approaches, implement them, run experiments, and iterate. Human intervention is optional but supported at every step.

News · Overview · Quick Start · New Problem · Useful Tips · Results · Contributing · Commercial Licensing · Citation

📰 News

  • 2026/06/13 — EurekAgent has been accepted to the BAAI Agent4S workshop! Join us for our presentation at the BAAI conference on June 13th, 2026 in Beijing. Slides will be available soon.
  • 2026/06/12 — v0.1.0 released!

🔍 Overview

We present EurekAgent, an agent system for metric-driven autonomous scientific discovery.

Demo video: demo-EurekAgent-compressed.1.1.mp4

Highlights

  • Environment engineering first — provides strong CLI agents with the resources, constraints, artifacts, budgets, and human interfaces needed for reliable autonomous discovery.
  • End-to-end research loop — proposes approaches, implements code, evaluates submissions, and iterates toward better results.
  • Problem-defined evaluation — uses your INSTRUCTION.md, SUBMISSION_FORMAT.md, and private evaluate.py as the source of truth.
  • Isolated execution — runs agent work and grading in separate Docker containers for secure, sandboxed experiments.
  • Resumable long runs — flexibly interrupt and resume a run from persisted state.
  • User-friendly interfaces — optionally chat with agents through the TUI, and track live cost stats, score evolution, and full session logs in the web monitor.

🚀 Quick Start

1. Install Docker and Node.js 22+

Docker — follow the official guide for your platform. Then add your user to the docker group:

sudo usermod -aG docker $USER
# Verify that the user is now in the docker group (log out and back in for the change to take effect)
groups $USER

Node.js 22+ — the agent container is built on the node:22-bookworm image, so install a matching Node.js 22+ runtime on the host as well (from nodejs.org or via nvm) and confirm:

nvm install 22
node --version   # must be v22 or newer

2. Install Claude Code

EurekAgent drives the experiment loop through Claude Code. It runs both on your host (for the /generate-inputs skill and problem authoring) and inside the agent container (preinstalled by the Docker image below).

a) Install Claude Code on the host (requires Node.js 22+ from Step 1):

npm install -g @anthropic-ai/claude-code
claude --version   # sanity check

b) Authenticate and point Claude Code at your model endpoint. EurekAgent forwards these into the agent container, so configure them once in ~/.claude/settings.json under the "env" block:

{
  "env": {
    "ANTHROPIC_AUTH_TOKEN": "YOUR_KEY_HERE",
    "ANTHROPIC_BASE_URL": "YOUR_BASE_URL_HERE",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.1",
    "API_TIMEOUT_MS": "3000000"
  },
  "model": "sonnet"
}
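Before launching a run, it can help to confirm that the settings file actually carries the values above. The following is a minimal sketch, not part of EurekAgent: the helper name `missing_env_keys` and the choice of which keys count as required are assumptions made for illustration.

```python
import json
from pathlib import Path

# Assumption: these two keys are treated as mandatory for this check only.
REQUIRED_ENV_KEYS = ("ANTHROPIC_AUTH_TOKEN", "ANTHROPIC_BASE_URL")


def missing_env_keys(settings: dict) -> list[str]:
    """Return required keys that are absent, empty, or still a YOUR_... placeholder."""
    env = settings.get("env", {})
    missing = []
    for key in REQUIRED_ENV_KEYS:
        value = env.get(key)
        if not value or str(value).startswith("YOUR_"):
            missing.append(key)
    return missing


def check_settings_file(path: Path) -> list[str]:
    """Load a Claude Code settings file and report which required keys need attention."""
    settings = json.loads(path.read_text(encoding="utf-8"))
    return missing_env_keys(settings)


# Example usage on the host (hypothetical helper, uncomment to run):
# print(check_settings_file(Path.home() / ".claude" / "settings.json"))
```

An empty list means both values are set to something other than the placeholders shown in the template.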

3. Install Python dependencies

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# Clone and enter the project
git clone https://github.com/THU-Team-Eureka/EurekAgent.git && cd EurekAgent

# Install uv-managed Python 3.12
uv python install 3.12.12

4. Pull the base image and build the container

docker pull node:22-bookworm
bash docker/build.sh

Verify the image is available:

docker images | grep eureka-agent

If you are behind a proxy or docker pull fails, see the Docker troubleshooting guide.

5. (Recommended) Configure MCP servers for web access

During a run the agent can search the web for problem context and read live pages. These MCP servers are optional — when absent, the agent falls back to Claude Code's built-in WebSearch. web-search-prime is intended for GLM users; users of other model providers can skip it or configure their preferred search MCP.

a) web-search-prime — structured web search for GLM users only

claude mcp add -s user -t http web-search-prime https://api.z.ai/api/mcp/web_search_prime/mcp --header "Authorization: Bearer YOUR_KEY_HERE"

b) playwright (fetches and reads actual webpage content):

claude mcp add playwright npx @playwright/mcp@latest
npx playwright install chromium        # pre-install the headless browser

EurekAgent ships a Playwright config at .claude/playwright-mcp.json (headless Chromium, sandbox flags, timeouts). It is mounted read-only into the agent container automatically — create or edit that file to match your network (e.g. add a proxy) if needed.

6. Run an example

bash examples/circle_packing/run.sh

🧠 Setting Up a New Problem

You can use the /generate-inputs skill in Claude Code to interactively generate all required files (INSTRUCTION.md, SUBMISSION_FORMAT.md, evaluate.py, run.sh) from a natural language description of your problem. Just type /generate-inputs and follow the prompts.

Each problem lives in its own directory under examples/. You need the following files:

Required Files

| File | Purpose | Required? |
| --- | --- | --- |
| INSTRUCTION.md | Problem description for the LLM agent | Yes |
| SUBMISSION_FORMAT.md | JSON schema for candidates + score semantics | Yes |
| hidden_eval_dir/evaluate.py | Private evaluator with grade_submission and is_better | Yes |
| initial.py | Starting code for the agent | Recommended |
| run.sh | Convenience script to launch a run | Recommended |

evaluate.py Specification

The evaluator is the single source of truth for scoring and comparison. It must define two functions:

grade_submission(submission_path: str, context: dict) -> dict

Called by the secure grader server to score a candidate submission.

  • Parameters:
    • submission_path: path to the JSON file the agent submitted
    • context: dict with workspace_root, approach_id, metadata
  • Returns a dict with:
    • score (float): the raw objective value. Do NOT negate. Return the value as-is (e.g., the C5 value for a minimization problem, or sum of radii for a maximization problem).
    • valid (bool): whether the submission is valid
    • message (str): human-readable feedback
    • opt_target_met (bool, optional): whether an optimization target was met
    • public_metrics (dict, optional): additional metrics for display
  • Invalid submissions: return a score that can never be "best". Use float("inf") for minimization problems, float("-inf") for maximization, or float("inf") for approach-target problems.
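As an illustration of the contract above, here is a minimal sketch of `grade_submission` for a hypothetical minimization problem. The `"values"` key and the sum-of-squares objective are invented for this example and are not taken from any EurekAgent problem; only the function signature and the returned keys follow the specification.

```python
import json
import math


def _is_finite_number(v) -> bool:
    """True for finite ints/floats; rejects bools, NaN, infinities, and oversized ints."""
    if isinstance(v, bool) or not isinstance(v, (int, float)):
        return False
    try:
        return math.isfinite(v)
    except OverflowError:  # int too large to convert to float
        return False


def grade_submission(submission_path: str, context: dict) -> dict:
    """Score a candidate for a toy minimization problem (lower is better)."""
    # Minimization: an invalid submission gets +inf so it can never be "best".
    invalid = {"score": float("inf"), "valid": False}
    try:
        with open(submission_path, "r", encoding="utf-8") as f:
            submission = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return {**invalid, "message": f"could not read submission: {exc}"}

    values = submission.get("values") if isinstance(submission, dict) else None
    if not isinstance(values, list) or not values:
        return {**invalid, "message": "'values' must be a non-empty list"}
    if not all(_is_finite_number(v) for v in values):
        return {**invalid, "message": "'values' must contain only finite numbers"}

    # Toy objective (assumption): raw sum of squares, returned as-is, not negated.
    score = sum(float(v) * float(v) for v in values)
    if not math.isfinite(score):
        return {**invalid, "message": "objective overflowed"}
    return {
        "score": score,
        "valid": True,
        "message": f"sum of squares = {score:.6f}",
        "public_metrics": {"num_values": len(values)},
    }
```

The `context` argument is accepted but unused here; a real evaluator could read `workspace_root`, `approach_id`, or `metadata` from it as described above.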

is_better(new_score: float, old_score: float) -> bool

Defines which score is better. Called by the system to compare scores for ranking, best-result tracking, and display.

  • Returns: True if new_score represents a better result than old_score
  • Examples:
    • Minimization: return new_score < old_score
    • Maximization: return new_score > old_score
    • Approach target (e.g., π): return abs(new_score - 3.14159) < abs(old_score - 3.14159)

Both functions are required. The system will fail at startup if either is missing.
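The three comparison styles listed above can be sketched as follows, together with a check that the recommended invalid-submission sentinels never win a comparison. The function names with suffixes and the `TARGET` constant are illustrative assumptions; in a real evaluate.py only one comparison would be defined, under the name `is_better`.

```python
import math

TARGET = math.pi  # hypothetical target for an approach-target problem


def is_better_minimize(new_score: float, old_score: float) -> bool:
    """Lower raw score wins."""
    return new_score < old_score


def is_better_maximize(new_score: float, old_score: float) -> bool:
    """Higher raw score wins."""
    return new_score > old_score


def is_better_target(new_score: float, old_score: float) -> bool:
    """The score closer to TARGET wins."""
    return abs(new_score - TARGET) < abs(old_score - TARGET)


# A real evaluate.py would expose exactly one of these as is_better, e.g.:
is_better = is_better_minimize

# The invalid-submission sentinels can never beat a finite score.
assert not is_better_minimize(float("inf"), 0.5)
assert not is_better_maximize(float("-inf"), 0.5)
assert not is_better_target(float("inf"), 3.0)
```

Because every comparison is strict, two equal scores (including two sentinels) never replace each other.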

INSTRUCTION.md

Must clearly state:

  • The optimization objective and its direction (minimize, maximize, approach target, etc.)
  • Constraints and validation rules
  • Known best results (if any) or target score
  • The contract for the run() function

SUBMISSION_FORMAT.md

Must describe:

  • Required JSON keys and their types
  • Score semantics (e.g., "Score is the raw C5 value. Lower is better.")
  • Invalid submission behavior

run.sh

A convenience script. Must pass at minimum:

  • --problem: path to INSTRUCTION.md
  • --hidden-eval-dir: path to the directory containing evaluate.py
  • --submission-format: path to SUBMISSION_FORMAT.md
  • --model: the model to use
  • Time budget flags: --propose-time-limit-per-session and --implement-time-limit-per-session

Example:

cd "$(dirname "$0")/../.."

uv run python -m src \
    --model glm-5.1 \
    --problem examples/my_problem/INSTRUCTION.md \
    --hidden-eval-dir examples/my_problem/hidden_eval_dir \
    --submission-format examples/my_problem/SUBMISSION_FORMAT.md \
    --initial-code examples/my_problem/initial.py \
    --propose-time-limit-per-session "20 minutes" \
    --implement-time-limit-per-session "120 minutes" \
    --max-num-approaches 3 \
    --max-loops 5 \
    --gpus auto \
    --adapter-mode "pty"

GPU selection defaults to --gpus auto. For CPU-only runs, pass --gpus none. For a Linux NVIDIA server, pass explicit IDs such as --gpus 0,1 if you want to restrict the run to a subset of GPUs.
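If you prefer to launch runs from Python rather than a shell script, the same command can be assembled programmatically. This is a sketch under assumptions: `build_command` is a hypothetical helper, the directory layout mirrors the example above, and only flags that appear in this README are used.

```python
import shlex


def build_command(problem_dir: str, model: str = "glm-5.1", gpus: str = "auto") -> list[str]:
    """Assemble the minimum set of flags listed above, plus --gpus."""
    return [
        "uv", "run", "python", "-m", "src",
        "--model", model,
        "--problem", f"{problem_dir}/INSTRUCTION.md",
        "--hidden-eval-dir", f"{problem_dir}/hidden_eval_dir",
        "--submission-format", f"{problem_dir}/SUBMISSION_FORMAT.md",
        "--propose-time-limit-per-session", "20 minutes",
        "--implement-time-limit-per-session", "120 minutes",
        "--gpus", gpus,
    ]


# CPU-only run for a hypothetical problem directory.
cmd = build_command("examples/my_problem", gpus="none")
print(shlex.join(cmd))

# To actually start the run from the repository root:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with values such as "20 minutes".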

💡 Useful Tips

Best practices for new problems

  • Design evaluators defensively: consider obvious reward-hacking paths, invalid outputs, hidden-test leakage, tolerance abuse, filesystem side effects, and score tampering.
  • Include the current SOTA, best known score, or target score in INSTRUCTION.md so agents know what result they are trying to beat.
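To make the first tip concrete, here is a sketch of defensive validation for a circle-packing-style problem. It is not the evaluator shipped in examples/circle_packing: the unit-square setting, the `(x, y, r)` triple format, and the tolerance value are assumptions chosen for illustration.

```python
import math

TOL = 1e-12  # hypothetical tolerance; keep it tight so agents cannot exploit slack


def validate_circles(circles, n_expected: int) -> tuple[bool, str]:
    """Check that circles is a list of n_expected disjoint (x, y, r) triples in the unit square."""
    if not isinstance(circles, list) or len(circles) != n_expected:
        return False, f"expected a list of {n_expected} circles"
    parsed = []
    for i, c in enumerate(circles):
        if not (isinstance(c, (list, tuple)) and len(c) == 3):
            return False, f"circle {i}: expected [x, y, r]"
        if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in c):
            return False, f"circle {i}: non-numeric entry"
        try:
            x, y, r = (float(v) for v in c)
        except OverflowError:
            return False, f"circle {i}: value out of range"
        # NaN makes every comparison below False, so reject it explicitly.
        if not all(math.isfinite(v) for v in (x, y, r)):
            return False, f"circle {i}: non-finite value"
        if r <= 0:
            return False, f"circle {i}: radius must be positive"
        if x - r < -TOL or x + r > 1 + TOL or y - r < -TOL or y + r > 1 + TOL:
            return False, f"circle {i}: outside the unit square"
        parsed.append((x, y, r))
    for i in range(len(parsed)):
        for j in range(i + 1, len(parsed)):
            xi, yi, ri = parsed[i]
            xj, yj, rj = parsed[j]
            if math.hypot(xi - xj, yi - yj) < ri + rj - TOL:
                return False, f"circles {i} and {j} overlap"
    return True, "ok"
```

The evaluator recomputes everything from the submitted geometry and never trusts a score reported by the agent, which closes off the simplest form of score tampering.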

Monitor & snapshots

  • Live monitor — starts automatically in the background during a run (disable with --no-monitor, pick a port with --monitor-port). It prints a Web monitor: http://127.0.0.1:<port> URL you can open in a browser.
  • Static snapshot — when a run finishes, a self-contained monitor_snapshot.html is written into the run directory so you can review it after the server is gone.
  • Historical snapshot — regenerate one fully offline (no eureka process, server, or Docker needed — it reads only from disk):
# Latest run is auto-selected when you point at the runs/ parent:
uv run python -m src.monitor.server --runs-dir runs --snapshot
# ...or a specific run:
uv run python -m src.monitor.server --run-dir runs/<run_id> --snapshot

The snapshot is written into the run's directory as monitor_snapshot.html — open it in a browser.

Docker runtime model

EurekAgent runs in Docker mode by default. Each run uses two containers:

  • Agent container: runs Claude Code sessions and sees the run workspace at /workspace.
  • Grader container: runs the secure evaluation server and also sees /workspace, so it can read submitted files and write official results.

The hidden evaluator directory (hidden_eval_dir) is mounted only into the grader container, read-only, at /hidden_eval. It is not mounted into the agent container, so agent code can submit candidates and receive scores but cannot directly read or modify the private evaluator.
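The mount layout described above can be pictured as the volume arguments each container would receive. This is only an illustration of the isolation model using standard `docker run -v` syntax; the helper `mount_args` is hypothetical and does not reflect how EurekAgent actually constructs its containers.

```python
def mount_args(run_workspace: str, hidden_eval_dir: str, role: str) -> list[str]:
    """Return docker -v arguments for the given container role ("agent" or "grader")."""
    if role not in ("agent", "grader"):
        raise ValueError(f"unknown role: {role}")
    # Both containers see the run workspace at /workspace.
    args = ["-v", f"{run_workspace}:/workspace"]
    if role == "grader":
        # Only the grader sees the private evaluator, and only read-only.
        args += ["-v", f"{hidden_eval_dir}:/hidden_eval:ro"]
    return args


print(mount_args("/runs/demo", "/problems/demo/hidden_eval_dir", "agent"))
print(mount_args("/runs/demo", "/problems/demo/hidden_eval_dir", "grader"))
```

The agent's argument list contains no reference to the hidden evaluator path, which is what prevents agent code from reading or modifying it.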

The host/controller uses the project .venv, while containers use a persistent Linux venv under .eureka_docker/venvs/... mounted as /workspace/.venv. Delete .eureka_docker/venvs to force recreation of the container Python environment.

📊 Results

EurekAgent achieves strong results across mathematics, kernel engineering, and machine learning tasks. It sets new state-of-the-art results on all evaluated mathematics and kernel engineering tasks, and ranks first by medal rate on our seven-task MLE-Bench subset. On the three mathematical optimization tasks, each run used less than $17 in API cost.

| Domain | Task | Previous Best AI | EurekAgent |
| --- | --- | --- | --- |
| Mathematics | Circle Packing (↑) | 2.635986 | 2.635999 |
| Mathematics | Erdős' Min. Overlap (↓) | 0.380876 | 0.380870 |
| Mathematics | 1st Autocorr. Ineq. (↓) | 1.502863 | 1.502861 |
| Kernel Engineering | TriMul (↓) | 2247.78 μs | 2005.03 μs |
| Machine Learning | MLE-Bench subset (↑) | 71.43% | 85.71% |
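For readers who want the table in relative terms, the short calculation below derives the TriMul speedup from the two latencies. It also shows that the two MLE-Bench percentages are numerically consistent with 5 of 7 and 6 of 7 tasks; that reading is an inference from the seven-task subset size, not a figure stated in this README.

```python
# Latencies from the table above, in microseconds.
previous_us = 2247.78
eurekagent_us = 2005.03

speedup = previous_us / eurekagent_us
reduction_pct = 100 * (previous_us - eurekagent_us) / previous_us
print(f"TriMul: {speedup:.3f}x speedup, {reduction_pct:.1f}% lower latency")

# Assumption: medal rate = medal-winning tasks / 7 tasks in the subset.
for medals in (5, 6):
    print(f"{medals}/7 = {100 * medals / 7:.2f}%")
```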

🤝 Contributing

Contributions are welcome! Whether it's bug reports, feature ideas, or pull requests — every bit helps. For substantial changes, please open an issue first to discuss the design, and keep changes focused, documented, and covered by relevant tests when possible.

We especially welcome contributions for Windows support and additional CLI-agent adapters, such as Codex.

How to Contribute

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

💼 Commercial Licensing

This project is licensed under AGPL-3.0. For commercial licensing inquiries, please contact: xin-x25@mails.tsinghua.edu.cn or xiaojn25@mails.tsinghua.edu.cn

📚 Citation

If you find EurekAgent useful for your research, please cite our paper:

@misc{xin2026eurekagent,
  title = {EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery},
  author = {Amy Xin and Jiening Siow and Junjie Wang and Zijun Yao and Fanjin Zhang and Jian Song and Lei Hou and Juanzi Li},
  year = {2026},
  eprint = {2606.13662},
  archivePrefix = {arXiv},
  primaryClass = {cs.AI},
  url = {https://arxiv.org/abs/2606.13662}
}
