Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Arbor is an autonomous research agent that turns a long-horizon objective into a cumulative search. Give it a benchmark and a goal; it proposes hypotheses, edits code, runs real experiments, learns from the results, and keeps the improvements that hold up on held-out data. Instead of one-shot attempts that forget what failed, Arbor grows a hypothesis tree: every idea becomes a branch — pruned if it fails, harvested if it works — and insights propagate back so later ideas start smarter.

For more details, visit our project page and read the paper. For a more detailed usage manual, see our documentation. 🧭 You can also choose the CLI or Skill version depending on your environment and workflow.

💡 Why Arbor

General-purpose optimization — From model training and harness engineering to data synthesis, Arbor can optimize any task as long as it has a target to improve and a metric to measure progress.
Practical agent runtime — Arbor is not only a research prototype; it ships with both a native CLI runtime and an Agent Skill Suite for Codex and Claude Code, so you can use the full CLI for the strongest Arbor behavior or load the skill suite inside another coding agent.
Long-horizon structured exploration — The hypothesis-tree framework lets Arbor keep running as a cumulative search: results, failure modes, and distilled insights persist in the Idea Tree and propagate upward, so later ideas start smarter instead of being lost in a scrollback buffer.
Real experiment discipline — Executors iterate on a dev split, validate on a held-out test split, and only merge gains that clear a configurable margin, reducing overfitting to the metric being optimized.
Isolated, reversible execution — Every experiment runs in its own git worktree on a dedicated branch, so your main branch is never touched until you choose to merge.
Built for long experiments — Long-running training is first-class, with generous timeouts, partial-metric recovery on timeout, and optional staged budgets from smoke to pilot to full runs.
Model and workflow flexibility — Arbor supports Anthropic, OpenAI / Responses API, and OpenAI-compatible backends through LiteLLM, including DeepSeek, Gemini, Qwen, vLLM, Ollama, and local gateways.
Steerable and adaptable — A live terminal dashboard, read-only WebUI, optional human-in-the-loop review, and one-line domain plugins let you steer experiments without changing Arbor's core code.

🧩 Framework

Arbor runs two cooperating agents:

Coordinator — the research director. It maintains the Idea Tree, drives the search via the arbor cycle, and dispatches experiments.
Executor — the research engineer. Given one idea, it faithfully implements the code changes, runs the experiment in an isolated git worktree, and reports evidence.

Together they repeat a six-step arbor cycle:

Observe — the Coordinator re-grounds itself in the Idea Tree, reading the active frontier, constraints, ancestor insights, recent evidence, and current best artifact.
Ideate — it chooses a parent node and proposes child hypotheses that refine, correct, or extend what the tree has already learned.
Select — it chooses the most promising pending leaves to test, balancing the current best direction with unresolved alternatives.
Dispatch — selected hypotheses are sent to independent Executors, which implement them in fresh worktrees and evaluate them on the dev signal.
Backpropagate — Arbor records each result, score, insight, and branch, then abstracts the lesson upward so ancestor nodes and future ideas inherit it.
Decide — the Coordinator chooses whether to merge, prune, continue, leave a node pending, or stop, using held-out validation for merge decisions.
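
The six steps above can be sketched as a single function. This is a toy illustration, not Arbor's implementation: `Hypothesis`, `run_cycle`, the `priority` field, and the `ideate` / `execute` callables are all hypothetical names introduced here, and the final step compares a single score instead of re-checking on a held-out split.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    name: str
    priority: float                     # hypothetical ranking signal
    status: str = "pending"             # pending | merged | pruned
    dev_score: Optional[float] = None


def run_cycle(
    frontier: List[Hypothesis],
    best_score: float,
    ideate: Callable[[List[Hypothesis], float], List[Hypothesis]],
    execute: Callable[[Hypothesis], float],
    insights: List[str],
) -> float:
    """Run one Observe-to-Decide pass; return the (possibly improved) best score."""
    # 1. Observe: collect the hypotheses that are still open.
    pending = [h for h in frontier if h.status == "pending"]

    # 2. Ideate: propose new children, informed by the frontier and current best.
    proposed = ideate(frontier, best_score)
    frontier.extend(proposed)
    pending.extend(proposed)
    if not pending:
        return best_score

    # 3. Select: take the highest-priority pending hypothesis.
    chosen = max(pending, key=lambda h: h.priority)

    # 4. Dispatch: an Executor would implement and evaluate it in a worktree.
    chosen.dev_score = execute(chosen)

    # 5. Backpropagate: record the evidence so later ideation can use it.
    insights.append(f"{chosen.name} scored {chosen.dev_score:.1f}")

    # 6. Decide: keep the change only if it beats the current best.
    #    (Simplification: Arbor's merge decision uses held-out validation.)
    if chosen.dev_score > best_score:
        chosen.status = "merged"
        return chosen.dev_score
    chosen.status = "pruned"
    return best_score
```

Calling `run_cycle` repeatedly with the same `frontier` and `insights` lists is what makes the search cumulative: each pass sees everything earlier passes tried.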

🎬 Demo

demo.mp4

🚀 CLI And Skill Versions

This repository includes two ways to use Arbor:

Version	Location	Best for	Recommendation
Native CLI runtime	Python package and `arbor` command	Real Arbor research runs, long experiments, dashboard, checkpoints, executor tools, merge/test discipline, plugins, reports	Recommended. This path is more complete, more reliable, and gives the best Arbor behavior.
Agent Skill Suite	`skills/`	Codex or Claude Code environments where you want Arbor-style behavior without running the native Arbor runtime	Useful integration layer and fallback, but less complete than the CLI runtime.

If you can run the CLI, use the CLI. The native arbor runtime contains the full implementation: intake, Research Contract, live dashboard, EventBus, checkpoint/resume, executor dispatch, protected dev/test evaluation discipline, SearchAgent, plugins, and final report generation.

The repo-root skills/ directory is a Codex/Claude Code skill suite. After installation, invoke $arbor-research-agent in Codex or /arbor-research-agent in Claude Code and describe your research objective as you would in Arbor. The skill suite performs Arbor-style clarification first when target, metric, data, permissions, budget, or run mode are unclear, then loads the orchestrator and phase skills. This is separate from the internal runtime skills stored under src/skills/.

📦 Install

Requirements: Python ≥ 3.10 and Git. A virtual environment is recommended.

git clone https://github.com/RUC-NLPIR/Arbor.git
cd Arbor
python -m venv .venv && source .venv/bin/activate   # recommended
pip install -e .                                    # or: uv pip install -e .
arbor doctor                                        # verify PATH, git, API keys

Prefer a global command? pipx install -e . makes arbor available everywhere. To browse the docs site locally, run pip install -e ".[docs]" && mkdocs serve, or read the docs online.

⚡ Getting Started

arbor setup       # one-time: configure provider / model / base_url / API key
arbor             # start an interactive session in the current directory
arbor doctor      # diagnose the install

arbor setup writes ~/.arbor/config.yaml, so day-to-day you can just run arbor with no flags. The first thing Arbor does is an intake conversation that turns your goal, target directory, metric, baseline, budget, dev/test discipline, and artifact paths into a one-screen Arbor Research Contract. Once you confirm it, the live dashboard takes over.

# Point at a benchmark directory and a config
arbor --cwd ./benchmark --config research_config.yaml

# Give an initial goal up front; intake refines the rest
arbor "improve validation score without touching the test split" --cwd ./benchmark

# Small dry run
arbor --cwd ./benchmark --config research_config.yaml --max-cycles 3

During a run you can type /status, /tree, /evidence, /branches, /cost, /pause, /resume, /report, or /abort.

Prepare a benchmark

Your target directory should have:

a runnable evaluation script (e.g. run_eval.py),
evaluation data (ideally a dev split and a held-out test split), and
a clean git repository (no uncommitted changes).
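
If you want to check these requirements yourself before starting a run, a small script along these lines may help. It is a sketch, not part of Arbor: `preflight` and `is_clean` are hypothetical helpers, and `run_eval.py` is only the example script name used above. The only external call is `git status --porcelain`, which prints nothing when the working tree is clean.

```python
import subprocess
from pathlib import Path
from typing import List


def is_clean(porcelain_output: str) -> bool:
    """`git status --porcelain` prints one line per change; empty means clean."""
    return porcelain_output.strip() == ""


def preflight(target: Path, eval_script: str = "run_eval.py") -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems: List[str] = []

    if not (target / eval_script).is_file():
        problems.append(f"missing evaluation script: {eval_script}")

    if not (target / ".git").exists():
        problems.append("not a git repository")
    else:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=target,
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode != 0 or not is_clean(result.stdout):
            problems.append("git working tree has uncommitted changes")

    return problems


if __name__ == "__main__":
    for problem in preflight(Path("./benchmark")):
        print("preflight:", problem)
```

arbor doctor remains the supported way to diagnose the install; this script only covers the benchmark directory itself.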

A minimal research_config.yaml:

# LLM/API live in `arbor setup`; project config is usually just the task and budget.
task: >
  Optimize the agent's accuracy on the benchmark.
  Do NOT modify the evaluation harness or data files.

coordinator:
  max_cycles: 10          # arbor cycles to explore
  max_depth: 2            # Idea Tree depth
  merge_threshold: 5.0    # min held-out % gain to merge into trunk
  ui:
    interaction_mode: review   # auto | direction | review | collaborative

executor:
  max_turns: 100

A copy-pasteable example with every option lives in examples/research_config.example.yaml.

🧠 How It Works

The arbor cycle

Each cycle runs six steps:

① OBSERVE   analyze current results and failure modes
② IDEATE    propose 1–3 new ideas from the analysis and tree insights
③ SELECT    pick the highest-priority idea to test
④ DISPATCH  run an Executor on it in an isolated git worktree
⑤ BACKPROP  record the result; abstract the insight up to ancestor nodes
⑥ DECIDE    continue / merge into trunk / prune / stop

The Idea Tree

ROOT (baseline: 20%)
├── 1: Retrieval optimization        [insight: "retrieval quality is the bottleneck"]
│   ├── 1.1: Constraint decomposition + verification   [40%, merged]
│   ├── 1.2: Periodic re-read injection                [40%, pruned — no net gain]
│   └── 1.3: Answer-extraction tuning                  [35%, pruned]
├── 2: Multi-perspective search      [insight: "search scaffolding hurts here"]
│   └── 2.1: Breadth-first search                      [25%, pruned]
└── 3: Code-level intervention       [insight: "code-level > prompt-level"]
    ├── 3.1: Continuation injection                    [70%, merged]
    └── 3.2: ANSWER-tag extraction                     [45%, done]

Depth 0 (Root): the research objective and global insights.
Depth 1: research directions (paper-title-level ideas).
Depth 2+: concrete methods, implemented and tested by Executors.
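
One way to picture the tree is as a parent-linked node structure in which an insight recorded at a leaf is copied onto every ancestor. The sketch below is illustrative only: `IdeaNode` and its methods are hypothetical names, not Arbor's actual classes, and real insights are abstracted by the Coordinator instead of being copied verbatim.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class IdeaNode:
    node_id: str
    title: str
    parent: Optional[IdeaNode] = field(default=None, repr=False)
    score: Optional[float] = None
    status: str = "pending"             # pending | merged | pruned | done
    insights: List[str] = field(default_factory=list)
    children: List[IdeaNode] = field(default_factory=list, repr=False)

    @property
    def depth(self) -> int:
        """0 for the root, 1 for research directions, 2+ for concrete methods."""
        return 0 if self.parent is None else self.parent.depth + 1

    def add_child(self, node_id: str, title: str) -> IdeaNode:
        child = IdeaNode(node_id, title, parent=self)
        self.children.append(child)
        return child

    def backpropagate(self, insight: str) -> None:
        """Attach an insight to every ancestor, nearest first."""
        node = self.parent
        while node is not None:
            if insight not in node.insights:
                node.insights.append(insight)
            node = node.parent

    def ancestor_insights(self) -> List[str]:
        """Everything a new child of this node would inherit."""
        collected: List[str] = []
        node: Optional[IdeaNode] = self
        while node is not None:
            collected.extend(node.insights)
            node = node.parent
        return collected


root = IdeaNode("ROOT", "baseline")
retrieval = root.add_child("1", "Retrieval optimization")
method = retrieval.add_child("1.1", "Constraint decomposition + verification")
method.score, method.status = 40.0, "merged"
method.backpropagate("retrieval quality is the bottleneck")
```

After the last line, both `retrieval` and `root` carry the insight, so a sibling such as 1.2 would see it through `retrieval.ancestor_insights()` before any code is written.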

Git strategy & evaluation

Each Executor works in its own worktree on a dedicated branch. Verified improvements merge into a per-run trunk; you promote trunk into main only when satisfied (git merge research/run_xxx/trunk). Executors iterate on a dev split, but a change is kept only if it clears a margin on the held-out test split — guarding against overfitting.
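
The merge rule and the worktree setup can be sketched in a few lines. Treat everything here as an assumption made for illustration: `should_merge` reads merge_threshold as an absolute gain in the metric's own units, and the per-node branch name and worktree path are invented; only the research/<run>/trunk naming and the `git worktree add -b <branch> <path> <commit-ish>` syntax come from the text above and from git itself.

```python
def should_merge(
    trunk_test: float,
    candidate_test: float,
    merge_threshold: float = 5.0,
    higher_is_better: bool = True,
) -> bool:
    """Keep a change only if its held-out gain clears the margin."""
    gain = candidate_test - trunk_test
    if not higher_is_better:
        gain = -gain                      # for loss-like metrics, lower is a gain
    return gain >= merge_threshold


def worktree_add_command(run_name: str, node_id: str) -> list[str]:
    """Build the git command that gives one experiment its own checkout."""
    trunk = f"research/{run_name}/trunk"
    branch = f"research/{run_name}/node-{node_id}"      # hypothetical naming
    path = f".arbor/worktrees/{run_name}/{node_id}"     # hypothetical location
    # New branch `branch`, checked out at `path`, starting from the run's trunk.
    return ["git", "worktree", "add", "-b", branch, path, trunk]


if __name__ == "__main__":
    print(should_merge(trunk_test=40.0, candidate_test=46.0))
    print(" ".join(worktree_add_command("run_xxx", "1.1")))
```

Because the decision uses the held-out score and not the dev score the Executor iterated on, a change that merely overfits the dev split tends to fall below the margin and stay out of trunk.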

Human-in-the-loop

Set ui.interaction_mode (or --interaction-mode) to choose how much you steer:

Mode	Behavior
`auto`	Fully autonomous.
`direction`	Asks you where to go next at ideation.
`review`	Pauses for your review before each new node and each Executor run.
`collaborative`	`direction` + `review`.

When paused, your input opens an isolated discussion with a read-only companion — it never pollutes the Coordinator's context. See docs/ for the full method.
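
The four modes in the table combine two independent switches, which a flag enum expresses compactly. This is a sketch of the relationship only: `Steering`, `MODES`, and `parse_mode` are hypothetical names and do not describe how Arbor stores the setting.

```python
from enum import Flag, auto


class Steering(Flag):
    NONE = 0
    DIRECTION = auto()    # ask the user where to go next at ideation
    REVIEW = auto()       # pause before each node and Executor


# Mode names as accepted by ui.interaction_mode / --interaction-mode.
MODES = {
    "auto": Steering.NONE,
    "direction": Steering.DIRECTION,
    "review": Steering.REVIEW,
    "collaborative": Steering.DIRECTION | Steering.REVIEW,
}


def parse_mode(name: str) -> Steering:
    """Map a mode name to its switches, rejecting unknown names."""
    try:
        return MODES[name]
    except KeyError:
        allowed = ", ".join(MODES)
        raise ValueError(f"unknown interaction mode {name!r}; expected one of: {allowed}") from None


def asks_direction(mode: Steering) -> bool:
    return bool(mode & Steering.DIRECTION)


def pauses_for_review(mode: Steering) -> bool:
    return bool(mode & Steering.REVIEW)
```

For example, `parse_mode("collaborative")` turns both switches on, while `parse_mode("auto")` turns both off.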

⚙️ Configuration

LLM access is configured once with arbor setup (stored in ~/.arbor/config.yaml) via a single provider field — anthropic, openai (incl. any OpenAI-compatible Responses endpoint), or litellm for DeepSeek / Gemini / Qwen / vLLM / Ollama / local gateways. Keys come from the environment or the config; per-project task and budget settings live in research_config.yaml. See the configuration guide and examples/research_config.example.yaml for every option.

🧰 CLI Reference

Day to day you only need arbor:

Command	What it does
`arbor`	Start an interactive research session.
`arbor setup`	Configure provider / model / keys → `~/.arbor/config.yaml`.
`arbor report <session>`	Re-render `REPORT.md` for a past session.
`arbor doctor`	Diagnose install, PATH, git, and API keys.
`arbor version`	Print the installed version.

Lower-level entry points (run-research, coordinator, executor, review-research) remain for debugging — see the CLI reference.

🔌 Plugins & Skills

A single line retargets the agent to a new domain — evaluation protocol, protected data directories, required outputs, and timeout presets all come from the plugin:

plugin: mle_kaggle   # switches to Kaggle/MLE mode

A plugin is one YAML file (prompt-injection points + config overrides + profiles + lifecycle hooks + an eval contract); a Skill is a markdown playbook the agent loads on demand at runtime. A copy-pasteable Kaggle config lives in examples/kaggle_config.example.yaml.

💾 Output & Resume

Each run writes a session directory with REPORT.md, events.jsonl, run_stats.json, the Idea Tree, and per-experiment artifacts under .arbor/sessions/. Runs are resumable — interrupt with Ctrl+C and continue later with --resume; Arbor reloads the Idea Tree and picks up where it left off.

arbor report .arbor/sessions/<run_name>   # re-render a past report
arbor --resume --run-name <run_name>      # continue an interrupted run
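
Since events.jsonl is a JSON Lines file, it can be inspected with a few lines of Python once a run has finished. The schema of the events is not described here, so this sketch makes no assumption about it: `tally_events` is a hypothetical helper, and you pass in whichever key you find in your own file after looking at a few lines.

```python
import json
from collections import Counter
from pathlib import Path


def tally_events(path: Path, key: str) -> Counter:
    """Count JSON Lines records by the value stored under `key`.

    Each non-empty line is expected to be one JSON object. Records that
    lack `key` are counted under "<missing>".
    """
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue                  # tolerate blank lines
            record = json.loads(line)
            counts[str(record.get(key, "<missing>"))] += 1
    return counts
```

A quick tally like this is a reasonable first look at a session before opening REPORT.md, for instance to see how many records of each kind a run produced.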

📊 Results

Arbor was evaluated as a single controller across model training, harness engineering, and data synthesis; only the material, objective, evaluator, and budget change between tasks. It achieves the best held-out test result on all six tasks, ahead of the Codex and Claude Code single-agent baselines.

Task	Direction	Initial	Codex	Claude Code	Arbor	Gain
Optimizer Design	steps ↓	3325	3325	3287.5	3237.5	+2.63%
Architecture Design	loss ↓	1.098	1.083	1.033	1.028	+6.38%
Terminal-Bench 2.0	pass ↑	69.81	73.59	71.70	77.36	+7.55
BrowseComp	acc ↑	45.33	50.00	53.33	67.67	+22.34
Search-Agent Data	gap ↑	5.00	9.00	12.00	18.00	+13.0
Math-Reasoning Data	gap ↑	1.04	6.25	8.33	20.83	+19.79
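
The Gain column mixes two units, which the numbers themselves make clear: for the two "↓" tasks it matches the relative reduction from Initial to Arbor in percent, and for the four "↑" tasks it matches the absolute difference. The check below reproduces the column from the table; the `gain` function is a hypothetical helper written for this check, and the convention is inferred from the figures, not quoted from the paper.

```python
def gain(initial: float, final: float, direction: str) -> float:
    """Reproduce the Gain column: relative % for "down", absolute for "up"."""
    if direction == "down":
        return round((initial - final) / initial * 100, 2)
    return round(final - initial, 2)


# (task, direction, Initial, Arbor, Gain as printed in the table)
ROWS = [
    ("Optimizer Design", "down", 3325, 3237.5, 2.63),
    ("Architecture Design", "down", 1.098, 1.028, 6.38),
    ("Terminal-Bench 2.0", "up", 69.81, 77.36, 7.55),
    ("BrowseComp", "up", 45.33, 67.67, 22.34),
    ("Search-Agent Data", "up", 5.00, 18.00, 13.0),
    ("Math-Reasoning Data", "up", 1.04, 20.83, 19.79),
]

for task, direction, initial, arbor, printed in ROWS:
    computed = gain(initial, arbor, direction)
    unit = "%" if direction == "down" else " points"
    print(f"{task}: {computed}{unit} (table: {printed})")
```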

On MLE-Bench Lite with GPT-5.5, Arbor reaches 86.36% Any-Medal (100% valid submissions, 95.45% above median, 77.27% gold). See the paper for full protocols and ablations.

🗂️ Project Structure

The code lives in src/ and is imported as the arbor package.

src/                 # the `arbor` package
├── core/            Shared infrastructure: ReAct loop, tools, LLM providers, context mgmt
├── executor/        Executor agent + `executor` CLI
├── coordinator/     Coordinator agent, Idea Tree, orchestrator, coordinator tools
├── cli/             `arbor` CLI: intake, live dashboard, setup, doctor, config
├── events/          Typed event bus and payloads
├── report/          Report generation
├── webui/           Read-only run-monitoring web server
├── plugins/         Domain plugins (e.g. mle_kaggle.yaml)
├── skills/          On-demand markdown playbooks
├── dashboard.py     HTML dashboard generator
├── run.py           `run-research` CLI
└── review.py        `review-research` CLI

🙏 Acknowledgements

Arbor is built on the excellent foundation of claw-code.

claw-code is an open-source Rust reimplementation of Claude Code. It provided the REPL framework, tool-calling infrastructure, and cross-platform compilation that made Arbor's CLI possible. Huge thanks to the ultraworkers team for their outstanding work.

🔗 claw-code: https://github.com/ultraworkers/claw-code

📚 Citation

@misc{jin2026arbor,
  title  = {Toward Generalist Autonomous Research via Hypothesis-Tree Refinement},
  author = {Jiajie Jin and Yuyang Hu and Kai Qiu and Qi Dai and Chong Luo and
            Guanting Dong and Xiaoxi Li and Tong Zhao and Xiaolong Ma and
            Gongrui Zhang and Zhirong Wu and Bei Liu and Zhengyuan Yang and
            Linjie Li and Lijuan Wang and Hongjin Qian and Yutao Zhu and Zhicheng Dou},
  year   = {2026},
  eprint = {2606.11926},
  archivePrefix = {arXiv},
  url    = {https://arxiv.org/abs/2606.11926}
}

📄 License

Released under the Apache License 2.0.

Built at the Gaoling School of Artificial Intelligence, Renmin University of China, and Microsoft Research.
