Paper Link: https://arxiv.org/abs/2601.10504
Project Website: https://inlp-lab.github.io/DR-Arena/
DR-Arena is a dynamic, automated benchmarking framework designed to evaluate the limits of Deep Research (DR) Agents. Unlike static benchmarks, DR-Arena constructs real-time Information Trees from live web trends and employs an Adaptive Evolvement Loop to dynamically escalate task complexity, testing agents on both Deep Reasoning (deduction depth) and Wide Coverage (information aggregation).
Figure 1: Overview of the DR-Arena Framework. The Automated Examiner generates questions based on the Information Tree, and the Evolvement Loop dynamically adjusts task difficulty.
- Dynamic Information Trees: Automatically constructs navigable website structures (web_tree) mimicking real-world logic and noise, ensuring evaluation against the live state of the web.
- Automated Examiner: A specialized agent that synthesizes "Deep & Wide" questions based on the hidden topology of the generated trees, ensuring questions require multi-hop reasoning.
- Adaptive Evolvement Loop: A state-machine controller that monitors agent performance in real time. When agents reach a stalemate, the system evolves the environment by increasing Tree Depth (logic complexity) or Width (search scope).
- Automated Arena: A tournament system supporting 1v1 battles between DR agents with automatic Elo rating calculation.
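To make the escalation mechanism concrete, here is a minimal sketch of a stalemate-triggered evolvement loop. The class and method names (EvolvementLoop, record_result, evolve) and the alternating depth/width policy are illustrative assumptions, not the actual API in core/evolvement_loop.py:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeSpec:
    depth: int   # deduction depth ("Deep Reasoning")
    width: int   # search scope ("Wide Coverage")

class EvolvementLoop:
    """Escalates task complexity once agents reach a stalemate."""

    def __init__(self, spec: TreeSpec, stalemate_threshold: int = 3):
        self.spec = spec
        self.stalemate_threshold = stalemate_threshold
        self.consecutive_draws = 0

    def record_result(self, winner: Optional[str]) -> None:
        # A draw (no winner) counts toward a stalemate; a decisive
        # result resets the counter.
        self.consecutive_draws = 0 if winner else self.consecutive_draws + 1
        if self.consecutive_draws >= self.stalemate_threshold:
            self.evolve()
            self.consecutive_draws = 0

    def evolve(self) -> None:
        # Alternate between deepening (logic complexity) and widening
        # (search scope) -- one possible escalation policy.
        if self.spec.depth <= self.spec.width:
            self.spec.depth += 1
        else:
            self.spec.width += 1
```

The actual controller presumably bases its stalemate detection on the Examiner's judgments rather than a simple draw counter; the sketch only shows the control flow.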
DR-Arena/
├── core/ # Core Arena Logic
│ ├── agents.py # Search Agent Wrappers
│ ├── api_client.py # OpenRouter API Client
│ ├── evolvement_loop.py # Adaptive Evolvement Loop (State Machine)
│ ├── examiner.py # Automated Examiner (Question Generation & Judging)
│ ├── score_utils.py # Elo Scoring & Win-rate Calculation
│ ├── tracker.py # Token Usage Tracking
│ └── utils.py # General Utilities
├── web_tree/ # Dynamic Information Tree Generator
│ ├── data/ # Dataset Storage & Trends Source
│ ├── models/ # Data Models (WebsiteNode, LinkContext)
│ ├── utils/ # Generator Utilities (Crawler, LLM Agent, Validator)
│ ├── crawl.py # Base Crawler Entry Point
│ ├── expand_tree.py # Tree Expansion Tool
│ ├── generate_dataset.py # Automated Tree Generation Script
│ ├── test_components.py # Component Verification Script
│ ├── visualize.py # Tree Topology Visualization Tool
│ ├── QUICKSTART.md # Generator Quick Start Guide
│ └── README.md # Generator Documentation
├── tournament_cli.py # Tournament Command Line Interface
├── main.py # Single-Match Entry Point
├── config.py # Configuration Template
└── requirements.txt # Project Dependencies
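The ranking side of the Arena (score_utils.py above) computes Elo ratings from battle outcomes. A minimal sketch using the standard logistic Elo formula follows; the actual implementation may differ in K-factor or tie handling:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple:
    """score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

For example, two agents starting at 1200 move to roughly 1216 and 1184 after a single decisive 1v1 battle with K = 32.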
To ensure the reproducibility of the experiments reported in our paper, we have retained the 30 specific Information Trees used in our evaluation.
- Location: web_tree/data/dataset/trees/
- Description: These 30 trees serve as the standardized "Hidden Knowledge" environments for the competition. They cover diverse domains (e.g., Technology, Finance, Science) and vary in initial topology, providing a consistent baseline for comparing new agents against the results reported in the paper.
Figure 2: An example of a "Deep & Wide" Information Tree topology used in the evaluation.
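To inspect one of the retained trees programmatically, a sketch like the following can be used. The nested "children" key is an assumed schema, not the documented format of the tree JSON files; adapt it to the actual fields (see web_tree/models/):

```python
import json

def load_tree(path: str) -> dict:
    """Load a stored Information Tree from its JSON file."""
    with open(path) as f:
        return json.load(f)

def count_nodes(node: dict) -> int:
    """Recursively count nodes, assuming a nested 'children' list."""
    return 1 + sum(count_nodes(c) for c in node.get("children", []))

# Usage (path per the location above):
#   tree = load_tree("web_tree/data/dataset/trees/tree_0001.json")
#   print("total nodes:", count_nodes(tree))
```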
Requires Python 3.9+.
cd DR-Arena
pip install -r requirements.txt
Add your API keys in config.py or export them as environment variables.
To replicate the full benchmarking process (Pairing -> Battling -> Ranking):
# 1. Initialize Pairing (Round 1)
python tournament_cli.py --action pair --round 1
# 2. Start Battles (Supports multi-worker parallelization)
python tournament_cli.py --action battle --worker_id 0 --total_workers 1
# 3. Calculate Elo Rankings
python tournament_cli.py --action rank
To run a specific battle between two agents on a specific tree:
python main.py
Follow the interactive prompts to select a tree from web_tree/data/ and define the contending agents.
If you wish to expand the benchmark with fresh data, use the web_tree module. This process scrapes live web data to build new Information Trees.
Note: For detailed instructions, please refer to web_tree/README.md and web_tree/QUICKSTART.md.
Ensure your API keys are set in config.py or exported as environment variables:
export ANTHROPIC_API_KEY='your-key-here' # For Query Generation & Selection
export SERPAPI_API_KEY='your-key-here' # For Google Search
Before generating a large dataset, run the test suite to ensure all APIs are working:
python web_tree/test_components.py
This will verify the Trends Parser, LLM Agent, Search API, and Crawler connectivity.
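A quick programmatic pre-flight for the same two keys can also be run before a long generation job. This is a sketch assuming the keys come from the environment (config.py may supply them instead):

```python
import os

# Keys required by the generator, per the export commands above.
REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "SERPAPI_API_KEY"]

def missing_keys(env: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Usage:
#   missing = missing_keys(dict(os.environ))
#   if missing:
#       raise SystemExit(f"Missing API keys: {', '.join(missing)}")
```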
Run the generation script. You can customize the number of trees and their complexity:
cd web_tree
# Generate 50 new valid trees with a max depth of 3
python generate_dataset.py --target 50 --max-depth 3
Parameters:
- --target: Number of valid trees to generate (default: 100)
- --max-depth: Maximum crawl depth (default: 3)
- --max-children: Maximum links per node (default: 10)
- --crawl-delay: Safety delay between requests (default: 1.0s)
You can inspect the topology of any generated tree using the visualization tool:
# Interactive Menu
python visualize.py data/dataset/trees/tree_0001.json
# Export a Clustered View (Recommended)
python visualize.py data/dataset/trees/tree_0001.json --style clustered
If you use DR-Arena in your research, please cite our paper:
@misc{gao2026drarenaautomatedevaluationframework,
title={DR-Arena: an Automated Evaluation Framework for Deep Research Agents},
author={Yiwen Gao and Ruochen Zhao and Yang Deng and Wenxuan Zhang},
year={2026},
eprint={2601.10504},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.10504},
}