DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
🌐 Website | 📑 Paper | 🤗 Dataset | 🐥 Submission
- 2026-01-15: 🔥 We release DeepResearchEval and the paper.
- We introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation.
- For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles, and applies a two-stage filter (Task Qualification and Search Necessity) to retain only tasks that require multi-source evidence integration and external retrieval.
- For evaluation, we propose an agentic pipeline with two components: Adaptive Point-wise Quality Evaluation, which dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and Active Fact-Checking, which autonomously extracts and verifies report statements via web search, even when citations are missing (see the scoring sketch below).
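As a rough illustration of the point-wise quality evaluation, the sketch below shows how task-specific dimension weights can be combined into a single score on a 10-point scale. The dimension names, weights, and judge stubs are placeholders for illustration, not the framework's actual rubric or prompts.

```python
# Illustrative sketch of Adaptive Point-wise Quality Evaluation.
# The rubric and judge below are placeholder stubs, not the actual implementation.

def derive_rubric(task: str) -> dict[str, float]:
    """Stand-in for the LLM step that derives task-specific dimensions and weights."""
    return {"comprehensiveness": 0.4, "evidence_quality": 0.35, "clarity": 0.25}

def judge_dimension(task: str, report: str, dimension: str) -> float:
    """Stand-in for the judge LLM scoring one dimension on a 0-10 scale."""
    return 8.0

def pointwise_score(task: str, report: str) -> float:
    rubric = derive_rubric(task)
    total_weight = sum(rubric.values())
    # Weighted average of per-dimension scores.
    return sum(w * judge_dimension(task, report, d) for d, w in rubric.items()) / total_weight

print(pointwise_score("example task", "example report"))  # -> 8.0
```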
Overview of deep research systems' performance on our benchmark. The left panel reports quality evaluation results across deep research systems, with Gemini-2.5-Pro achieving the highest score (8.51/10). The right panel reports factual correctness, where Manus achieves the highest ratio of correct statements (82.3%).
For installation, we recommend using uv with Python >= 3.10.
# Clone the repo
git clone https://github.com/Infinity-AILab/DeepResearchEval.git
cd DeepResearchEval
# Install dependencies and create virtual environment
uv sync
# Activate the virtual environment
source .venv/bin/activate
After activation, you can run Python commands directly without the uv run prefix.
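Optionally, you can verify that the activated interpreter satisfies the Python >= 3.10 requirement with a quick check:

```python
# Quick sanity check that the active interpreter meets the Python >= 3.10 requirement.
import sys

assert sys.version_info >= (3, 10), f"Python 3.10+ required, found {sys.version}"
print("Environment OK:", sys.version)
```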
Generate expert-level tasks that require deep web search and information synthesis.
# Run complete pipeline
python task_generation/main.py --output_file ./task_generation/outputs/deep_research_tasks.jsonl --model_name gpt-5-mini
For detailed usage, parameters, and examples, see task_generation/README.md.
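Once the pipeline finishes, the output is a JSONL file with one task per line. The snippet below loads and inspects it; the exact field names depend on the output schema, so check the first record before relying on specific keys.

```python
# Load the generated tasks and peek at the first record.
# Field names depend on the pipeline's output schema; inspect before relying on keys.
import json

with open("./task_generation/outputs/deep_research_tasks.jsonl", encoding="utf-8") as f:
    tasks = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(tasks)} tasks")
print(json.dumps(tasks[0], indent=2, ensure_ascii=False))
```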
For installation:
cd point_quality
pip install -r requirements.txt
For usage:
# To use google/gemini-2.5-pro-preview as the judge LLM
export OPENROUTER_API_KEY="your_openrouter_api_key"
cd point_quality
python example_pointwise_usage.py
When running the script, the judging process follows this logic:
- If criteria_cache.json, dimensions_cache.json, and weights_cache.json already exist in ./point_quality/outputs/cache/, the script will directly reuse the cached criteria, dimensions, and weights to perform point-wise judging.
- Otherwise, the script will first generate task-specific dimensions, criteria, and weights, cache them under ./point_quality/outputs/cache/, and then proceed with the judging process (see the snippet below to force regeneration).
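To force regeneration of the task-specific dimensions, criteria, and weights, you can simply delete the cached files before rerunning the script (a minimal sketch; adjust the path if you relocated the cache):

```python
# Delete the cached criteria, dimensions, and weights to force regeneration.
from pathlib import Path

cache_dir = Path("./point_quality/outputs/cache")
for name in ("criteria_cache.json", "dimensions_cache.json", "weights_cache.json"):
    cache_file = cache_dir / name
    if cache_file.exists():
        cache_file.unlink()
        print(f"Removed {cache_file}")
```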
The point-wise evaluation is configured via a YAML file located at:
./point_quality/deepresearcharena/config/pointwise.yaml
You can modify the judge LLM settings under the evaluator_model field in the configuration file, including the model name and related parameters (e.g., temperature, max tokens).
The models (or methods) to be evaluated are specified under the target_models field. For example, if your evaluation results are stored in ./data/method_results/aaa/ and ./data/method_results/bbb/, you should configure:
target_models:
- "aaa"
- "bbb"
For active fact-checking, we implement a fact-checking agent based on MiroFlow.
We recommend using uv with Python >= 3.10.
Step 1: Prepare the Python environment:
# Set up the agent environment
cd factual_eval/apps/run-agent
uv sync
Step 2: Set up the required API keys:
cd factual_eval/apps/run-agent
vim .env
# Set the API KEY
# OPENROUTER_API_KEY (OpenRouter provides the primary agent model)
# OPENAI_API_KEY (for OpenAI models)
# SERPER_API_KEY (for Google search and website scraping)
Step 3: Run the fact-checking evaluation:
cd factual_eval/apps/run-agent
uv run batch_test.py --json_dir ../../../data/method_results/gemini_2.5_pro # replace with your results directory
# or run the evaluation in the background and record logs to a log file:
bash batch_fact.sh
The configurations for the framework, agent, and LLM (default: gpt-5-mini) are defined under:
./factual_eval/libs/miroflow/src/miroflow/prebuilt/config
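Before launching a long run, it can help to confirm how many report files will be picked up from the target directory. The sketch below assumes one JSON file per report inside the --json_dir directory (as the example path above suggests):

```python
# Count the JSON report files that the fact-checking run will process.
# Assumes one JSON file per report inside the --json_dir directory.
from pathlib import Path

results_dir = Path("./data/method_results/gemini_2.5_pro")  # replace with your directory
report_files = sorted(results_dir.glob("*.json"))
print(f"Found {len(report_files)} report files in {results_dir}")
```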
You can find more details about our active fact-checking in factual_eval/README.md.
We thank MiroFlow and DAComp for their open-source contributions.
If you find our work helpful, please cite it as:
@misc{wang2026deepresearchevalautomatedframeworkdeep,
title={DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation},
author={Yibo Wang and Lei Wang and Yue Deng and Keming Wu and Yao Xiao and Huanjin Yao and Liwei Kang and Hai Ye and Yongcheng Jing and Lidong Bing},
year={2026},
eprint={2601.09688},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.09688},
}
