
A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports

arXiv Preprint 2025

Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li,
Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, Yingchun Wang

Shanghai Artificial Intelligence Laboratory, The University of Hong Kong,
Fudan University, University of British Columbia, University of Toronto,
Tsinghua University, Shanghai Jiao Tong University,
Hong Kong University of Science and Technology, Peking University


🧠 Abstract

Artificial intelligence is undergoing a paradigm shift from closed language models to interconnected agent systems capable of external perception and information integration. As a representative embodiment, Deep Research Agents (DRAs) exhibit capabilities for task decomposition, cross-source retrieval, multi-stage reasoning, and structured output, which markedly enhance performance on complex and open-ended tasks. However, existing benchmarks remain deficient in evaluation dimensions, response formatting, and scoring mechanisms, limiting their capacity to assess such systems effectively. This paper introduces a rigorous benchmark and a multidimensional evaluation framework tailored to DRAs and report-style responses. The benchmark comprises 214 expert-curated challenging queries distributed across 10 broad thematic domains, each accompanied by manually constructed reference bundles to support composite evaluation. The framework enables comprehensive evaluation of long-form reports generated by DRAs, incorporating integrated scoring metrics for semantic quality, topical focus, and retrieval trustworthiness. Extensive experimentation confirms the superior performance of mainstream DRAs over web-search-tool-augmented reasoning models, yet reveals considerable scope for further improvement. This study provides a robust foundation for capability assessment, architectural refinement, and paradigm advancement in DRA systems.
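
For orientation, the sketch below shows how scores along the three evaluation dimensions named above could be combined into one composite value. It is a minimal illustrative example only: the metric names, weight values, and weighted-sum aggregation are assumptions made here for exposition and are not the scoring implementation from the paper or this repository.

# Illustrative sketch only: names, weights, and aggregation are assumptions,
# not the benchmark's actual scoring code.
from dataclasses import dataclass

@dataclass
class ReportScores:
    semantic_quality: float   # judged quality of the report's content, in [0, 1]
    topical_focus: float      # how well the report stays on the query topic, in [0, 1]
    retrieval_trust: float    # trustworthiness of the retrieved/cited sources, in [0, 1]

def composite_score(s: ReportScores, weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted aggregation of the three dimensions into a single score."""
    w_sem, w_top, w_ret = weights
    return w_sem * s.semantic_quality + w_top * s.topical_focus + w_ret * s.retrieval_trust

# Example: a report judged 0.8 / 0.7 / 0.9 on the three dimensions scores 0.80.
print(composite_score(ReportScores(0.8, 0.7, 0.9)))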

🧪 Installation

git clone https://github.com/EVIGBYEN/RigorousBench.git
cd RigorousBench
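
If a Python environment is needed for the evaluation scripts, a typical setup might look like the following. This is a sketch under assumptions: the requirements.txt file name and the use of a virtual environment are not confirmed by the repository and may need to be adjusted to its actual layout.

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt   # assumes the repo ships a requirements.txt; adjust if it does not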

📚 Citation

@article{yao2025rigorous,
  title={A Rigorous Benchmark with Multidimensional Evaluation for Deep Research Agents: From Answers to Reports},
  author={Yao, Yang and Wang, Yixu and Zhang, Yuxuan and Lu, Yi and Gu, Tianle and Li, Lingyu and Zhao, Dingyi and Wu, Keming and Wang, Haozhe and Nie, Ping and others},
  journal={arXiv preprint arXiv:2510.02190},
  year={2025}
}
