Hi, nice to meet you!
I am Botao Yu (余博涛), a third-year PhD student at The Ohio State University, fortunately advised by Prof. Huan Sun. Previously, I earned my Master’s degree at Nanjing University.
My research focuses on language agents, tool-integrated reasoning, and scientific discovery. I build and evaluate agentic frameworks and systems for complex problem-solving, with expertise in agent architecture design, tool learning, and multi-level evaluation methodologies. I am particularly interested in developing LLM agents that can use and create tools to solve complex problems in both general and scientific domains.
I also have experience in natural language processing, information extraction, music understanding and generation, and computational chemistry, which provides me with diverse perspectives and transferable skills for tackling new research challenges.
🚀 Seeking a research scientist internship for summer 2026!
🌟 Featured Projects
SAGA
ChemToolAgent
Mind2Web 2
LlaSMol
🔥 News
- 2025.12: Check out our new preprint SAGA, an autonomous agent framework that automates objective function design for scientific discovery through a bi-level architecture.
- 2025.12: Check out our new preprint Scientific Discovery Evaluation (SDE), a scenario-grounded benchmark for evaluating LLMs in scientific discovery across biology, chemistry, materials, and physics.
- 2025.10: Check out our new preprint Holistic Agent Leaderboard (HAL), addressing key challenges in AI agent evaluation through a standardized evaluation harness and three-dimensional analysis across models, scaffolds, and benchmarks.
- 2025.10: Our paper AutoSDT won the Best Paper Award at the LLM for Scientific Discovery workshop @ COLM 2025 🎉🏆.
- 2025.09: Our paper Mind2Web 2 is accepted to NeurIPS 2025 🎉.
- 2025.09: Our paper LARC is accepted to AIAS 2025 and selected for the Best Paper Award 🎉🏆.
- 2025.09: Check out our new preprint LARC, an agentic framework for constrained retrosynthesis planning.
- 2025.08: Our paper AutoSDT is accepted to EMNLP 2025 🎉.
- 2025.06: Check out our new preprint Mind2Web 2, a benchmark for evaluating agentic search with agent-as-a-judge.
- 2025.06: Check out our new preprint AutoSDT, an automated pipeline for generating high-quality scientific coding tasks.
- 2025.06: Check out 🛠️ChemMCP, our newly released, MCP-compatible chemistry toolkit for LLMs and AI assistants. Let’s build it together!
- 2025.05: Check out our new preprint Topic Association Analysis, where we investigated why LLMs misclassify benign comments as toxic from the topic association bias perspective.
- 2025.05: Our paper MMMU-Pro is accepted to ACL 2025.
- 2025.03: Our ChemAgent has been renamed to ChemToolAgent. Check out the new version with more experimental results on arXiv.
- 2025.01: Our paper ChemAgent is accepted to NAACL 2025 Findings.
- 2025.01: Our paper ScienceAgentBench is accepted to ICLR 2025.
- 2024.11: Please check out our new preprint ChemAgent, an enhanced chemistry agent evaluated on a wide range of chemistry problems.
- 2024.10: Please check out our new preprint ScienceAgentBench, a benchmark for assessing language agents on data-driven scientific tasks.
- 2024.09: Check out our new preprint MMMU-Pro, an enhanced version of MMMU featuring full-vision evaluation.
- 2024.07: Our paper LlaSMol is accepted to COLM 2024 🎉!
- 2024.05: Our paper MMMU is selected as Oral (0.8%) and nominated for best paper (24 in total) at CVPR 2024 🎊!
- 2024.02: Please check out our preprint LlaSMol, where we propose a large-scale chemistry instruction tuning dataset and a series of chemistry LLMs.
- 2023.08: Arrived at Columbus. My PhD journey officially starts 😋!
- 2023.05: Please check out our preprint MuseCoco, a text-to-music generation system.
- 2022.09: Our paper Museformer is accepted to NeurIPS 2022 🎉!
📝 Publications
- [Preprint 2025] Accelerating Scientific Discovery with Autonomous Goal-evolving Agents
  SAGA, a generalist agentic framework that automates objective planning for scientific discovery. It employs a bi-level architecture: an outer loop of LLM agents proposes new objectives, converts them into scoring functions, and analyzes optimization outcomes, while an inner loop performs solution optimization. Applied to antibiotic design, materials design, DNA sequence design, and chemical process design, the results show that automating objective formulation can substantially improve the effectiveness of scientific discovery agents.
- [Preprint 2025] Evaluating Large Language Models in Scientific Discovery
  A scenario-grounded benchmark for evaluating LLMs in scientific discovery across biology, chemistry, materials, and physics. The framework evaluates models at two levels: question-level accuracy on scenario-tied items, and project-level performance, where models must propose testable hypotheses, design experiments, and interpret results. It addresses gaps in existing benchmarks, which overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery.
- [Preprint 2025] Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
  A standardized evaluation infrastructure for AI agents featuring a parallel evaluation harness and three-dimensional analysis across models, scaffolds, and benchmarks, shifting the focus from benchmark performance to real-world reliability.
- [AIAS 2025] LARC: Towards Human-level Constrained Retrosynthesis Planning through an Agentic Framework
  🏆 Best Paper Award at AIAS 2025
  LARC, the first LLM-based agentic framework for retrosynthesis planning under constraints. It incorporates agentic constraint evaluation directly into the retrosynthesis planning process, using agentic feedback grounded in tool-based reasoning to guide and constrain route generation.
- [NeurIPS 2025] Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
  We introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1,000 hours of human labor.
- [EMNLP 2025] AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
  🏆 Best Paper Award at the LLM for Scientific Discovery workshop @ COLM 2025
  We introduce AutoSDT, an automated pipeline for generating high-quality coding tasks from real-world data-driven scientific workflows, addressing the data scarcity challenge in building AI co-scientists. Using AutoSDT, we create AutoSDT-5K, the largest open dataset of its kind, which enables significant performance gains on scientific discovery benchmarks.
- [Preprint 2025] Probing Association Biases in LLM Moderation Over-Sensitivity
  This paper investigates why large language models often misclassify benign comments as toxic, revealing that topic-level biases, rather than just offensive keywords, play a significant role. Using a novel Topic Association Analysis inspired by cognitive psychology, we uncover how LLMs' implicit associations influence moderation decisions.
- [NAACL 2025 Findings] ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving
  A systematic investigation into tool-augmented language agents. Using chemistry as a testbed, ChemToolAgent reveals fundamental insights about when and how tools help agents: tools do not always improve performance and can introduce new error modes, and whether they help depends on the specific task. We also release ChemMCP, an MCP-compatible toolkit for easily building chemistry co-scientists.
- [ICLR 2025] ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
  A benchmark for evaluating language agents on data-driven scientific discovery, comprising 102 tasks drawn from peer-reviewed publications and validated by experts. It reveals current limitations in code generation and highlights the need for rigorous task assessments.
- [ACL 2025] MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
  An enhanced version of MMMU featuring full-vision evaluation for multi-discipline multimodal understanding.
- [COLM 2024] LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset
  An investigation into adapting LLMs to specialized domains through instruction tuning. We demonstrate that careful data curation and task diversity matter more than scale, with our models significantly outperforming GPT-4 and Claude-3-Opus. These insights about domain adaptation generalize beyond chemistry to building capable agents in other specialized domains.
- [CVPR 2024 Oral] MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
  This paper proposes a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.
- [Preprint 2023] MuseCoco: Generating Symbolic Music from Text
  A two-stage text-to-music generation system for creating symbolic music from textual descriptions.
- [Preprint 2023] EmoGen: Eliminating Subjective Bias in Emotional Music Generation
  A method for generating emotional music while reducing subjective bias in the process.
- [NeurIPS 2022] Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation
  We propose a fine- and coarse-grained attention mechanism for modeling the structure of music.
- [ISMIR 2022] MeloForm: Generating Melody with Musical Form Based on Expert Systems and Neural Networks
  A system for generating melodies with musical form using a combination of expert systems and neural networks.
- [EMNLP 2021] Knowing False Negatives: An Adversarial Training Method for Distantly Supervised Relation Extraction
  An adversarial training method that improves distantly supervised relation extraction by addressing false negatives.
- [APWeb-WAIM 2020] Joint Reasoning of Events, Participants and Locations for Plot Relation Recognition
  A method for recognizing plot relations by jointly reasoning about events, participants, and locations in narratives.
📖 Education
- PhD student in Computer Science and Engineering @ The Ohio State University
  2023.08 - Now, Columbus, Ohio, USA
- Master's student in Computer Science @ Nanjing University (南京大学)
  2019.09 - 2023.06, Nanjing, Jiangsu, China
- Undergraduate student in Software Engineering @ Dalian University of Technology (大连理工大学)
  2015.09 - 2019.06, Dalian, Liaoning, China
- High school student @ The High School Attached To Hunan Normal University (湖南师大附中)
  2012.09 - 2015.06, Changsha, Hunan, China
👨🏻‍💻 Internship
- Research intern @ Microsoft Research Asia (微软亚洲研究院)
  2021.04 - 2022.03, Beijing, China
💻 Service
- 2025: Reviewer for ARR 2025 (Feb., May, July, Oct.), COLM 2025, NeurIPS 2025 SEA Workshop, ICLR 2026
- 2024: Reviewer for ICLR 2025, ARR 2024 (Dec.), AAAI 2025 AI4Research Workshop
Last updated: Dec 31, 2025