We introduce Genome-Bench, a novel benchmark for evaluating and improving scientific reasoning in large language models. Genome-Bench consists of over 3,000 multiple-choice and QA items derived from CRISPR-related scientific discussions and forum threads, covering key topics in genome engineering, experimental design, and error analysis.
Our RL training pipeline (based on Group Relative Policy Optimization) improves model performance across expert-labeled evaluation sets. For example, our fine-tuned Qwen2.5-7B model exceeds GPT-4o in accuracy and consistency on multi-hop reasoning tasks.
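To make the "group relative" part of GRPO concrete, here is a minimal sketch of how advantages are typically computed: several answers are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` value, and the 0/1 reward scheme are illustrative assumptions, not taken from this repository's training code, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division finite when every sample gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one multiple-choice question,
# scored 1.0 when the chosen option is correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers in the group receive a positive advantage and incorrect ones a negative advantage. A group in which every answer scores the same yields all-zero advantages, so it contributes no learning signal.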
git clone https://github.com/mingyin0312/RL4GenomeBench.git
cd RL4GenomeBench
pip install -r requirements.txt

We provide tools to parse .mbox email archives and convert them into standardized MCQ and QA formats.
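As a rough illustration of what the parsing stage involves, the sketch below reads an .mbox archive with Python's standard `mailbox` module and yields the subject and plain-text body of each message. It is not the repository's `1_email_parse.py`; the function name and the choice to keep only `text/plain` parts are assumptions made for the example.

```python
import mailbox


def iter_plain_text_messages(mbox_path):
    """Yield (subject, body) pairs for each message in an .mbox archive."""
    for message in mailbox.mbox(mbox_path):
        subject = message.get("Subject", "")
        body_parts = []
        # walk() also visits the message itself when it is not multipart
        for part in message.walk():
            if part.get_content_type() != "text/plain":
                continue
            payload = part.get_payload(decode=True)  # bytes, or None
            if payload is None:
                continue
            charset = part.get_content_charset() or "utf-8"
            try:
                text = payload.decode(charset, errors="replace")
            except LookupError:
                # Unknown charset label in the header: fall back to UTF-8
                text = payload.decode("utf-8", errors="replace")
            body_parts.append(text)
        yield subject, "\n".join(body_parts).strip()
```

Downstream steps such as thread reconstruction or conversion to MCQ items would consume these pairs; those steps are left out here.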
cd dataset_pipeline
python 1_email_parse.py
python 2_convert_MCQ_full.py
python 3_dataset_prepare.py
python 4_convert_natural_question.py

python training/rl_training.py
python training/sft_training.py
python training/rl_router_training.py

To evaluate on the Genome-Bench test data:
python evaluation/genome-bench_eval.py

@article{yin2025genome,
title={Toward Scientific Reasoning in LLMs: Training from Expert Discussions via Reinforcement Learning},
author={Yin, Ming and Qu, Yuanhao and Ling, Yang and Cong, Le and Wang, Mengdi},
journal={arXiv preprint arXiv:2505.19501},
year={2025}
}

This project leverages the 🤗 Transformer Reinforcement Learning (TRL) library, which provides powerful tools for fine-tuning large language models with reinforcement learning techniques such as GRPO.
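For readers unfamiliar with how GRPO training in TRL is driven by a reward signal, the following is a minimal sketch of a multiple-choice accuracy reward. It follows the general shape TRL expects of a custom reward function (a callable that receives the sampled `completions` plus dataset columns as keyword arguments and returns one float per completion). The `answer` column name, the `Answer: X` output convention, and the function itself are assumptions for illustration, not this repository's actual reward.

```python
import re

# Assumed output convention: the model states its choice as "Answer: B" or "Answer: (B)"
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-D])\b")


def mcq_accuracy_reward(completions, answer, **kwargs):
    """Return 1.0 for each completion whose final stated option matches the gold letter."""
    rewards = []
    for completion, gold in zip(completions, answer):
        matches = ANSWER_PATTERN.findall(completion)
        # Use the last stated answer, in case the model revises itself mid-response
        predicted = matches[-1] if matches else None
        rewards.append(1.0 if predicted == gold.strip().upper() else 0.0)
    return rewards


# With TRL installed, a function like this would typically be passed as
# reward_funcs=mcq_accuracy_reward when constructing a GRPOTrainer.
```

A binary reward like this is simple to verify but gives no partial credit; completions that never state an answer in the assumed format score 0.0.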

