This is the repository for the paper *SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning* (NeurIPS 2025).

SiriuS is a self-improving multi-agent framework that continuously enhances reasoning ability by maintaining an experience library of successful trajectories and bootstrapping improved trajectories from failed ones.
We support three main multi-agent settings, each with its own directory:

- `Problem_solving/` – collaborative QA (College Physics/Chemistry, PubMedQA-style)
- `Actor_Critic/` – Actor + Judgment + Critic for iterative refinement
- `Competitive/` – negotiation / game-theoretic interactions
To get started:

```bash
git clone https://github.com/zou-group/sirius.git
cd sirius
conda create -n sirius python=3.10
conda activate sirius
conda env create -f environment.yml
```

Set your keys as environment variables or in a config file, as used by the codebase, for example:

```bash
export OPENAI_API_KEY=...
```
- `Problem_solving/` – pipelines for college-level reasoning & biomedical QA:
  - College Physics / College Chemistry
  - PubMedQA-style question answering (long context + question)
- `Actor_Critic/` – pipelines for the Actor–Judgment–Critic setting:
  - Actor proposes an answer
  - Judgment agent decides correct / incorrect
  - Critic writes feedback and guides regeneration
- `Competitive/` – pipelines for competitive games; each is a two-player turn-based game with utilities defined in the paper:
  - Resource Exchange
  - Sell & Buy
  - Ultimatum
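To give a flavor of how game utilities enter the reward signal, here is a minimal sketch of the standard Ultimatum payoff structure. This function is purely illustrative; the exact utilities used by SiriuS are defined in the paper.

```python
def ultimatum_payoffs(total: float, offer: float, accepted: bool) -> tuple[float, float]:
    """Standard Ultimatum game: the proposer offers `offer` out of `total`.
    If the responder accepts, the split stands; otherwise both get nothing.
    Illustrative only -- the paper defines the exact utilities used."""
    return (total - offer, offer) if accepted else (0.0, 0.0)

print(ultimatum_payoffs(10.0, 4.0, accepted=True))  # (6.0, 4.0)
```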
SiriuS operates on *trajectories*. A trajectory is the full interaction between agents for one task instance:

- Input question / context
- Intermediate messages from each agent (Physicist, Mathematician, Summarizer, Actor, Critic, etc.)
- Final answer(s) or game outcome
- Reward signal(s) (accuracy or utility)
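For concreteness, a single trajectory record could be represented as a Python dict like the one below. The field names here are hypothetical and the actual schema in the codebase may differ:

```python
# Hypothetical trajectory record (field names illustrative, not the repo's schema)
trajectory = {
    "question": "A 2 kg block slides down a frictionless incline ...",
    "messages": [  # intermediate messages from each agent, in order
        {"agent": "Physicist", "content": "By energy conservation ..."},
        {"agent": "Mathematician", "content": "Solving (1/2) m v^2 = m g h gives v = ..."},
        {"agent": "Summarizer", "content": "Final answer: v ≈ 6.3 m/s."},
    ],
    "final_answer": "v ≈ 6.3 m/s",
    "reward": 1.0,  # accuracy for QA tasks, utility for competitive games
}
```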
First, run the multi-agent system (with base models) on your tasks and log the full interaction.
A sample training dataset (for physics problem solving) is already provided at:

```
dataset/phy_train.jsonl
```

Each line of this file is one training example (e.g., one physics problem) that the multi-agent system will solve.
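To inspect the format, you can peek at the first example with the standard library; this assumes nothing about the schema beyond one JSON object per line:

```python
import json

# Each line of the .jsonl file is one JSON object (one training example)
with open("dataset/phy_train.jsonl") as f:
    first_example = json.loads(f.readline())

print(first_example.keys())  # shows the fields used by this dataset
```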
Put your training and eval data at:

```
dataset/{subject}_train.jsonl
dataset/{subject}_test.jsonl
```

Each subdirectory provides task-specific drivers to:

- Load the dataset
- Instantiate the appropriate agent graph (see the paper for structures), e.g. `Problem_solving/PhyChem/agent.py`
- Solve the problems and collect full trajectories
For example, for college physics:

```bash
python Problem_solving/PhyChem/get_a_sol.py --model='gpt-3.5-turbo' --task='MMLU_physics' --prompt_type='multi_agent' --mode='generate' --subject='phy'
python libs/merge.py
```

First, generate feedback for trajectories where the agents produced incorrect solutions:
```bash
python Problem_solving/PhyChem/get_b_feedback.py --model='gpt-3.5-turbo' --task='MMLU_physics' --prompt_type='multi_agent' --mode='generate' --subject='phy'
```

Then, regenerate improved trajectories conditioned on this feedback:
```bash
python Problem_solving/PhyChem/get_c_regenerate.py --model='gpt-3.5-turbo' --task='MMLU_physics' --prompt_type='multi_agent' --mode='generate' --subject='phy'
```

We use the OpenAI Supervised Fine-Tuning (SFT) API in our example, but you can plug in any fine-tuning framework of your choice using the constructed experience library.
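As a minimal sketch of that step with the OpenAI fine-tuning API (the training-file path below is hypothetical; point it at whatever file `get_finetune_data.py` produces in your run):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the SFT data built from the experience library
# (path is hypothetical -- use the file written by get_finetune_data.py)
training_file = client.files.create(
    file=open("dataset/phy_finetune.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a supervised fine-tuning job on the uploaded file
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```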
The provided scripts construct the fine-tuning data and launch the job:

```bash
python Problem_solving/PhyChem/get_finetune_data.py
python Problem_solving/PhyChem/fine_tune.py
```

If you find SiriuS useful, please cite:

```bibtex
@article{zhao2025sirius,
title={SiriuS: Self-improving Multi-agent Systems via Bootstrapped Reasoning},
author={Zhao, Wanjia and Yuksekgonul, Mert and Wu, Shirley and Zou, James},
journal={arXiv preprint arXiv:2502.04780},
year={2025}
}
```