DorothyDUUU/SWE-Dev

💻 SWE‑Dev: Evaluating and Training Autonomous Feature‑Driven Software Development

🎯 SWE‑Dev is the first large‑scale benchmark and training corpus for feature‑driven development (FDD) — the real‑world task of adding new functionality to existing codebases. It ships 14 000 training and 500 test tasks, each with a runnable environment and developer‑written unit tests, enabling both supervised fine‑tuning and reinforcement learning from executable rewards.

📄 Dataset Overview


✨ Highlights

  • 🌍 Real‑world FDD tasks drawn from mature open‑source projects.
  • ⚙️ End‑to‑end reproducibility – every task bundles source, deps, Dockerfile & tests.
  • 🤖 RL‑ready – deterministic pass/fail reward signals from pytest.
  • 💪 Challenging – Claude‑3.7‑Sonnet reaches only 22.45 % Pass@3 on the hard split.
  • 📈 Effective for model improvement – fine‑tuning a 7 B model on SWE‑Dev yields GPT‑4o‑level performance on the hard split.

🚀 Getting Started

1. 🛠️ Installation

```bash
conda create -n swe-dev python=3.12.0
conda activate swe-dev

# bleeding-edge
git clone https://github.com/DorothyDUUU/SWE-Dev-dataset.git
cd SWE-Dev-dataset
pip install -r requirements.txt
```

2. 📥 Download the dataset & build the evaluation environment

Download dataset:

python dataset/download_data.py --dest ./data

The script organises the dataset as:

data/
 ├── train/
 │   ├── level1/
 │   ├── level2/
 │   └── level3/
 └── test/
     ├── Easy/
     └── Hard/
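
After downloading, it can be useful to check what actually landed on disk. The following is a minimal sketch, assuming each sample ships a file whose name ends in `-metadata.json` (as in the example path used later in this README) and that the dataset was downloaded to `./data`. The helper names `find_metadata_files` and `count_by_split` are hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def find_metadata_files(root):
    """Return every *-metadata.json file below `root`, sorted for a stable order."""
    return sorted(Path(root).rglob("*-metadata.json"))


def count_by_split(root):
    """Count metadata files per top-level directory under `root` (e.g. train, test)."""
    root = Path(root)
    counts = Counter()
    for path in find_metadata_files(root):
        # The first path component below root is treated as the split name.
        counts[path.relative_to(root).parts[0]] += 1
    return dict(counts)
```

Calling `count_by_split("./data")` returns a dictionary keyed by split name; the actual counts depend on what the download script fetched.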

Docker installation: The train and test sets are built from different packages, so the packages are installed in separate Docker images.

Test docker (requires at least 10 GB of storage for the Docker image):

python download_docker.py --split test

Train docker (requires at least 100 GB of storage for the Docker image):

python download_docker.py --split train

Docker image for each sample: Each sample runs in the image named f"{package_name}-image", where package_name is the value of the package_name field in the sample's metadata.

For instance, the metadata file data/test/advertools-test_ad_create-level1-metadata.json has package_name set to advertools, so the Docker image for this sample is advertools-image.
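
The naming rule above can be sketched in a few lines of Python. This assumes the metadata file is plain JSON with a top-level `package_name` key, as described above; the function name `image_name_for_sample` is hypothetical.

```python
import json
from pathlib import Path


def image_name_for_sample(metadata_path):
    """Read a sample's metadata JSON and return the name of its Docker image."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    # The image name is the package name followed by the "-image" suffix.
    return f"{metadata['package_name']}-image"
```

For the advertools example above, this would return `advertools-image`.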

Build evaluation API: To support RL training, we wrap the Docker-based tests in an API server so that they can be invoked conveniently later.
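
To illustrate the idea of an executable reward, here is a minimal sketch of how a pass/fail signal could be derived from a pytest run inside a sample's image. It is not the repository's API server. The function names, the `test_path` argument, and the assumption that `python -m pytest` can be invoked directly in the image are all hypothetical. It relies only on the documented behaviour that pytest exits with code 0 when all collected tests pass, and that `docker run` propagates the exit code of the containerised command.

```python
import subprocess


def build_test_command(image, test_path):
    """Build a `docker run` command that runs pytest on `test_path` inside `image`."""
    # --rm removes the container once the test run finishes.
    return ["docker", "run", "--rm", image, "python", "-m", "pytest", "-q", test_path]


def reward_from_exit_code(exit_code):
    """Map a pytest exit code to a binary reward: 1.0 only if all tests passed."""
    # Any non-zero code (failures, errors, no tests collected) counts as a failure.
    return 1.0 if exit_code == 0 else 0.0


def run_and_score(image, test_path, timeout=600):
    """Run the tests in a container and return the pass/fail reward."""
    try:
        result = subprocess.run(build_test_command(image, test_path), timeout=timeout)
    except subprocess.TimeoutExpired:
        return 0.0  # treat a hung test run as a failure
    return reward_from_exit_code(result.returncode)
```

Keeping the exit-code mapping separate from the Docker call makes the reward logic easy to unit-test without a running Docker daemon.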

3. ⏱️ Quick Inference

Single-Agent Inference: To test your own model, run the following command:

bash SWE-Dev-dataset/infer/single/run.sh

Multi-Agent Inference: We also integrate inference for 10 multi-agent systems on the SWE-Dev dataset through the MASLab framework. Please refer to infer/MAS/README-MAS.md.

| No. | Methodology | Venue | Role | Topo. | Tool | Generalization |
|---|---|---|---|---|---|---|
| 1 | Reflexion | NeurIPS 2023 | Fixed | Fixed | No | Yes |
| 2 | Self-Consistency | ICLR 2024 | Fixed | Fixed | No | Yes |
| 3 | LLM Debate | ICML 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 4 | MAD | EMNLP 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 5 | Self-Refine | NeurIPS 2024 | Fixed | Fixed | No | Yes |
| 6 | AgentVerse | ICLR 2024 | Dynamic | Fixed | No | Yes |
| 7 | MetaGPT | ICLR 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 8 | ChatDev | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 9 | MapCoder | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 10 | EvoMAC | ICLR 2025 | Dynamic | Dynamic | Yes | Coding-Specific |

4. Fine‑tuning

  1. 👤 Single-Agent SFT: We use Llama-Factory for training; please refer to train/single_agent_SFT.yaml for the training parameters. The SFT dataset will be released on Hugging Face.

  2. 👤 Single-Agent RL: Coming soon...

  3. 👥 Multi-Agent SFT: Coming soon...


🏆 Leaderboard

📊 We maintain a leaderboard covering:

| Category | #Methods | Easy Best Pass@1 | Hard Best Pass@1 |
|---|---|---|---|
| Chat LLMs | 17 | 54.37 % | 19.13 % |
| Reasoning LLMs | 10 | 51.21 % | 22.51 % |
| Multi‑Agent Systems | 10 | - | - |
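
For readers unfamiliar with the metric, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per task and c the number that pass all tests. The sketch below implements that common estimator; whether the leaderboard uses exactly this formula is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the task, c: samples that passed, k: budget (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of this value over all tasks.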

Single LLM


📢 News

[20250908] 🎉 Our benchmark is used by Kimi-K2! See the Twitter post.

[20250601] 🎉 Released the inference script and Docker images for both the test and train splits!

[20250522] 📄 Released the preprint version! See the preprint.


✍️ Citation

If you use SWE‑Dev, please cite:

@article{du2025swedev,
  title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
  author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
  journal={arXiv preprint arXiv:2505.16975},
  year={2025}
}

📝 License

Code and dataset are released under the Apache 2.0 license. See the LICENSE file for details.

🙏 Acknowledgements

We thank MASLab for contributing the multi-agent system inference framework, and Llama-Factory and Verl for providing the training frameworks.

About

Official code space for "SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development"
