💻 SWE‑Dev: Evaluating and Training Autonomous Feature‑Driven Software Development

🎯 SWE‑Dev is the first large‑scale benchmark and training corpus for feature‑driven development (FDD) — the real‑world task of adding new functionality to existing codebases. It ships 14 000 training and 500 test tasks, each with a runnable environment and developer‑written unit tests, enabling both supervised fine‑tuning and reinforcement learning from executable rewards.

✨ Highlights

🌍 Real‑world FDD tasks drawn from mature open‑source projects.
⚙️ End‑to‑end reproducibility – every task bundles source, deps, Dockerfile & tests.
🤖 RL‑ready – deterministic pass/fail reward signals from pytest.
💪 Challenging – Claude‑3.7‑Sonnet reaches only 22.45 % Pass@3 on the hard split.
📈 Effective for model improvement – fine‑tuning a 7B model on SWE‑Dev yields GPT‑4o‑level performance on the hard split.
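The pass/fail reward mentioned above can be sketched as a thin wrapper around pytest. This is only an illustration, not the repository's actual reward code: the function names and the timeout value are hypothetical. It relies on the documented pytest behaviour that exit code 0 means every collected test passed.

```python
import subprocess
import sys


def reward_from_exit_code(exit_code: int) -> float:
    """Map a pytest exit code to a binary reward (0 means all tests passed)."""
    return 1.0 if exit_code == 0 else 0.0


def run_tests(test_path: str, timeout_s: int = 600) -> float:
    """Run pytest on `test_path` and return the binary reward.

    `timeout_s` is an assumed safety limit, not a value from SWE-Dev.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", test_path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat a hanging test run as a failure
    return reward_from_exit_code(proc.returncode)
```

Any non-zero exit code (failed tests, collection errors, no tests collected) maps to a reward of 0, which keeps the signal deterministic for a fixed environment.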

🚀 Getting Started

1. 🛠️ Installation

```bash
conda create -n swe-dev python=3.12.0
# activate the new environment so the requirements install into it
conda activate swe-dev

# bleeding-edge
git clone https://github.com/DorothyDUUU/SWE-Dev-dataset.git
cd SWE-Dev-dataset
pip install -r requirements.txt
```

2. 📥 Download the dataset & Build evaluation environment

Download dataset:

python dataset/download_data.py --dest ./data

The script organises the dataset as:

data/
 ├── train/
 │   ├── level1/
 │   ├── level2/
 │   └── level3/
 └── test/
     ├── Easy/
     └── Hard/
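To check that the download produced the layout above, a small script can count what sits in each split directory. This is a sketch that assumes only the directory names shown in the tree; the helper name `count_entries` is hypothetical, and the script makes no assumption about the file names inside each level.

```python
from pathlib import Path


def count_entries(data_root: str) -> dict[str, int]:
    """Count the entries directly inside each <split>/<level> directory."""
    counts: dict[str, int] = {}
    root = Path(data_root)
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for level_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            key = f"{split_dir.name}/{level_dir.name}"
            counts[key] = sum(1 for _ in level_dir.iterdir())
    return counts
```

Calling `count_entries("./data")` after the download returns a dictionary keyed by names such as `train/level1` or `test/Hard`.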

Docker installation: The train and test sets are built from different packages, so the packages are installed in separate Docker images.

Test Docker images (require at least 10 GB of storage space):

python download_docker.py --split test

Train Docker images (require at least 100 GB of storage space):

python download_docker.py --split train

Docker image for each sample: Each sample runs in the image named f"{package_name}-image", where package_name is the value of the package_name field in the sample's metadata.

For instance, the sample data/test/advertools-test_ad_create-level1-metadata.json has package_name advertools, so its Docker image is advertools-image.
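The naming rule can be written as a short helper. This sketch assumes only what is stated above, namely that the metadata file is JSON and contains a `package_name` field; the function name `image_for_sample` is hypothetical.

```python
import json
from pathlib import Path


def image_for_sample(metadata_path: str) -> str:
    """Return the Docker image name for one sample's metadata file."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    # The image name is the package name followed by "-image".
    return f"{metadata['package_name']}-image"
```

For a metadata file whose `package_name` is `advertools`, the helper returns `advertools-image`.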

Build evaluation API: For later use in RL training, we wrap the Docker-based test execution in an API server, which can be built conveniently when needed.
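Independently of the API server, the underlying idea of running a sample's tests inside its image can be sketched with a plain `docker run` command. Everything specific here is an assumption made for illustration: the `/workspace` mount point, the availability of pytest inside the image, and the helper name `docker_test_command`. Only standard `docker run` flags (`--rm`, `-v`, `-w`) are used.

```python
import os


def docker_test_command(image: str, repo_dir: str, test_file: str) -> list[str]:
    """Build a `docker run` command that runs one test file inside `image`.

    The /workspace mount point is a hypothetical choice, not a SWE-Dev
    convention; bind mounts need an absolute host path.
    """
    host_dir = os.path.abspath(repo_dir)
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/workspace",  # mount the edited repository
        "-w", "/workspace",              # run from the repository root
        image,
        "python", "-m", "pytest", "-q", test_file,
    ]
```

The resulting list can be passed to `subprocess.run`; a return code of 0 then indicates that the tests passed inside the container.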

3. ⏱️ Quick Inference

Single-Agent Inference: To test your own model, run the following command:

bash SWE-Dev-dataset/infer/single/run.sh

Multi-Agent Inference: We also integrate inference for 10 multi-agent systems (MAS) on the SWE-Dev dataset through the MASLab framework. Please refer to infer/MAS/README-MAS.md.

| No. | Methodology | Venue | Role | Topology | Tool | Generalization |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Reflexion | NeurIPS 2023 | Fixed | Fixed | No | Yes |
| 2 | Self-Consistency | ICLR 2024 | Fixed | Fixed | No | Yes |
| 3 | LLM Debate | ICML 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 4 | MAD | EMNLP 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 5 | Self-Refine | NeurIPS 2024 | Fixed | Fixed | No | Yes |
| 6 | AgentVerse | ICLR 2024 | Dynamic | Fixed | No | Yes |
| 7 | MetaGPT | ICLR 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 8 | ChatDev | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 9 | MapCoder | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 10 | EvoMAC | ICLR 2025 | Dynamic | Dynamic | Yes | Coding-Specific |

4. Fine‑tuning

👤 Single-Agent SFT: We use Llama-Factory for training; please refer to train/single_agent_SFT.yaml for the training parameters. The SFT dataset will be released on Hugging Face.
👤 Single-Agent RL: Coming soon...
👥 Multi-Agent SFT: Coming soon...

🏆 Leaderboard

📊 We maintain a leaderboard covering:

| Category | # Methods | Easy Best Pass@1 | Hard Best Pass@1 |
| --- | --- | --- | --- |
| Chat LLMs | 17 | 54.37 % | 19.13 % |
| Reasoning LLMs | 10 | 51.21 % | 22.51 % |
| Multi-Agent Systems | 10 | - | - |
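For readers unfamiliar with the Pass@k metric reported here, one common way to compute it is the unbiased estimator popularised by the HumanEval evaluation: with n sampled solutions per task, of which c pass all tests, Pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. Whether SWE-Dev uses exactly this estimator is an assumption; the function names below are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one task: n samples drawn, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@k over tasks, given (n, c) pairs per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = k = 1 this reduces to the fraction of tasks whose single sample passes, which matches the usual reading of Pass@1.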

📢 News

[20250908] 🎉 Our benchmark is used by Kimi-K2 (Twitter)!

[20250601] 🎉 Released the inference script and Docker images for both the test and train splits!

[20250522] 📄 Released the preprint version! See the preprint.

✍️ Citation

If you use SWE‑Dev, please cite:

@article{du2025swedev,
  title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
  author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
  journal={arXiv preprint arXiv:2505.16975},
  year={2025}
}

📝 License

Code and dataset are released under the Apache 2.0 license. See the LICENSE file for details.

🙏 Acknowledgements

We thank MASLab for contributing the multi-agent system inference framework, and Llama-Factory and Verl for providing the training frameworks.
