🎯 SWE‑Dev is the first large‑scale benchmark and training corpus for feature‑driven development (FDD) — the real‑world task of adding new functionality to existing codebases. It ships 14 000 training and 500 test tasks, each with a runnable environment and developer‑written unit tests, enabling both supervised fine‑tuning and reinforcement learning from executable rewards.
- 🌍 Real‑world FDD tasks drawn from mature open‑source projects.
- ⚙️ End‑to‑end reproducibility – every task bundles source, deps, Dockerfile & tests.
- 🤖 RL‑ready – deterministic pass/fail reward signals from pytest.
- 💪 Challenging – Claude‑3.7‑Sonnet reaches only 22.45 % Pass@3 on the hard split.
- 📈 Effective for model improvement – fine‑tuning a 7 B model on SWE‑Dev yields GPT‑4o‑level performance on the hard split.
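
For readers unfamiliar with the Pass@k numbers quoted above, the sketch below shows the unbiased Pass@k estimator that is widely used for code-generation benchmarks. Whether SWE‑Dev computes its scores with exactly this estimator is an assumption; the function name `pass_at_k` and its parameters are illustrative and not part of this repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one task.

    n: number of samples generated for the task
    c: number of those samples that pass every unit test
    k: sampling budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no passing sample, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Three samples, one of which passes, evaluated at k = 3.
    print(pass_at_k(n=3, c=1, k=3))
```

A benchmark-level score would then be the mean of this value over all tasks in a split.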
```bash
conda create -n swe-dev python=3.12.0
conda activate swe-dev
# bleeding‑edge
git clone https://github.com/DorothyDUUU/SWE-Dev-dataset.git
cd SWE-Dev-dataset
pip install -r requirements.txt
```

Download dataset:

```bash
python dataset/download_data.py --dest ./data
```

The script organises the dataset as:

```
data/
├── train/
│   ├── level1/
│   ├── level2/
│   └── level3/
└── test/
    ├── Easy/
    └── Hard/
```
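
As a quick sanity check after downloading, the sketch below counts sample metadata files per split directory. It assumes that every sample has a file ending in `-metadata.json` somewhere under the data root (the suffix is taken from the example path later in this README); the helper name `count_tasks` is hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_tasks(data_root: str) -> dict:
    """Count `*-metadata.json` files, grouped by up to two leading directories."""
    root = Path(data_root)
    counts = Counter()
    for meta in root.rglob("*-metadata.json"):
        # Directory components of the file, relative to the data root.
        dirs = meta.relative_to(root).parts[:-1]
        # Group by e.g. "train/level1" or "test/Easy"; "." means the root itself.
        key = "/".join(dirs[:2]) or "."
        counts[key] += 1
    return dict(counts)


if __name__ == "__main__":
    for split, n in sorted(count_tasks("./data").items()):
        print(f"{split}: {n} samples")
```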
Docker Installation: The train and test sets are built from different packages, so their dependencies are installed in separate Docker images.

Test docker (requires at least 10 GB of storage for the Docker image): `python download_docker.py --split test`

Train docker (requires at least 100 GB of storage for the Docker image): `python download_docker.py --split train`

Docker Image for each sample:
The Docker image for each sample is named `f"{package_name}-image"`, where `package_name` is the value of the `package_name` field in the sample's metadata.
For instance, the sample data/test/advertools-test_ad_create-level1-metadata.json has `package_name` set to advertools, so its Docker image is advertools-image.
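
The naming rule above can be sketched in a few lines of Python. This assumes the metadata file is plain JSON with a top-level `package_name` key, as the description suggests; the helper name `image_name_for` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def image_name_for(metadata_path: str) -> str:
    """Return the Docker image name for one sample's metadata file."""
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    # Assumption: `package_name` is a top-level key of the metadata JSON.
    return f"{metadata['package_name']}-image"


if __name__ == "__main__":
    path = "data/test/advertools-test_ad_create-level1-metadata.json"
    print(image_name_for(path))
```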
Build evaluation API: To support RL training, we wrap the Docker-based test execution in an API server, which can be conveniently reused in later stages.
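
To illustrate how a pass/fail reward can be derived from a Docker-based pytest run, here is a minimal sketch. It is not the repository's API server: the function names, the `test_path` argument, and the timeout are assumptions, and copying or mounting the generated code into the container is omitted. It relies only on two documented behaviours: `docker run` propagates the exit code of the container's command, and pytest exits with 0 only when all collected tests pass.

```python
import subprocess


def pytest_command(image: str, test_path: str) -> list:
    """Build a `docker run` command that runs pytest inside the given image."""
    # `test_path` is a hypothetical path to the test file inside the container.
    return ["docker", "run", "--rm", image, "python", "-m", "pytest", "-q", test_path]


def reward_from_exit_code(exit_code: int) -> float:
    """Map a pytest exit code to a binary reward (0 means all tests passed)."""
    return 1.0 if exit_code == 0 else 0.0


def evaluate(image: str, test_path: str, timeout: int = 600) -> float:
    """Run the tests in Docker and return 1.0 on success, 0.0 otherwise."""
    try:
        proc = subprocess.run(
            pytest_command(image, test_path),
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a hung test run as a failure.
        return 0.0
    return reward_from_exit_code(proc.returncode)
```

Treating every non-zero exit code (failed tests, collection errors, no tests collected) as a zero reward keeps the signal deterministic and binary.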
Single-Agent Inference: To test your own model, run `bash SWE-Dev-dataset/infer/single/run.sh`.

Multi-Agent Inference: We also integrate inference for 10 multi-agent systems from the MASLab framework for the SWE-Dev dataset. Please refer to infer/MAS/README-MAS.md.

| No. | Methodology | Venue | Role | Topo. | Tool | Generalization |
|---|---|---|---|---|---|---|
| 1 | Reflexion | NeurIPS 2023 | Fixed | Fixed | No | Yes |
| 2 | Self-Consistency | ICLR 2024 | Fixed | Fixed | No | Yes |
| 3 | LLM Debate | ICML 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 4 | MAD | EMNLP 2024 | Fixed | Fixed | No | Pre-defined Roles |
| 5 | Self-Refine | NeurIPS 2024 | Fixed | Fixed | No | Yes |
| 6 | AgentVerse | ICLR 2024 | Dynamic | Fixed | No | Yes |
| 7 | MetaGPT | ICLR 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 8 | ChatDev | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 9 | MapCoder | ACL 2024 | Fixed | Fixed | Yes | Coding-Specific |
| 10 | EvoMAC | ICLR 2025 | Dynamic | Dynamic | Yes | Coding-Specific |

- 👤 Single-Agent SFT: We use Llama-Factory for training; please refer to train/single_agent_SFT.yaml for the training parameters. The SFT dataset will be released on Hugging Face.
- 👤 Single-Agent RL: Coming soon...
- 👥 Multi-Agent SFT: Coming soon...

📊 We maintain a leaderboard at covering:
| Category | #Methods | Easy Best Pass@1 | Hard Best Pass@1 |
|---|---|---|---|
| Chat LLMs | 17 | 54.37 % | 19.13 % |
| Reasoning LLMs | 10 | 51.21 % | 22.51 % |
| Multi‑Agent Systems | 10 | - | - |
[20250908] 🎉 Our benchmark is used by Kimi-K2 (Twitter)!
[20250601] 🎉 Released the inference script and Docker images for both the test and train splits!
[20250522] 📄 Released the preprint version! See the preprint.
If you use SWE‑Dev, please cite:
@article{du2025swedev,
title={SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development},
author={Du, Yaxin and Cai, Yuzhu and Zhou, Yifan and Wang, Cheng and Qian, Yu and Pang, Xianghe and Liu, Qian and Hu, Yue and Chen, Siheng},
journal={arXiv preprint arXiv:2505.16975},
year={2025}
}

Code and dataset are released under the Apache 2.0 license. See the LICENSE file for details.
We thank MASLab for contributing the multi-agent system inference framework, and Llama-Factory and Verl for providing the training frameworks.


