SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios
✨ News | 🔭 Overview | 🛠️ Quick Start | 📚 Citation | 🙏 Acknowledgments
- [2026-06-06] 🏆 We released the online Leaderboard for SecureVibeBench.
- [2026-04-15] 🤗 We support the usage via Hugging Face Datasets.
- [2026-04-11] 🚀 We released code and data for SecureVibeBench.
- [2026-04-07] 🎉 Our paper has been accepted to ACL 2026 Main Conference.
SecureVibeBench is the first SWE-bench-level benchmark for secure vibe coding of agents, consisting of 105 C/C++ coding tasks sourced from real vulnerabilities (OSS-Fuzz/ARVO) covering various projects.
For each task in SecureVibeBench, we reconstruct the real scenario in which a human developer introduced a vulnerability into the codebase, then ask the agent to implement the same requirements. We then check whether the agent introduces the same vulnerability, and whether it introduces any new security issues.
To comprehensively evaluate the generated code of code agents, we conduct (i) functional correctness evaluation, (ii) PoV (proof-of-vulnerability) based dynamic security evaluation, and (iii) SAST-tool based static security evaluation.
Important
Why SecureVibeBench?
- First SWE-bench-level, peer-reviewed benchmark for secure vibe coding.
- Reconstructs coding scenarios in which humans introduced vulnerabilities.
- No other secure coding benchmark jointly considers (i) functional correctness, (ii) PoV-based evaluation, and (iii) SAST-tool-based detection of new security issues.
The SecureVibeBench data is hosted as a Hugging Face dataset. We also keep a local copy of the data in this repository for easy access.
First, set up your API keys by copying the example file and filling in your keys:
cd evaluation
cp .env.example .env
# Edit .env with your API keys
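Before launching a run, it can help to confirm that no key in `.env` was left blank. The following is a minimal sketch, not part of the repository: the helper name `empty_env_keys` is hypothetical, and it assumes `.env` uses plain `KEY=value` lines. It makes no assumption about which key names `.env.example` defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every variable in an env file
# whose value is empty (or only quotes/whitespace).
empty_env_keys() {
  local env_file="$1" key value
  local sq="'" dq='"'
  # The "|| [[ -n ... ]]" clause also handles a last line without a newline.
  while IFS='=' read -r key value || [[ -n "$key" ]]; do
    # Skip blank lines and comment lines.
    if [[ -z "$key" || "$key" == \#* ]]; then
      continue
    fi
    # Strip quotes and whitespace so that KEY="" counts as empty.
    value="${value//$dq/}"
    value="${value//$sq/}"
    value="${value//[[:space:]]/}"
    if [[ -z "$value" ]]; then
      echo "$key"
    fi
  done < "$env_file"
  return 0
}
```

Running `empty_env_keys .env` from the `evaluation` directory prints one key name per line for every entry that still needs a value, and prints nothing when all entries are filled in.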
To evaluate an agent backed by a given LLM, run the following script:
Note
Each instance uses its own Docker image pulled from Docker Hub, so please make sure you have enough free disk space for these images.
cd evaluation/
bash run.sh <AGENT_NAME> <MODEL_NAME> <INSTANCE_ID> # run a single instance
bash run.sh <AGENT_NAME> <MODEL_NAME> ALL # run all instances of SecureVibeBench
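Because every instance pulls its own Docker image, a quick free-space check before an `ALL` run may save a failed pull halfway through. The sketch below is an illustration only: the function names `free_gib` and `has_enough_space` are hypothetical, and the threshold is a placeholder because this README does not state the total size of the images.

```shell
#!/usr/bin/env bash
# Print free space, in whole GiB, on the filesystem that holds the given path.
free_gib() {
  local path="${1:-.}"
  # -P forces POSIX single-line output; -k reports sizes in 1024-byte blocks.
  df -Pk "$path" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if at least MIN_GIB of space is free under PATH.
# MIN_GIB is chosen by the caller; no official figure is assumed here.
has_enough_space() {
  local path="$1" min_gib="$2"
  [ "$(free_gib "$path")" -ge "$min_gib" ]
}
```

For example, `has_enough_space "$(docker info --format '{{.DockerRootDir}}')" 100 || echo "low disk space"` checks the directory where the local Docker daemon stores its data; the value `100` is an arbitrary example, not a measured requirement.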
Currently supported agents and models (other agent frameworks and LLMs are easy to add):
AGENT_NAME=(aider openhands sweagent claudecode codex)
MODEL_NAME=(claude-3-7-sonnet-20250219 claude-sonnet-4-5-20250929 gpt-4.1 gpt-5-2025-08-07 deepseek-chat)
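To sweep several agent and model combinations, the two arrays above can drive a nested loop around `run.sh`. This is a sketch under assumptions: `run_matrix` and the `DRY_RUN` variable are hypothetical names introduced here, and this README does not say whether every agent works with every model, so it defaults to printing the commands rather than running them.

```shell
#!/usr/bin/env bash
AGENT_NAME=(aider openhands sweagent claudecode codex)
MODEL_NAME=(claude-3-7-sonnet-20250219 claude-sonnet-4-5-20250929 gpt-4.1 gpt-5-2025-08-07 deepseek-chat)

# Hypothetical wrapper: call run.sh once per agent/model pair.
# The first argument is an instance ID or ALL, as accepted by run.sh.
# With DRY_RUN=1 (the default) the commands are printed, not executed.
run_matrix() {
  local target="${1:-ALL}" agent model
  for agent in "${AGENT_NAME[@]}"; do
    for model in "${MODEL_NAME[@]}"; do
      if [[ "${DRY_RUN:-1}" == "1" ]]; then
        echo "bash run.sh $agent $model $target"
      else
        bash run.sh "$agent" "$model" "$target"
      fi
    done
  done
}
```

Calling `run_matrix ALL` from `evaluation/` lists the commands for review; `DRY_RUN=0 run_matrix ALL` would execute them. Trimming the arrays to the pairs you actually want is likely wise, given the per-instance Docker images and API costs.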
If you feel our work is helpful, please consider citing:
@misc{chen2026securevibebenchevaluatingsecurecoding,
title={SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios},
author={Junkai Chen and Huihui Huang and Yunbo Lyu and Junwen An and Jieke Shi and Chengran Yang and Ting Zhang and Haoye Tian and Yikun Li and Zhenhao Li and Xin Zhou and Xing Hu and David Lo},
year={2026},
eprint={2509.22097},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2509.22097},
}
Our work would not be possible without the following excellent projects, OSS-Fuzz and ARVO:
@misc{mei2024arvoatlasreproduciblevulnerabilities,
title={ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software},
author={Xiang Mei and Pulkit Singh Singaria and Jordi Del Castillo and Haoran Xi and Abdelouahab Benchikh and Tiffany Bao and Ruoyu Wang and Yan Shoshitaishvili and Adam Doupé and Hammond Pearce and Brendan Dolan-Gavitt},
year={2024},
eprint={2408.02153},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2408.02153},
}
@article{serebryany2017oss,
title={{OSS-Fuzz} - Google's continuous fuzzing service for open source software},
author={Serebryany, Kostya},
year={2017}
}
