SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

✨ News | 🔭 Overview | 🛠️ Quick Start | 📚 Citation | 🙏 Acknowledgments

✨ News

[2026-06-06] 🏆 We released the online Leaderboard for SecureVibeBench.
[2026-04-15] 🤗 We support the usage via Hugging Face Datasets.
[2026-04-11] 🚀 We released code and data for SecureVibeBench.
[2026-04-07] 🎉 Our paper has been accepted to ACL 2026 Main Conference.

🔭 Overview

SecureVibeBench is the first SWE-bench-level benchmark for secure vibe coding of agents, consisting of 105 C/C++ coding tasks sourced from real vulnerabilities (OSS-Fuzz/ARVO) covering various projects.

For each task in SecureVibeBench, we reconstruct the real scenario in which a human developer introduced a vulnerability into the codebase. We then ask the agent to implement the same requirements and check whether it introduces the same vulnerability, and possibly new security issues as well.

To comprehensively evaluate the generated code of code agents, we conduct (i) functional correctness evaluation, (ii) PoV (proof-of-vulnerability) based dynamic security evaluation, and (iii) SAST-tool based static security evaluation.

Important

Why SecureVibeBench?

First SWE-bench-level, peer-reviewed benchmark for secure vibe coding.
Reconstructs the coding scenarios in which humans introduced vulnerabilities.
No other secure coding benchmark considers all of (i) functional correctness, (ii) PoV-based evaluation, and (iii) SAST tool-based detection of new security issues.

🛠️ Quick Start

The SecureVibeBench data is hosted as a Hugging Face dataset. We also keep a local copy of the data in this repository for easy access.

First, set up your API keys by copying the example file and filling in your keys:

cd evaluation
cp .env.example .env
# Edit .env with your API keys

To evaluate an agent backed by a given LLM, run the following commands:

Note

Each instance comes with its own Docker image, pulled from Docker Hub, so please make sure you have enough disk space for these images.

cd evaluation/
bash run.sh <AGENT_NAME> <MODEL_NAME> <INSTANCE_ID> # run a single instance
bash run.sh <AGENT_NAME> <MODEL_NAME> ALL            # run all instances of SecureVibeBench
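Because each instance pulls its own Docker image, it may help to check free disk space before starting a full run. The following is a minimal sketch, not part of the repository: the helper name `has_free_gb`, the `DOCKER_ROOT` variable, and the 100 GB threshold are all assumptions chosen for illustration, and the actual space needed is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with SecureVibeBench):
# succeed only if the filesystem holding PATH has at least MIN_GB gigabytes free.
has_free_gb() {
  local min_gb="$1" path="${2:-.}"
  local avail_kb
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # on the second line, column 4 is the available space.
  avail_kb="$(df -Pk "$path" | awk 'NR==2 {print $4}')"
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Assumption: DOCKER_ROOT points at the directory where Docker stores images.
# It defaults to the current directory when unset.
# The 100 GB threshold is an arbitrary placeholder, not a project requirement.
if has_free_gb 100 "${DOCKER_ROOT:-.}"; then
  echo "Enough free space to proceed."
else
  echo "Less than 100 GB free; consider freeing space before running ALL."
fi
```

If Docker is already installed, `docker system df` reports how much space existing images and containers occupy.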

The currently supported agents and models are listed below. It is easy to extend the evaluation to other agent frameworks and LLMs.

AGENT_NAME=(aider openhands sweagent claudecode codex)
MODEL_NAME=(claude-3-7-sonnet-20250219 claude-sonnet-4-5-20250929 gpt-4.1 gpt-5-2025-08-07 deepseek-chat)
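To compare several agent and model combinations, the `run.sh` invocation shown earlier can be wrapped in a loop. The sketch below is an illustration under assumptions: the function `plan_runs` and the arrays `AGENTS` and `MODELS` are hypothetical names, and the chosen subsets are arbitrary picks from the lists above. It only prints the commands, so nothing is executed by default.

```shell
#!/usr/bin/env bash
# Hypothetical sweep helper (not shipped with SecureVibeBench).
# The subsets below are arbitrary picks from the supported agents and models.
AGENTS=(aider openhands)
MODELS=(gpt-4.1 deepseek-chat)

# Print one run.sh invocation per agent/model pair.
# The first argument is an instance ID, or ALL for the whole benchmark.
plan_runs() {
  local instance="${1:-ALL}" agent model
  for agent in "${AGENTS[@]}"; do
    for model in "${MODELS[@]}"; do
      echo "bash run.sh $agent $model $instance"
    done
  done
}

# Dry run: list the planned commands without executing them.
plan_runs ALL
```

To actually launch the runs, one option is to pipe the output into `bash` from inside the `evaluation/` directory, for example `plan_runs ALL | bash`. The runs then execute one after another.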

📚 Citation

If you find our work helpful, please consider citing:

@misc{chen2026securevibebenchevaluatingsecurecoding,
      title={SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios}, 
      author={Junkai Chen and Huihui Huang and Yunbo Lyu and Junwen An and Jieke Shi and Chengran Yang and Ting Zhang and Haoye Tian and Yikun Li and Zhenhao Li and Xin Zhou and Xing Hu and David Lo},
      year={2026},
      eprint={2509.22097},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2509.22097}, 
}

🙏 Acknowledgments

Our work builds on the following excellent projects, OSS-Fuzz and ARVO:

@misc{mei2024arvoatlasreproduciblevulnerabilities,
      title={ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software}, 
      author={Xiang Mei and Pulkit Singh Singaria and Jordi Del Castillo and Haoran Xi and Abdelouahab Benchikh and Tiffany Bao and Ruoyu Wang and Yan Shoshitaishvili and Adam Doupé and Hammond Pearce and Brendan Dolan-Gavitt},
      year={2024},
      eprint={2408.02153},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2408.02153}, 
}

@article{serebryany2017oss,
  title={$\{$OSS-Fuzz$\}$-Google's continuous fuzzing service for open source software},
  author={Serebryany, Kostya},
  year={2017}
}
