- Paper: SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories
- Website & Leaderboard: https://secrepobench.github.io/
SecRepoBench is a repository-level secure code completion benchmark. It contains 318 code completion tasks obtained from 27 popular GitHub C/C++ repositories covering 15 CWEs. Our benchmark can be used to evaluate both standalone LLMs with a context retriever and agent frameworks with access to the entire repository, which gives a comprehensive assessment of different code generation paradigms.
Each code completion task takes a target function with a masked region and the entire repository providing context as inputs to either a standalone LLM with a context retriever or an agent framework, which then generates code to fill the empty region. The generated code is compiled with the full repository and evaluated on two dimensions: correctness using developer-written unit tests and security using Proof-of-Concept exploits from OSS-Fuzz.
For standalone LLM evaluation, SecRepoBench supports three context retrieval methods: BM25, dense-file, and in-file. In the paper, we use BM25 as the default context retriever, which retrieves the top 5 most relevant functions from the repository as the context.
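As a rough illustration of BM25-based context retrieval (a sketch, not the benchmark's actual retriever), the snippet below ranks extracted repository functions against the target function and keeps the top 5, assuming the rank_bm25 package and a naive whitespace tokenization:

# Illustrative sketch of BM25 context retrieval, not SecRepoBench's implementation.
# Assumes repository functions have already been extracted into `repo_functions`
# and that the rank_bm25 package is installed.
from rank_bm25 import BM25Okapi

def retrieve_top_functions(target_function: str, repo_functions: list[str], top_k: int = 5) -> list[str]:
    # Naive whitespace tokenization; a real retriever would use a code-aware tokenizer.
    tokenized_corpus = [fn.split() for fn in repo_functions]
    bm25 = BM25Okapi(tokenized_corpus)
    return bm25.get_top_n(target_function.split(), repo_functions, n=top_k)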
For agent evaluation, we run Aider and OpenHands inside the ARVO container so that the environment provides all dependencies needed to compile the task codebase. This setup gives agents access to the complete repository environment, including all required build systems, dependencies, and testing frameworks. Claude Code could not be run inside the container due to compatibility issues, so we instead clone the task codebase locally and use it as Claude Code's working directory.
SecRepoBench supports four prompt types: no-security-reminder, sec-generic, sec-specific, and security-policy.
- no-security-reminder: this prompt does not give the LLM any reminders to generate secure code.
- sec-generic: this prompt tells the LLM that it is a security expert and asks the LLM to ensure that the generated code is secure.
- sec-specific: this prompt tells the LLM that it is a security expert. It then asks the LLM to ensure that the code does not contain the specific CWE present in the developer-written, pre-patched code, and provides the MITRE description of that CWE.
- security-policy: this prompt provides the LLM with task-specific instructions generated by GPT-4o on how to avoid the CWE present in the developer-written, pre-patched code. This prompt is based on the optional security policy in SecCodePLT.
In the paper, we use no-security-reminder as the default prompt type to reflect realistic code completion scenarios where developers do not know beforehand what vulnerabilities might be introduced, requiring models to identify and prevent vulnerabilities solely by understanding the task context.
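For intuition, the sketch below shows one way the four prompt types could be mapped to system-prompt prefixes; the wording here is invented for illustration, and the actual prompt templates ship with the benchmark:

# Hypothetical mapping from prompt type to a system-prompt prefix (illustrative only).
PROMPT_PREFIXES = {
    "no-security-reminder": "",
    "sec-generic": "You are a security expert. Make sure the generated code is secure.",
    "sec-specific": ("You are a security expert. Make sure the generated code does not "
                     "contain {cwe_id}: {cwe_description}."),  # MITRE description of the target CWE
    "security-policy": "{security_policy}",  # task-specific instructions generated by GPT-4o
}

def build_system_prompt(prompt_type: str, task_prompt: str, **fields) -> str:
    prefix = PROMPT_PREFIXES[prompt_type].format(**fields)
    return f"{prefix}\n\n{task_prompt}".strip()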
Please follow uv's installation instructions to install uv on your platform.
Run the following command to install dependencies required by SecRepoBench:
cd SecRepoBench
uv sync

To install dependencies for the Claude Code agent framework, please run the following command:
curl -fsSL https://claude.ai/install.sh | bash

SecRepoBench requires API keys for the language models you plan to use. Please set the following environment variables:
export OPENAI_API_KEY=<YOUR_API_KEY>
export ANTHROPIC_API_KEY=<YOUR_API_KEY>
export GEMINI_API_KEY=<YOUR_API_KEY>
export TOGETHER_API_KEY=<YOUR_API_KEY>

Unzip the metadata used during inference and evaluation:
gunzip -k report.json.gz sample_metadata.json.gz

These two files contain the metadata required for all benchmark tasks.
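To confirm the metadata unpacked correctly, a quick sanity check such as the following loads both files (their internal structure is not documented here, so only successful parsing is verified):

import json

# Both files should parse as JSON after `gunzip -k`.
for path in ("report.json", "sample_metadata.json"):
    with open(path) as f:
        data = json.load(f)
    print(f"{path}: {type(data).__name__} with {len(data)} top-level entries")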
To run inference using SecRepoBench:
uv run run_inference.py \
--agents [YOUR_AGENT_NAMES] \
--model-names [YOUR_MODEL_NAMES] \
--prompt-types [YOUR_PROMPT_TYPES] \
--context-types [YOUR_CONTEXT_TYPES] \
[--rerun]

- Agent names: none (without using an agent framework), aider, openhands, claudecode
- Model names: defined in assets/constants.py
- Prompt types: no-security-reminder, sec-generic, sec-specific, security-policy
- Context types (this option is disabled when using an agent framework): BM25, dense-file, in-file
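For example, to run the paper's default standalone setting (no agent framework, BM25 retrieval, no security reminder) with a model registered in assets/constants.py such as gpt-5:

uv run run_inference.py \
    --agents none \
    --model-names gpt-5 \
    --prompt-types no-security-reminder \
    --context-types BM25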
📁 Code completions are saved in the completions/ directory.
📁 Trajectories are saved in the .{agent}/ (e.g., .openhands/) directory.
To evaluate the model completions:
uv run run_eval.py \
--agents [YOUR_AGENT_NAMES] \
--model-names [YOUR_MODEL_NAMES] \
--prompt-types [YOUR_PROMPT_TYPES] \
--context-types [YOUR_CONTEXT_TYPES] \
[--rerun]

📁 Evaluation results are saved in the eval_results/ directory.
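For example, to evaluate the completions produced by the standalone run shown earlier, pass the same flags:

uv run run_eval.py \
    --agents none \
    --model-names gpt-5 \
    --prompt-types no-security-reminder \
    --context-types BM25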
SecRepoBench allows you to test new models and agent frameworks. The process differs depending on the setup.
To add a new model for standalone evaluation, you need to register it in the ./assets/constants.py file:
1. Add your model name and its corresponding snapshot/version to the MODELS dictionary.
MODELS = {
'gpt-5': 'gpt-5-2025-08-07',
...
}

2. Configure model-specific settings
For example, add an OpenAI model snapshot/version to the OPENAI_REASONING_MODELS list to enable its reasoning ability.
OPENAI_REASONING_MODELS = [
'gpt-5-2025-08-07',
...
]

Also check ./tools/patcher.py to see whether any provider-specific logic in the patchers needs to be updated to support your model's API format or special requirements.
If the new agent can be configured correctly and run smoothly inside the ARVO container (Ubuntu 16.04) with all necessary dependencies:
1. Add your agent's installation commands to the AGENT_INSTALL_COMMANDS dictionary in ./assets/constants.py
These commands will be executed in the Docker container during setup. Remember to add an \n at the end of each command.
AGENT_INSTALL_COMMANDS = {
"aider": " uv pip install aider-chat==0.86.1\n",
...
}

2. Implement a harness under ./harnesses/ following the format of existing harnesses like openhands_harness.py or aider_harness.py
Your harness should expose a standard interface: an agent runner class with a run() method that performs the inference.
# Example harness
import argparse
from pathlib import Path

class YourAgentRunner:
    def __init__(self, model_name, prompt_type):
        # Initialize your agent with the model and its configuration
        self.model_name = model_name
        self.prompt_type = prompt_type

    def run(self, system_prompt, repo_folder, changed_file):
        # Execute the agent workflow
        # Return the diff and the completed code of the target file
        return diff, content

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-name', type=str)
    parser.add_argument('--prompt-type', type=str)
    parser.add_argument('--system-prompt', type=str)
    parser.add_argument('--repo-folder', type=str)
    parser.add_argument('--changed-file', type=str)
    ...
    args = parser.parse_args()

    client = YourAgentRunner(args.model_name, args.prompt_type)
    diff, content = client.run(
        args.system_prompt, args.repo_folder, args.changed_file)

    # Save outputs
    Path(f'/diff/your-agent-{args.model_alias}-...diff').write_text(diff)
    Path(f'/completions/your-agent-{args.model_alias}-...txt').write_text(content)

if __name__ == "__main__":
    main()

If the agent is unable to run inside the ARVO container due to dependency issues:
1. Create a local harness under ./harnesses/ following claudecode_harness.py
The harness should re-initialize the repository (removing .git and running git init) to prevent the agent from accessing old commits (a minimal sketch of this step follows this list).
2. Add a patcher class to ./tools/patcher.py
Create a patcher class that handles both inference and caching, following the pattern of ClaudeCodePatcher.
3. Initialize your patcher in ./tools/preprocessor.py
Add the patcher to the process_id() function so it is included in the workflow.
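As referenced in step 1 above, the repository re-initialization could look roughly like the following sketch (paths and error handling are simplified, and the optional initial commit is an assumption that depends on the agent):

import shutil
import subprocess
from pathlib import Path

def reinit_repo(repo_folder: str) -> None:
    # Drop the existing history so the agent cannot access old commits.
    git_dir = Path(repo_folder) / ".git"
    if git_dir.exists():
        shutil.rmtree(git_dir)
    subprocess.run(["git", "init"], cwd=repo_folder, check=True)
    # Depending on the agent, an initial commit of the current tree may also be needed:
    # subprocess.run(["git", "add", "-A"], cwd=repo_folder, check=True)
    # subprocess.run(["git", "commit", "-m", "init"], cwd=repo_folder, check=True)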
@article{shen2025secrepobench,
title={SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories},
author={Shen, Chihao and Dilgren, Connor and Chiniya, Purva and Griffith, Luke and Ding, Yu and Chen, Yizheng},
journal={arXiv preprint arXiv:2504.21205},
year={2025}
}