
🛡️ SecRepoBench

📝 Overview

SecRepoBench is a repository-level secure code completion benchmark. It contains 318 code completion tasks obtained from 27 popular GitHub C/C++ repositories covering 15 CWEs. Our benchmark can be used to evaluate both standalone LLMs with a context retriever and agent frameworks with access to the entire repository, which gives a comprehensive assessment of different code generation paradigms.

💡 Framework

Framework diagram

Each code completion task provides a target function with a masked region, together with the entire repository as context, to either a standalone LLM with a context retriever or an agent framework, which then generates code to fill the masked region. The generated code is compiled with the full repository and evaluated on two dimensions: correctness, using developer-written unit tests, and security, using Proof-of-Concept exploits from OSS-Fuzz.

For standalone LLM evaluation, SecRepoBench supports three context retrieval methods: BM25, dense-file, and in-file. In the paper, we use BM25 as the default context retriever, which retrieves the top 5 most relevant functions from the repository as the context.
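
The sketch below illustrates the top-k function retrieval idea behind the BM25 retriever. It assumes the third-party rank_bm25 package and simple whitespace tokenization; it is not SecRepoBench's actual retriever implementation.

# Illustrative BM25 retrieval sketch (not the benchmark's actual retriever).
# Assumes the third-party rank_bm25 package and whitespace tokenization.
from rank_bm25 import BM25Okapi

def retrieve_top_functions(repo_functions, masked_function, k=5):
    """Return the k repository functions most relevant to the masked target."""
    tokenized_corpus = [fn.split() for fn in repo_functions]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(masked_function.split())
    # Rank candidate functions by BM25 score and keep the top k.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [repo_functions[i] for i in ranked[:k]]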

For agent evaluation, we run Aider and OpenHands inside the ARVO container so that the environment provides all dependencies needed to compile the task codebase. This setup gives agents access to the complete repository environment, including all required build systems, dependencies, and testing frameworks. Claude Code could not be run inside the container due to compatibility issues, so we instead clone the task codebase locally and use it as Claude Code's working directory.

SecRepoBench supports four prompt types: no-security-reminder, sec-generic, sec-specific, and security-policy.

  • no-security-reminder: this prompt does not give the LLM any reminders to generate secure code.
  • sec-generic: this prompt tells the LLM that it is a security expert, and asks the LLM to ensure that the generated code is secure.
  • sec-specific: this prompt tells the LLM that it is a security expert. It then asks the LLM to ensure that the code does not contain the specific CWE present in the developer-written, pre-patched code, and provides the MITRE description of that CWE.
  • security-policy: this prompt provides the LLM with task-specific instructions generated by GPT-4o on how to avoid the CWE present in the developer-written, pre-patched code. This prompt is based on the optional security policy in SecCodePLT.

In the paper, we use no-security-reminder as the default prompt type to reflect realistic code completion scenarios where developers do not know beforehand what vulnerabilities might be introduced, requiring models to identify and prevent vulnerabilities solely by understanding the task context.
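
To make the four prompt types concrete, the mapping below paraphrases them as reminder templates. The strings are illustrative only; the actual prompt templates ship with SecRepoBench, and the brace placeholders are hypothetical names.

# Illustrative paraphrase of the four prompt types. These strings are NOT the
# benchmark's actual templates; the brace placeholders are hypothetical.
SECURITY_REMINDERS = {
    "no-security-reminder": "",
    "sec-generic": (
        "You are a security expert. Ensure that the generated code is secure."
    ),
    "sec-specific": (
        "You are a security expert. Ensure that the code does not contain "
        "{cwe_id}: {mitre_description}."
    ),
    # Task-specific instructions generated by GPT-4o on how to avoid the CWE.
    "security-policy": "{security_policy}",
}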

⚙️ Configuration

1. Install uv

Please follow uv's installation instructions for your platform.

2. Install dependencies

Run the following command to install dependencies required by SecRepoBench:

cd SecRepoBench
uv sync

To install the Claude Code agent framework, run the following command:

curl -fsSL https://claude.ai/install.sh | bash

3. Set Environment Variables

SecRepoBench requires API keys for the language models you plan to use. Please set the following environment variables:

export OPENAI_API_KEY=<YOUR_API_KEY>
export ANTHROPIC_API_KEY=<YOUR_API_KEY>
export GEMINI_API_KEY=<YOUR_API_KEY>
export TOGETHER_API_KEY=<YOUR_API_KEY>

4. Extract metadata files

Unzip the metadata files that will be used during inference and evaluation:

gunzip -k report.json.gz sample_metadata.json.gz

These two files contain the metadata required for all benchmark tasks.
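
As a quick sanity check that extraction worked, both files can be loaded as ordinary JSON. Their exact schema is defined by SecRepoBench and not assumed here:

# Sanity-check the extracted metadata files; their exact schema is defined by
# SecRepoBench, so this only confirms that they parse as JSON.
import json

with open("report.json") as f:
    report = json.load(f)
with open("sample_metadata.json") as f:
    sample_metadata = json.load(f)

print(f"report.json entries: {len(report)}")
print(f"sample_metadata.json entries: {len(sample_metadata)}")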

🚀 Running Inference

To run inference using SecRepoBench:

uv run run_inference.py \
  --agents [YOUR_AGENT_NAMES] \
  --model-names [YOUR_MODEL_NAMES] \
  --prompt-types [YOUR_PROMPT_TYPES] \
  --context-types [YOUR_CONTEXT_TYPES] \
  [--rerun]
  • Agent names:
    • none (run a standalone LLM without an agent framework)
    • aider
    • openhands
    • claudecode
  • Model names: Defined in assets/constants.py
  • Prompt types:
    • no-security-reminder
    • sec-generic
    • sec-specific
    • security-policy
  • Context types (this option is disabled when using an agent framework):
    • BM25
    • dense-file
    • in-file

📁 Code completions are saved in the completions/ directory.

📁 Trajectories are saved in the .{agent}/ (e.g., .openhands/) directory.

📊 Running Evaluation

To evaluate the model completions:

uv run run_eval.py \
  --agents [YOUR_AGENT_NAMES] \
  --model-names [YOUR_MODEL_NAMES] \
  --prompt-types [YOUR_PROMPT_TYPES] \
  --context-types [YOUR_CONTEXT_TYPES] \
  [--rerun]

📁 Evaluation results are saved in the eval_results/ directory.

🔧 Testing New Models and Agents

SecRepoBench allows you to test new models and agent frameworks. The process differs depending on the setup.

Standalone LLMs

To add a new model for standalone evaluation, you need to register it in the ./assets/constants.py file:

1. Add your model name and its corresponding snapshot/version to the MODELS dictionary.

MODELS = {
    'gpt-5': 'gpt-5-2025-08-07',
    ...
}

2. Configure model-specific settings

For example, add an OpenAI model snapshot/version to the OPENAI_REASONING_MODELS list to enable its reasoning ability.

OPENAI_REASONING_MODELS = [
  'gpt-5-2025-08-07',
  ...
]

Also check ./tools/patcher.py to see whether any provider-specific logic in the patchers needs to be updated to support your model's API format or special requirements.

Agents

Option 1: Agent Running in Docker

If the new agent can be configured correctly and run smoothly inside the ARVO container (Ubuntu 16.04) with all necessary dependencies:

1. Add your agent's installation commands to the AGENT_INSTALL_COMMANDS dictionary in ./assets/constants.py

These commands will be executed in the Docker container during setup. Remember to add an \n at the end of each command.

AGENT_INSTALL_COMMANDS = {
    "aider": "  uv pip install aider-chat==0.86.1\n",
    ...
}

2. Implement a harness under ./harnesses/ following the format of existing harnesses like openhands_harness.py or aider_harness.py

Your harness should expose a standard interface: an agent class with a run() method that performs the inference.

# Example harness
import argparse
from pathlib import Path


class YourAgentRunner:
    def __init__(self, model_name, prompt_type):
        # Initialize your agent with model and configurations
        self.model_name = model_name
        self.prompt_type = prompt_type
    
    def run(self, system_prompt, repo_folder, changed_file):
        # Execute agent workflow
        # Return diff and completed code of the target file
        return diff, content

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--model-name', type=str)
    parser.add_argument('--system-prompt', type=str)
    parser.add_argument('--repo-folder', type=str)
    parser.add_argument('--changed-file', type=str)
    ...
    
    args = parser.parse_args()
    client = YourAgentRunner(args.model_name, args.prompt_type)
    diff, response = client.run(
        args.system_prompt, args.repo_folder, args.changed_file)
    
    # Save outputs
    Path(f'/diff/your-agent-{args.model_alias}-...diff').write_text(diff)
    Path(f'/completions/your-agent-{args.model_alias}-...txt').write_text(response)

if __name__ == "__main__":
    main()

Option 2: Agent Running Locally

If the agent is unable to run inside the ARVO container due to dependency issues:

1. Create a local harness under ./harnesses/ following claudecode_harness.py

The harness should re-initialize the repository (remove .git and run git init) so that the agent cannot access old commits.
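
A minimal sketch of this re-initialization step is shown below; claudecode_harness.py is the authoritative reference for how SecRepoBench actually does it.

# Minimal sketch of the repository re-initialization step; see
# claudecode_harness.py for the actual implementation.
import shutil
import subprocess
from pathlib import Path

def reinit_repo(repo_folder: str) -> None:
    """Drop the existing git history so the agent cannot inspect old commits."""
    shutil.rmtree(Path(repo_folder) / ".git", ignore_errors=True)
    subprocess.run(["git", "init"], cwd=repo_folder, check=True)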

2. Add a patcher class to ./tools/patcher.py

Create a patcher class that handles both inference and caching, following the pattern of ClaudeCodePatcher.
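
For orientation, here is a bare, hypothetical skeleton of such a patcher. The class and method names below are illustrative placeholders rather than the interface SecRepoBench requires; ClaudeCodePatcher in ./tools/patcher.py is the authoritative reference.

# Hypothetical skeleton only: method names are placeholders, not the required
# interface. Mirror ClaudeCodePatcher in ./tools/patcher.py instead.
import json
from pathlib import Path

class YourAgentPatcher:
    def __init__(self, model_name, prompt_type, cache_dir="cache"):
        self.model_name = model_name
        self.prompt_type = prompt_type
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)

    def _cache_path(self, task_id):
        # One cache entry per (model, task) pair so reruns can be skipped.
        return self.cache_dir / f"{self.model_name}-{task_id}.json"

    def generate(self, task_id, system_prompt, repo_folder, changed_file):
        """Run the local agent harness, caching its output for reuse."""
        cache_file = self._cache_path(task_id)
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        # ... invoke the local harness here and collect its diff/completion ...
        result = {"diff": "", "completion": ""}
        cache_file.write_text(json.dumps(result))
        return result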

3. Initialize your patcher in ./tools/preprocessor.py

Add the patcher to process_id() function to ensure proper workflow.

📖 Citation

@article{shen2025secrepobench,
  title={SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories},
  author={Shen, Chihao and Dilgren, Connor and Chiniya, Purva and Griffith, Luke and Ding, Yu and Chen, Yizheng},
  journal={arXiv preprint arXiv:2504.21205},
  year={2025}
}
