Note
BouncerBench Multilingual, with additional tasks covering more programming languages and ticket types, will be released soon!
Important
To make a submission to the BouncerBench Leaderboard, please follow this README.
BouncerBench is a benchmark for evaluating the ability of Large Language Models (LLMs) to abstain when presented with ambiguous tasks or when a sufficiently accurate response cannot be provided. Trust and reliability are critical to the success of AI agents in software engineering tasks, and BouncerBench introduces a notion of "bouncers" that could be applied to any autonomous ticket resolution system.
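The "bouncer" idea can be sketched in a few lines of Python. This is illustrative only: the names, threshold, and confidence source below are assumptions for exposition, not the benchmark's actual interface.

```python
# Conceptual sketch only: a "bouncer" gates an autonomous ticket-resolution
# system, abstaining when the ticket is too ambiguous to act on. The confidence
# score would come from an LLM judgment in practice; here it is a stub input.
from dataclasses import dataclass

@dataclass
class BouncerDecision:
    proceed: bool
    reason: str

def input_bouncer(ticket_text: str, confidence: float, threshold: float = 0.5) -> BouncerDecision:
    """Decide whether a ticket is well-specified enough to attempt."""
    if not ticket_text.strip():
        return BouncerDecision(False, "empty ticket")
    if confidence < threshold:
        return BouncerDecision(False, "ticket too ambiguous; abstain")
    return BouncerDecision(True, "ticket is actionable")

print(input_bouncer("Fix crash in parser on empty input", 0.9))
print(input_bouncer("doesn't work, please fix", 0.2))
```

An output bouncer works the same way, except the decision is made about a candidate patch rather than the incoming ticket.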
We use venv to create a virtual environment for this project. Make sure you have Python and pip installed; the following steps assume Python 3.10 or later and a Linux environment.
```bash
# Create a virtual environment
python -m venv venv
# Activate the virtual environment
source venv/bin/activate
# Install the required packages
pip install -r requirements.txt
```
Create a .env file in the root directory of the repository with the following content:
```bash
ENDPOINT_URL="https://xyz.openai.azure.com/"
AZURE_OPENAI_API_KEY="xyz"
OPENROUTER_API_KEY="sk-xyz"
OLLAMA_ENDPOINT="http://xyz:11434"
```
Make sure to replace the values with your actual API keys and endpoint URLs. If you are not using a provider, you can leave its placeholder as is.
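The experiment scripts read these variables at runtime. As an optional sanity check, here is a minimal stdlib sketch (not the repository's actual loader) that parses the .env file and reports which of the keys above are missing:

```python
# Minimal .env sanity check: parses simple KEY="value" lines and reports
# missing provider variables. Variable names match the .env example above.
import os

REQUIRED = ["ENDPOINT_URL", "AZURE_OPENAI_API_KEY", "OPENROUTER_API_KEY", "OLLAMA_ENDPOINT"]

def load_env_file(path: str = ".env") -> dict:
    """Parse KEY="value" lines; ignores comments, blanks, and malformed lines."""
    env = {}
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue
                key, _, value = line.partition("=")
                env[key.strip()] = value.strip().strip('"')
    except FileNotFoundError:
        pass
    return env

env = load_env_file()
missing = [k for k in REQUIRED if k not in env and k not in os.environ]
print("missing:", missing)
```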
```bash
npm install -g @openai/codex
```
Edit the Codex config file located at ~/.config/codex/config.json to include your OpenAI API key. In our case, since we use Azure OpenAI, the config file looks like this:
```json
{
  "model": "o4-mini",
  "provider": "azure",
  "providers": {
    "azure": {
      "name": "AzureOpenAI",
      "baseURL": "https://xyz.azure.com/openai",
      "envKey": "AZURE_OPENAI_API_KEY"
    }
  }
}
```
Please refer to the Codex CLI documentation for more details on how to configure the CLI.
Fetching the data can take a few hours and a sizable amount of disk space; you may skip this step and use the pre-fetched data directly. Download the file from our release:
```bash
# Download the pre-fetched all_patches.csv (987MB)
wget https://github.com/uw-swag/BouncerBench/releases/download/paper/all_patches.csv -O ./data/all_patches.csv
```
The file contains all patches from every submission to SWE-Bench considered in the paper (up to April 4th, 2025).
NOTE: The following steps will overwrite the "./data/all_patches.csv" file (if it exists) with newer data than we considered.
SWE-Bench submissions are saved to a public S3 bucket. You need to configure the AWS CLI with your credentials to access the bucket, as described in the original SWE-Bench Experiments repo.
To fetch submissions to all SWE-Bench Leaderboards, run the following commands:
```bash
# Clone SWE-Bench Experiments repository to ../experiments
git clone https://github.com/SWE-bench/experiments.git ../experiments
cd ../experiments
# Fetch all submissions (requires AWS CLI configured)
python -m analysis.download_logs evaluation/verified --skip_existing --only_log
python -m analysis.download_logs evaluation/lite --skip_existing --only_log
python -m analysis.download_logs evaluation/test --skip_existing --only_log
# switch back to the root of this repository
cd -
```
Fetching all submissions can take a while and consumes around 5.4GB of disk space (trajectories are excluded by --only_log).
Now we can recreate ./data/all_patches.csv file by collecting required data from the fetched submissions at "../experiments":
```bash
python 0_process_experiments.py
```
The annotation data is already included in this repository, but if you want to fetch the latest annotations, you can do so from OpenAI SWE-Bench Annotations. After downloading and extracting the zip file, move "ensembled_annotations_public.csv" to the "./data" directory.
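Optionally, you can sanity-check the regenerated file without loading all 987MB into memory. This helper is illustrative only and assumes nothing about the column names the processing script emits:

```python
# Optional sanity check: stream the CSV once to read the header row and count
# data rows, without holding the (large) file in memory.
import csv
import os

def peek_csv(path: str):
    """Return (header, row_count) for a CSV file, or (None, 0) if absent."""
    if not os.path.exists(path):
        return None, 0
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, None)
        count = sum(1 for _ in reader)
    return header, count

header, count = peek_csv("data/all_patches.csv")
print(header, count)
```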
Run all cells in 1_create_input_dataset.ipynb to construct the BouncerBench dataset. This notebook will create ./dataset/input_bouncer.csv file, which contains the input tasks for BouncerBench.
Run all cells in 2_create_output_dataset.ipynb to construct the output tasks for BouncerBench. This notebook will create ./dataset/random_sample_bouncer.csv file, which contains the output tasks for BouncerBench.
Run all cells in 3_create_bouncerbench_lite.ipynb to construct the BouncerBench Lite dataset. This notebook will create ./dataset/bouncer_bench_lite.csv file, which contains the BouncerBench Lite dataset.
Please update the model_list variable in 4_simple_bouncer_experiments.py to reflect the providers you want to use for the experiments. The default is set to use the same configuration we ran experiments with (Azure for OpenAI models, Ollama for open models, and OpenRouter for Anthropic models).
Note: The exact prompts used can be found in the prompts directory.
The script can be run as follows:
```bash
# Run input bouncer experiments
python 4_simple_bouncer_experiments.py --input
# Run output bouncer experiments
python 4_simple_bouncer_experiments.py --output
# add the --codex flag to run the experiments using the Agentic Bouncer (Codex CLI)
python 4_simple_bouncer_experiments.py --input --codex
python 4_simple_bouncer_experiments.py --output --codex
```
The outputs will be saved to the ./outputs/ directory. We have already included the outputs for the core experiments in this repository, so you can skip running these commands if you just want to view the results.
Naming scheme for the output files:
- input_bouncer_{USED_MODEL}.json: Contains traces for the input bouncer experiments.
- output_bouncer_{USED_MODEL}.json: Contains traces for the output bouncer experiments.
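Given that naming scheme, the trace files can be enumerated programmatically. The sketch below relies only on the file names described above and assumes nothing about the JSON contents:

```python
# Group experiment trace files in ./outputs/ by bouncer type, using the
# input_bouncer_{MODEL}.json / output_bouncer_{MODEL}.json naming scheme.
import re
from pathlib import Path

PATTERN = re.compile(r"^(input|output)_bouncer_(?P<model>.+)\.json$")

def group_traces(outputs_dir: str = "outputs"):
    """Return a mapping from bouncer type to the list of model names found."""
    groups = {"input": [], "output": []}
    for path in sorted(Path(outputs_dir).glob("*_bouncer_*.json")):
        match = PATTERN.match(path.name)
        if match:
            groups[match.group(1)].append(match.group("model"))
    return groups

print(group_traces())
```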
To run the agent experiments, you first need to create the corresponding output bouncer datasets (with patches only from a certain submission).
```bash
python 5_create_agent_output_dataset.py
```
This should create the ./dataset/agent_output/ directory with the following files:
```
.
└── agent_output
    ├── OpenHands_bouncer.csv
    ├── amazon-q_bouncer.csv
    └── sweagent_bouncer.csv
```
You can then run the agent experiments as follows:
```bash
python 4_simple_bouncer_experiments.py --agents
```
This will run the experiments for the custom agents and save the outputs to the ./outputs/ directory with the following naming scheme:
agent_output_bouncer_codex_{AGENT_NAME}.json
Note: for this experiment, the best input bouncer (selected_input_bouncer = "outputs/input_bouncer_o4-mini.json") and the best output bouncer (the Codex CLI Agent) are used. Only the output bouncing needs to be rerun, because the patches to be evaluated differ.
We have included several scripts to analyze the results of the experiments and visualize the data.
- agent_use.py: Gives insights about tool-calling usage by the agents. Also looks at flips in decisions between o4-mini and Codex experiments.
- agreement.py: Analyzes the agreement between different models and the human annotators for input bouncing.
- construct_sankey.py: Creates the Sankey diagram in Fig. 4 to visualize the flow of BouncerBench Lite tickets with both input and output bouncing.
- impact_of_issue_length.py: Used to construct Fig. 2.
- impact_of_patch_length.py: Used to construct Fig. 3.
- process_agents.py: Used to collect data for Table IV from evaluatable instances in each of the 3 submissions analyzed.
- process_results.py: Computes classwise precision, recall, and F1 for the input and output bouncing experiments.
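As a reference for what "classwise" means here, the sketch below re-implements per-class precision, recall, and F1 with the stdlib; the actual metric code in process_results.py may differ, and the example labels are hypothetical:

```python
# Classwise precision/recall/F1: each class is treated as the positive label
# in turn, and TP/FP/FN are counted against the paired predictions.
def classwise_prf(y_true, y_pred):
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Hypothetical bounce/accept decisions scored against human annotations
truth = ["bounce", "accept", "bounce", "accept"]
pred = ["bounce", "bounce", "bounce", "accept"]
print(classwise_prf(truth, pred))
```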
