CodeFlowBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on multi-turn, dependency-aware, and iterative code generation tasks. Unlike traditional benchmarks that focus on single-function generation, CodeFlowBench tests a model's ability to maintain context, handle complex dependencies, and evolve code over multiple turns.
The benchmark consists of two subsets:
- CodeFlowBench-Comp (Competitive): focuses on complex competitive programming problems.
- CodeFlowBench-Repo: focuses on domain-specific, real-world programming problems drawn from GitHub repositories.
```
codeflowbench/
├── data/                    # Dataset files (JSON)
├── models/                  # Local model checkpoints (optional)
├── scripts/                 # Bash scripts for running evaluation
├── src/                     # Source code for inference and harness
│   ├── harness.py           # Evaluation logic
│   └── utils.py             # Utility functions
├── requirements.txt         # Core dependencies for all benchmarks
├── requirements_repo.txt    # Additional dependencies for the Repo benchmark
└── README.md
```
First, clone the repository and set up the Conda environment:
```bash
cd codeflowbench
conda create -n codeflowbench python=3.10
conda activate codeflowbench
```
Install the dependencies:
```bash
# For CodeFlowBench (standard evaluation)
pip install -r requirements.txt

# [Optional] For CodeFlowBench-Repo
# Installs additional libraries required for executing domain-specific code
pip install -r requirements_repo.txt
```
You can either use Hugging Face model paths directly or place your local model weights inside the `models/` folder.
- Example path: `models/Llama-3.1-8B-Instruct`
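To illustrate the "local path or hub ID" convention above, here is a minimal helper that prefers a checkpoint under `models/` and otherwise treats the argument as a Hugging Face hub ID. This is a hypothetical sketch; the benchmark's own scripts may resolve paths differently.

```python
import os

def resolve_model_path(name_or_path: str, models_dir: str = "models") -> str:
    """Resolve a model argument to a usable path or hub ID.

    Hypothetical helper: prefers an existing local directory, then a
    checkpoint placed under models/, and otherwise assumes the string
    is a Hugging Face hub ID.
    """
    if os.path.isdir(name_or_path):
        return name_or_path  # already a valid local path
    local = os.path.join(models_dir, os.path.basename(name_or_path))
    if os.path.isdir(local):
        return local  # checkpoint placed under models/
    return name_or_path  # assume it is a hub ID
```

For example, `resolve_model_path("Llama-3.1-8B-Instruct")` returns `models/Llama-3.1-8B-Instruct` when that folder exists, and passes the string through unchanged otherwise.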
Ensure the dataset files are located in the `./data` directory. It should typically contain:
- `codeflowbench_comp_test.json`
- `codeflowbench_repo.json`
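A quick way to sanity-check the dataset files before running evaluation is to load them and count the entries. The sketch below assumes each file is a JSON array of problem entries; the per-entry schema is not documented here, so adjust as needed.

```python
import json

def load_dataset(path: str) -> list:
    """Load a CodeFlowBench dataset file and verify its top-level shape.

    Assumes the file is a JSON array of problem entries (an assumption;
    check the actual release format).
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array in {path}")
    return data
```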
We provide convenient Bash scripts to automate the inference and evaluation process. The default scripts use Llama-3.1-8B-Instruct as an example.
Multi-turn Evaluation (Core): evaluates the model's ability to generate code iteratively with dependencies.

```bash
bash scripts/test_multi_turn.sh
```

Single-turn Evaluation: evaluates the model in a standard single-turn setting for comparison.

```bash
bash scripts/test_single_turn.sh
```
The process for CodeFlowBench-Repo is similar to the standard evaluation, with the following adjustments:
- Choose `harness_repo.py` and `codeflowbench_repo.json` in the bash script.
- Change the import in `inference.py` to `utils_(api)_repo`.
- Run the bash script.
The evaluation logs and final scores are saved in the `result` directory.
Filename format: `{model_name}_{mode}.json`
Example: `result/Llama-3.1-8B-Instruct_multi_turn.json`
Result Content: Each entry contains the generated code, execution logs, and the pass/fail status for each turn.
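Given per-turn pass/fail statuses, a result file can be summarized with a short script. The `turns` and `passed` field names below are assumptions for illustration; match them to the actual output schema.

```python
import json

def per_turn_pass_rate(result_path: str) -> float:
    """Fraction of turns whose generated code passed execution.

    Assumes each entry carries a 'turns' list whose items have a
    boolean 'passed' field -- hypothetical field names, adjust to
    the real result schema.
    """
    with open(result_path, encoding="utf-8") as f:
        entries = json.load(f)
    total = passed = 0
    for entry in entries:
        for turn in entry.get("turns", []):
            total += 1
            passed += bool(turn.get("passed"))
    return passed / total if total else 0.0
```

Averaging over turns (rather than over problems) weights longer multi-turn tasks more heavily; a per-problem average is the obvious alternative if that is undesirable.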