CodeFlowBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) on multi-turn, dependency-aware, and iterative code generation tasks. Unlike traditional benchmarks that focus on single-function generation, CodeFlowBench tests a model's ability to maintain context, handle complex dependencies, and evolve code over multiple turns.
The benchmark consists of two subsets:
- CodeFlowBench-Comp (Competitive): focuses on complex competitive programming problems.
- CodeFlowBench-Repo: focuses on domain-specific, real-world programming problems drawn from GitHub repositories.
```
codeflowbench/
├── data/                    # Dataset files (JSON)
├── models/                  # Local model checkpoints (optional)
├── scripts/                 # Bash scripts for running evaluation
├── src/                     # Source code for inference and harness
│   ├── harness.py           # Evaluation logic
│   └── utils.py             # Utility functions
├── requirements.txt         # Core dependencies for all benchmarks
├── requirements_repo.txt    # Additional dependencies for the Repo benchmark
└── README.md
```
First, clone the repository and set up the Conda environment:
```bash
cd codeflowbench
conda create -n codeflowbench python=3.10
conda activate codeflowbench
```
Install the dependencies:
```bash
# For CodeFlowBench (standard evaluation)
pip install -r requirements.txt

# [Optional] For CodeFlowBench-Repo
# Installs additional libraries required for executing domain-specific code
pip install -r requirements_repo.txt
```
You can either use Hugging Face model paths directly or place your local model weights inside the `models/` folder.
- Example path: `models/Llama-3.1-8B-Instruct`
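To illustrate the "local path or hub ID" convention above, here is a minimal helper that prefers a checkpoint under `models/` and otherwise treats the argument as a Hugging Face hub ID. This is a hypothetical sketch; the benchmark's own scripts may resolve paths differently.

```python
import os

def resolve_model_path(name_or_path: str, models_dir: str = "models") -> str:
    """Resolve a model argument to a usable path or hub ID.

    Hypothetical helper: prefers an existing local directory, then a
    checkpoint placed under models/, and otherwise assumes the string
    is a Hugging Face hub ID.
    """
    if os.path.isdir(name_or_path):
        return name_or_path  # already a valid local path
    local = os.path.join(models_dir, os.path.basename(name_or_path))
    if os.path.isdir(local):
        return local  # checkpoint placed under models/
    return name_or_path  # assume it is a hub ID
```

For example, `resolve_model_path("Llama-3.1-8B-Instruct")` returns `models/Llama-3.1-8B-Instruct` when that folder exists, and passes the string through unchanged otherwise.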
Ensure the dataset files are located in the `./data` directory. It should typically contain:
- `codeflowbench_comp_test.json`
- `codeflowbench_repo.json`
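A quick way to sanity-check the dataset files before running evaluation is to load them and count the entries. The sketch below assumes each file is a JSON array of problem entries; the per-entry schema is not documented here, so adjust as needed.

```python
import json

def load_dataset(path: str) -> list:
    """Load a CodeFlowBench dataset file and verify its top-level shape.

    Assumes the file is a JSON array of problem entries (an assumption;
    check the actual release format).
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array in {path}")
    return data
```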
We provide convenient Bash scripts to automate the inference and evaluation process. The default scripts use Llama-3.1-8B-Instruct as an example.
Multi-turn Evaluation (Core): evaluates the model's ability to generate code iteratively with dependencies.

```bash
bash scripts/test_multi_turn.sh
```

Single-turn Evaluation: evaluates the model in a standard single-turn setting for comparison.

```bash
bash scripts/test_single_turn.sh
```
The process for CodeFlowBench-Repo is similar to the standard evaluation, with the following adjustments:
- Choose `harness_repo.py` and `codeflowbench_repo.json` in the bash script.
- Change the import in `inference.py` to `utils_(api)_repo`.
- Run the bash script.
The evaluation logs and final scores are saved in the `result` directory.
Filename format: `{model_name}_{mode}.json`
Example: `result/Llama-3.1-8B-Instruct_multi_turn.json`
Result Content: Each entry contains the generated code, execution logs, and the pass/fail status for each turn.
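Given per-turn pass/fail statuses, a result file can be summarized with a short script. The `turns` and `passed` field names below are assumptions for illustration; match them to the actual output schema.

```python
import json

def per_turn_pass_rate(result_path: str) -> float:
    """Fraction of turns whose generated code passed execution.

    Assumes each entry carries a 'turns' list whose items have a
    boolean 'passed' field -- hypothetical field names, adjust to
    the real result schema.
    """
    with open(result_path, encoding="utf-8") as f:
        entries = json.load(f)
    total = passed = 0
    for entry in entries:
        for turn in entry.get("turns", []):
            total += 1
            passed += bool(turn.get("passed"))
    return passed / total if total else 0.0
```

Averaging over turns (rather than over problems) weights longer multi-turn tasks more heavily; a per-problem average is the obvious alternative if that is undesirable.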