This repository contains the benchmark data for LiveRepoReflection, a project designed to evaluate the ability of Code Large Language Models (LLMs) to understand, modify, and reflect upon code within a multi-file repository context.
For the full evaluation framework, data generation pipeline, and research paper, please visit the main project repository: LiveRepoReflection/LiveRepoReflection-Project.
The benchmark data is organized by programming language. Each language directory contains a set of test cases, where each test case is a self-contained programming problem within its own directory.
We provide directory structure example for six programming language.
The directory structure is as follows:
.
├── cpp/
│ └── exercises/practice/
│ └── <uuid_id>/<test_case_id>/
│ ├── .docs/
│ ├── .meta/
│ ├── <source_files>.cpp
│ └── <test_files>.cpp
├── go/
│ └── exercises/practice/
│ └── <uuid_id>/<test_case_id>/
│ ├── .docs/
│ ├── .meta/
│ ├── <source_files>.go
│ └── <test_files>.go
├── java/
│ └── exercises/practice/
│ └── <uuid_id>/<test_case_id>/
│ └── src/
│ ├── main/java/
│ └── test/java/
├── javascript/
│ └── exercises/practice/
│ └── <uuid_id>/<test_case_id>/
│ ├── .docs/
│ ├── .meta/
│ ├── <source_files>.js
│ └── <test_files>.js
├── python/
│ └── exercises/practice/
│ └── <uuid_id>/<test_case_id>/
│ ├── .docs/
│ ├── .meta/
│ ├── <source_files>.py
│ └── <test_files>.py
└── rust/
└── exercises/practice/
└── <uuid_id>/<test_case_id>/
├── .docs/
├── .meta/
└── src/
├── lib.rs
└── main.rs
Each <test_case_id> directory represents a unique problem and contains:
- Source Files: One or more source code files (e.g.,
network_distance.py) that contain the code to be analyzed or modified by the LLM. - Test Files: Corresponding unit test files (e.g.,
network_distance_test.py) that define the correctness criteria for the task. The LLM's goal is to modify the source files so that these tests pass. - .docs/: (Optional) Directory containing documentation or hints related to the problem.
- .meta/: (Optional) Directory containing metadata about the test case.
To run the evaluation using this data, please use the automated framework provided in the LiveRepoReflection-Project repository. Please refer to the README.md in the main project repository for detailed instructions on running the benchmarks.
If you use LiveRepoReflection in your research, please cite our paper:
@misc{zhang2025LiveRepoReflection,
title={Turning the Tide: Repository-based Code Reflection},
author={Wei Zhang and Jian Yang and Jiaxi Yang and Ya Wang and Zhoujun Li and Zeyu Cui and Binyuan Hui and Junyang Lin},
year={2025},
eprint={2507.09866},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2507.09866},
}