- [2025-07-29]: NoCode-bench is now available on Hugging Face Datasets! You can access it here.
- [2025-07-18]: We have released NoCode-bench. Evaluate your SE agent here.
NoCode-bench is a benchmark designed to evaluate the ability of Large Language Models (LLMs) to perform no-code feature addition using natural language documentation as input. Unlike prior benchmarks that focus on bug fixing or general issue resolution, NoCode-bench targets a new paradigm where feature development is driven by documentation changes in real-world software projects.
- Instances: 634 real-world feature addition tasks across diverse GitHub projects
- Format: Each instance contains the documentation change, relevant context files, and a ground truth patch
- Subset: Includes a manually verified subset (NoCode-bench-Verified) for high-quality, human-validated evaluation
To access NoCode-bench, copy and run the following code:
from datasets import load_dataset
ncbench = load_dataset('NoCode-bench/NoCode-bench_Full', split='test')
ncbench_verified = load_dataset('NoCode-bench/NoCode-bench_Verified', split='test')
Follow these steps to set up the environment for NoCode-bench:
conda create -n ncb python=3.12
conda activate ncb
pip install -r requirements.txt
NoCode-bench enables reproducible evaluation via Docker. Build the base image (fb_base:dev) and the project images (fb_[repo]:dev) as follows:
cd environment
bash setup_all.sh
NoCode-bench also supports instance-level Docker images, which can be built using the following command:
export PYTHONPATH=$PYTHONPATH:$(pwd)
python environment/setup_instances_images.py \
--bench_tasks NoCode-bench/NoCode-bench_Verified \
--log_dir logs \
--max_workers 20
We also provide pre-built Docker images for NoCode-bench on Docker Hub. You can pull the repo-level or instance-level images using the following commands:
cd environment
bash pull_from_hub.sh # for repo level
python pull_instance_images.py --bench_tasks NoCode-bench/NoCode-bench_Verified # for instance level
Generate your prediction results in the following format so that they can be evaluated directly:
# Output Format
instances = [
{
'model_name_or_path': '...',
'instance_id': '...',
'model_patch': '...',
},
...
]
Evaluate patch predictions on NoCode-bench Verified with the following command:
export PYTHONPATH=$PYTHONPATH:$(pwd)
# Arguments, in order: predictions file, log directory, dataset name, number of workers,
# output file, cached image level, timeout in seconds, proxy (if needed).
python ./evaluation/eval.py \
--predictions_path ./all_preds.jsonl \
--log_dir ./evaluation/logs \
--bench_tasks NoCode-bench/NoCode-bench_Verified \
--max_workers 110 \
--output_file eval_result.txt \
--image_level repo \
--timeout 600 \
--proxy None
You can reproduce or extend NoCode-bench using our 5-step construction pipeline:
- Select high-quality, actively maintained GitHub repositories
cd repos/
sh collect.sh
- Parse release notes to identify real feature addition tasks
- Retrieve the corresponding PRs from GitHub
python construction/collection/collect_[repo].py
python construction/filter_attribute/attribute_filter.py
- All involved data and scripts are stored in the environment/ folder
- Include related modules, configuration, and dependencies
- Automatically filter out instances that cannot meet our criteria
python construction/filter_execution/execution.py
- Supplement missing but essential entity names in the task input
- Mask information that may cause data leakage
python construction/augmentation/augment.py
python construction/augmentation/mask_auto.py
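For the prediction format described in the evaluation instructions above, the following is a minimal sketch of how a predictions file could be written and read back. It assumes that `all_preds.jsonl` is a JSON Lines file with one prediction object per line, which the `.jsonl` extension suggests but which this README does not state explicitly. The helper names `write_predictions` and `read_predictions` and the example values are hypothetical and are not part of the NoCode-bench codebase.

```python
import json
from pathlib import Path

# The three keys listed in the output format of this README.
REQUIRED_KEYS = ("model_name_or_path", "instance_id", "model_patch")


def write_predictions(instances, path):
    """Write one JSON object per line (assumed JSON Lines layout)."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for inst in instances:
            # Fail early if a prediction lacks one of the required keys.
            missing = [key for key in REQUIRED_KEYS if key not in inst]
            if missing:
                raise ValueError(f"prediction is missing keys: {missing}")
            fh.write(json.dumps(inst) + "\n")
    return path


def read_predictions(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example = [
        {
            "model_name_or_path": "my-agent",
            "instance_id": "example-instance",
            "model_patch": "diff --git a/file.py b/file.py\n",
        }
    ]
    out = write_predictions(example, "all_preds.jsonl")
    print(f"wrote {len(read_predictions(out))} prediction(s) to {out}")
```

Checking the keys before writing surfaces malformed predictions before a long evaluation run starts rather than partway through it.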
