NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition

📰 News

[2025-07-29]: NoCode-bench is now available on Hugging Face Datasets! You can access it here.
[2025-07-18]: We have released NoCode-bench. Evaluate your SE agent here.

📦 Benchmark Overview

NoCode-bench is a benchmark designed to evaluate the ability of Large Language Models (LLMs) to perform no-code feature addition using natural language documentation as input. Unlike prior benchmarks that focus on bug fixing or general issue resolution, NoCode-bench targets a new paradigm where feature development is driven by documentation changes in real-world software projects.

Instances: 634 real-world feature addition tasks across diverse GitHub projects
Format: Each instance contains the documentation change, relevant context files, and a ground truth patch
Subset: Includes a manually verified subset (NoCode-bench-Verified) for high-quality, human-validated evaluation

To access NoCode-bench, copy and run the following code:

from datasets import load_dataset
ncbench = load_dataset('NoCode-bench/NoCode-bench_Full', split='test')
ncbench_verified = load_dataset('NoCode-bench/NoCode-bench_Verified', split='test')
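
Once loaded, each split can be iterated as a sequence of dictionary-like records. The following is a minimal sketch for looking instances up by identifier. It assumes every record carries an `instance_id` field (the key used in the prediction format later in this README), which is not confirmed here; the helper name `index_by_instance_id` is hypothetical.

```python
from typing import Any, Dict, Iterable, Mapping


def index_by_instance_id(
    records: Iterable[Mapping[str, Any]],
) -> Dict[str, Mapping[str, Any]]:
    """Map each record's `instance_id` (assumed field name) to the record."""
    index: Dict[str, Mapping[str, Any]] = {}
    for record in records:
        instance_id = record["instance_id"]
        if instance_id in index:
            # Duplicate identifiers would make lookups ambiguous.
            raise ValueError(f"duplicate instance_id: {instance_id}")
        index[instance_id] = record
    return index


if __name__ == "__main__":
    # Stand-in records; with the real data, pass `ncbench_verified` instead.
    demo = [{"instance_id": "demo-1"}, {"instance_id": "demo-2"}]
    print(sorted(index_by_instance_id(demo)))
```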

🚀 How to Use the Benchmark

Environment Setup

Follow these steps to set up the environment for NoCode-bench:

conda create -n ncb python=3.12
conda activate ncb
pip install -r requirements.txt

NoCode-bench uses Docker for reproducible evaluation. Build the base image (fb_base:dev) and the project images (fb_[repo]:dev) as follows:

cd environment
bash setup_all.sh
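
After the build, it can be useful to confirm which project images are present locally. The sketch below follows the fb_[repo]:dev naming pattern and checks each tag with `docker image inspect`. How the `[repo]` placeholder is spelled for a given project (case, separators) is determined by the setup scripts and is an assumption here; the helper names are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List


def project_image_tag(repo: str) -> str:
    """Return the project image tag following the fb_[repo]:dev pattern."""
    return f"fb_{repo}:dev"


def image_exists(tag: str) -> bool:
    """Return True if `docker image inspect` finds the tag locally."""
    try:
        result = subprocess.run(
            ["docker", "image", "inspect", tag],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except FileNotFoundError:
        # The docker CLI is not installed or not on PATH.
        return False
    return result.returncode == 0


def missing_images(
    repos: Iterable[str],
    exists: Callable[[str], bool] = image_exists,
) -> List[str]:
    """List the expected project image tags that are not available locally."""
    return [tag for tag in map(project_image_tag, repos) if not exists(tag)]
```

For example, `missing_images(["example_repo"])` returns `["fb_example_repo:dev"]` when that image has not been built; `example_repo` is a placeholder name.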

NoCode-bench also supports instance-level Docker images, which can be built using the following command:

export PYTHONPATH=$PYTHONPATH:$(pwd)
python environment/setup_instances_images.py \
   --bench_tasks NoCode-bench/NoCode-bench_Verified \
   --log_dir logs \
   --max_workers 20

We also provide pre-built Docker images for NoCode-bench on Docker Hub. You can pull them at either the repo level or the instance level using the following commands:

cd environment
bash pull_from_hub.sh # for repo level
python pull_instance_images.py --bench_tasks NoCode-bench/NoCode-bench_Verified  # for instance level

Evaluation

Generate your prediction results in the following format so that they can be evaluated directly:

# Output Format
instances = [
  {
    'model_name_or_path': '...',
    'instance_id': '...',
    'model_patch': '...',
  },
  ...
]
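
The evaluation command reads its predictions from `./all_preds.jsonl`. The following is a minimal sketch for serializing records with the three keys above. It assumes the file is expected in JSON Lines form (one JSON object per line), which the `.jsonl` extension suggests but which is not confirmed here; `write_predictions` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Dict, Iterable

REQUIRED_KEYS = ("model_name_or_path", "instance_id", "model_patch")


def write_predictions(predictions: Iterable[Dict[str, str]], path: str) -> int:
    """Write predictions as JSON Lines and return the number of records written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for prediction in predictions:
            # Fail early if a record lacks one of the keys from the output format.
            missing = [key for key in REQUIRED_KEYS if key not in prediction]
            if missing:
                raise KeyError(f"prediction is missing keys: {missing}")
            handle.write(json.dumps(prediction) + "\n")
            count += 1
    return count
```

With the `instances` list above, this would be called as `write_predictions(instances, "./all_preds.jsonl")`.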

Evaluate patch predictions on NoCode-bench-Verified with the following command:

export PYTHONPATH=$PYTHONPATH:$(pwd)
# Options: --predictions_path <path_to_your_predictions>, --log_dir <path_to_your_log_dir>,
# --bench_tasks <dataset_name>, --max_workers <number_of_workers>,
# --output_file <path_to_your_output_file>, --image_level <cache_image_level>,
# --timeout <timeout_in_seconds>, --proxy <proxy_if_needed>
# (A comment placed after a trailing backslash breaks the line continuation, so they are listed here.)
python ./evaluation/eval.py \
    --predictions_path ./all_preds.jsonl \
    --log_dir ./evaluation/logs \
    --bench_tasks NoCode-bench/NoCode-bench_Verified \
    --max_workers 110 \
    --output_file eval_result.txt \
    --image_level repo \
    --timeout 600 \
    --proxy None
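
The same invocation can also be assembled from Python, which avoids shell line-continuation pitfalls when the options are annotated. This is a sketch that uses only the flags and example values listed in the command; the wrapper functions `build_eval_command` and `run_eval` are hypothetical and not part of the repository, and using `sys.executable` assumes the `ncb` environment is active.

```python
import os
import subprocess
import sys
from typing import List, Optional


def build_eval_command(
    predictions_path: str = "./all_preds.jsonl",
    log_dir: str = "./evaluation/logs",
    bench_tasks: str = "NoCode-bench/NoCode-bench_Verified",
    max_workers: int = 110,
    output_file: str = "eval_result.txt",
    image_level: str = "repo",
    timeout: int = 600,
    proxy: Optional[str] = None,
) -> List[str]:
    """Return the argv list for evaluation/eval.py with the flags from the command."""
    return [
        sys.executable, "./evaluation/eval.py",
        "--predictions_path", predictions_path,
        "--log_dir", log_dir,
        "--bench_tasks", bench_tasks,
        "--max_workers", str(max_workers),
        "--output_file", output_file,
        "--image_level", image_level,
        "--timeout", str(timeout),
        "--proxy", str(proxy),  # str(None) == "None", matching the shell command
    ]


def run_eval(command: List[str]) -> None:
    """Run the command from the repository root with the root on PYTHONPATH."""
    env = dict(os.environ)
    parts = [p for p in (env.get("PYTHONPATH"), os.getcwd()) if p]
    env["PYTHONPATH"] = os.pathsep.join(parts)
    subprocess.run(command, check=True, env=env)
```

A typical call would be `run_eval(build_eval_command(max_workers=8))`, lowering the worker count on machines with fewer cores.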

🔧 How to Reconstruct the Benchmark

You can reproduce or extend NoCode-bench using our 5-step construction pipeline:

Step 1: Project Selection

Select high-quality, actively maintained GitHub repositories

cd repos/
sh collect.sh

Step 2: Instance Collection

Parse release notes to identify real feature addition tasks
Retrieve the corresponding PRs from GitHub

python construction/collection/collect_[repo].py
python construction/filter_attribute/attribute_filter.py
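
To illustrate the idea behind this step, the sketch below pulls PR numbers out of the feature section of a release-notes file. It is not the logic of the repository's collection scripts: the Markdown-style headings, the "Features" section name, and the `#1234` reference style are all assumptions about the release-note format, and `feature_pr_numbers` is a hypothetical helper.

```python
import re
from typing import List

# Assumed conventions: '#'-prefixed headings and '#<number>' PR references.
ANY_HEADING = re.compile(r"^\s*#+\s+\S")
FEATURE_HEADING = re.compile(r"^\s*#+\s+(?:new\s+)?features?\b", re.IGNORECASE)
PR_REF = re.compile(r"#(\d+)")


def feature_pr_numbers(release_notes: str) -> List[int]:
    """Collect PR numbers referenced under feature headings, in first-seen order."""
    in_features = False
    numbers: List[int] = []
    for line in release_notes.splitlines():
        if ANY_HEADING.match(line):
            # A new heading either opens or closes the feature section.
            in_features = bool(FEATURE_HEADING.match(line))
            continue
        if in_features:
            for match in PR_REF.finditer(line):
                number = int(match.group(1))
                if number not in numbers:
                    numbers.append(number)
    return numbers


if __name__ == "__main__":
    notes = "## Features\n- Add foo (#12)\n## Bug fixes\n- Fix baz (#56)\n"
    print(feature_pr_numbers(notes))
```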

Step 3: Environment Construction

All involved data and scripts are stored in the environment/ folder
Include related modules, configuration, and dependencies

Step 4: Instance Filtering

Automatically filter out instances that do not meet our criteria

python construction/filter_execution/execution.py

Step 5: Input Refinement

Supplement missing but essential entity names in the task input
Mask information that may cause data leakage

python construction/augmentation/augment.py
python construction/augmentation/mask_auto.py
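
As an illustration of the masking idea, the sketch below replaces GitHub URLs and `#<number>` references in a documentation change with a placeholder, since such references could point a model straight at the ground truth PR. Which details the repository's `mask_auto.py` actually masks is not stated here, so the patterns, the `[MASK]` placeholder, and the helper `mask_references` are assumptions.

```python
import re

# Assumed leakage sources: links into GitHub and issue/PR number references.
GITHUB_URL = re.compile(r"https?://github\.com/\S+")
ISSUE_REF = re.compile(r"(?<!\w)#\d+\b")


def mask_references(text: str, placeholder: str = "[MASK]") -> str:
    """Replace GitHub URLs and '#<number>' references with a placeholder."""
    # URLs go first so that numbers inside them are not masked separately.
    # Note: punctuation directly attached to a URL is swallowed with it.
    text = GITHUB_URL.sub(placeholder, text)
    return ISSUE_REF.sub(placeholder, text)


if __name__ == "__main__":
    sample = "See #123 and https://github.com/org/repo/pull/123 for details"
    print(mask_references(sample))
```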
