Language: English | Chinese: see README_zh.md
🎉 Accepted to CVPR 2026
Official implementation of the paper:
Exploring Spatial Intelligence from a Generative Perspective CVPR 2026 [Paper] [arXiv] [Project Page]
GSI-Bench evaluates the ability of generative models to understand and manipulate 3D spatial relationships in indoor scenes.
| Metric | Full Name | What It Measures |
|---|---|---|
| IC | Instruction Compliance | Does the output actually perform the requested spatial operation? |
| SA | Spatial Accuracy | Is the 3D displacement, rotation, or scale close to the ground-truth geometry? |
| AC | Appearance Consistency | Are object identity, category, and appearance preserved after editing? |
| EL | Edit Locality | Is the rest of the scene left untouched outside the intended region? |
If you only want to evaluate your model on GSI-Bench, go directly to Evaluation.
Steps 1 and 2 document how we constructed the benchmark data. They are open-sourced for transparency and reproducibility, but are not required for running evaluations.
GSI-Bench/
├── evaluation/ # Evaluation framework (IC / SA / EL / AC) ← start here
├── robothor/ # [Optional] Data generation pipeline 1: RoboTHOR indoor scenes
├── mesatask/ # [Optional] Data generation pipeline 2: MesaTask tabletop scenes
├── paper/ # Paper PDF
└── tests/ # Unit & integration tests
What you can reproduce:
| Goal | Needs | Doc |
|---|---|---|
| Run IC/SA/EL/AC on your edited images | Eval datasets + weights + eval/<model>/ layout | This section |
| Reproduce paper BAGEL × fine scores | Download bagel_example/ (~265MB, not in Git) | REPRODUCE_BAGEL_RESULTS.en.md |
| Regenerate BAGEL images | External BAGEL repo (not full stack here) | Same doc, Section 8 |
conda create -n gsi-eval python=3.10 -y
conda activate gsi-eval
cd evaluation
# Install PyTorch matching your CUDA version (example: CUDA 11.8)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
# Install mmcv with C++ ops
pip install -U openmim && mim install mmcv
# Install remaining dependencies
pip install -r requirements.txt
# Optional: build GroundingDINO for text-prompt detection
pip install -e ./src/groundingdino --no-build-isolation

Install the Hugging Face CLI first if needed:
pip install -U huggingface_hub

| Weight | Size | Source |
|---|---|---|
| other_exp_ckpt.pth (DetAny3D) | ~500MB | OpenDriveLab/DetAny3D |
| sam_vit_h_4b8939.pth (SAM ViT-H) | ~2.4GB | Meta AI |
| dinov2_vitl14_pretrain.pth (DINOv2) | ~1.1GB | Meta AI |
| groundingdino_swinb_cogcoor.pth (optional) | ~690MB | IDEA-Research |
| step_10000_nocfg/ema.safetensors (BAGEL fine-tune, optional generation checkpoint) | ~28GB | GSI-Bench/bagel_finetune_step_10000_nocfg |
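Before running `prepare_weights.sh`, it can help to confirm that the weight directory contains the files listed in the table. The following is a minimal sketch and not part of the repository: the helper name `missing_weights` is hypothetical, the filenames are copied from the table, and the required/optional split simply follows the "(optional)" notes there.

```python
from pathlib import Path

# Filenames copied from the weights table; the required/optional split
# follows the "(optional)" markers in that table.
REQUIRED_WEIGHTS = [
    "other_exp_ckpt.pth",          # DetAny3D
    "sam_vit_h_4b8939.pth",        # SAM ViT-H
    "dinov2_vitl14_pretrain.pth",  # DINOv2
]
OPTIONAL_WEIGHTS = ["groundingdino_swinb_cogcoor.pth"]


def missing_weights(weight_dir, names=REQUIRED_WEIGHTS):
    """Return the entries of `names` that are not regular files in `weight_dir`."""
    root = Path(weight_dir)
    return [name for name in names if not (root / name).is_file()]
```

Calling `missing_weights("<path_to_weight_directory>")` returns an empty list when all three required files are present; passing `OPTIONAL_WEIGHTS` as the second argument checks the GroundingDINO checkpoint instead.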
After the checkpoint file is present in the model repo, download it with:
hf download GSI-Bench/bagel_finetune_step_10000_nocfg \
step_10000_nocfg/ema.safetensors \
--local-dir bagel_finetune_step_10000_nocfg

Place all weights in one directory, then run:
bash prepare_weights.sh <path_to_weight_directory>
# Creates symlinks under checkpoints/ and GroundingDINO/weights/

Download the four GSI-Bench evaluation dataset archives from Hugging Face:
hf download GSI-Bench/GSI-Bench \
fine_dataset.zip mesatask_dataset.zip bathroom_dataset.zip robothor_dataset.zip \
--repo-type dataset --local-dir GSI-Bench

Then prepare the dataset links:
cd evaluation
bash prepare_datasets.sh ../GSI-Bench
# Creates symlinks: fine_dataset/ mesatask_dataset/ bathroom_dataset/ robothor_dataset/

The dataset page also includes a small preview split for the Hugging Face Dataset Viewer: https://huggingface.co/datasets/GSI-Bench/GSI-Bench.
Your model should produce edited images following the naming convention:
eval/<model_name>/generated_images_fine/<img_id>_edit_<query_id>.png
eval/<model_name>/generated_images_mesatask/<img_id>_edit_<query_id>.png
eval/<model_name>/generated_images_bathroom/<img_id>_edit_<query_id>.png
eval/<model_name>/generated_images_robothor/<img_id>_edit_<query_id>.png
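The naming convention above can be encoded in a small helper so that a generation script writes files where the evaluator expects them. This is a sketch under assumptions: `output_path` and `parse_name` are hypothetical names, not functions from this repository, and the exact format of `<img_id>` and `<query_id>` is not specified here, so the parser simply splits on the last `_edit_` in the filename.

```python
import re
from pathlib import Path

# Dataset suffixes taken from the four directory names listed above.
DATASETS = ("fine", "mesatask", "bathroom", "robothor")

# Assumption: ids contain no path separators. The greedy first group means
# the split happens at the LAST "_edit_" in the filename.
_NAME_RE = re.compile(r"^(?P<img_id>.+)_edit_(?P<query_id>[^/]+)\.png$")


def output_path(model_name, dataset, img_id, query_id, root="eval"):
    """Build eval/<model_name>/generated_images_<dataset>/<img_id>_edit_<query_id>.png."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    filename = f"{img_id}_edit_{query_id}.png"
    return Path(root) / model_name / f"generated_images_{dataset}" / filename


def parse_name(filename):
    """Return (img_id, query_id) for a conforming filename, else None."""
    match = _NAME_RE.match(filename)
    return (match["img_id"], match["query_id"]) if match else None
```

Running `parse_name` over the files in each `generated_images_*` directory before launching the evaluation is a cheap way to catch misnamed outputs.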
examples/inference.py is a reference skeleton for BAGEL-style outputs; full generation requires the BAGEL codebase. To reproduce our published BAGEL numbers without regenerating images, download bagel_example/ and follow evaluation/REPRODUCE_BAGEL_RESULTS.en.md.
cd evaluation
export PYTHONPATH=$PWD:$PYTHONPATH
# IC / SA / EL evaluation (iterates all models × all datasets)
bash eval.sh
# (Optional) MLLM-based AC scoring — requires vLLM + Qwen3-VL (see evaluation/requirements-mllm.txt)
cd mllm_eval
pip install -r ../requirements-mllm.txt # plus vllm for your CUDA build
bash eval_infer.sh <qwen3_vl_model_path> default <port>
# Writes predictions to mllm_eval/infer_results/
cd ..
# Aggregate all metrics into a final report
python -m eval.aggregate \
--root-dir ./eval \
--output-dir ./eval_results \
--mllm-eval-dir ./mllm_eval/infer_results
cd .. # back to repo root

Output: eval_results/ with per-model, per-dataset JSON files containing IC/SA/EL/AC scores.
See evaluation/eval/README.md for detailed input format and troubleshooting.
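To inspect the aggregated reports programmatically, the JSON files under `eval_results/` can be loaded into a single dictionary. The sketch below assumes only that the output directory contains `.json` files; it makes no assumption about their internal schema or directory layout, and `load_reports` is a hypothetical helper name rather than part of the evaluation package.

```python
import json
from pathlib import Path


def load_reports(results_dir="eval_results"):
    """Map each JSON file's path (relative to results_dir) to its parsed content."""
    root = Path(results_dir)
    return {
        path.relative_to(root).as_posix(): json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(root.rglob("*.json"))
    }
```

Printing the keys of each loaded report is a reasonable first step for discovering the actual structure; evaluation/eval/README.md remains the reference for the format.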
The following two pipelines document how we constructed the GSI-Bench data. They are not needed for evaluation — the evaluation datasets are provided as downloads above.
Environment:
conda create -n gsi-robothor python=3.10 -y
conda activate gsi-robothor
pip install -r robothor/requirements.txt
# Dependencies: ai2thor>=5.0.0, numpy, Pillow, matplotlib
# AI2-THOR downloads scene assets automatically on first run (~2GB)
# Requires: NVIDIA GPU + CloudRendering (headless) or X server (display)

Generate data:
cd robothor
# 1) Generate base views + camera-relative commands for ALL 60 training scenes
# Output: data/outputs/train/with_physics/
bash scripts/generate_train.sh
# 2) Generate additional command types (requires pregenerated views from step 1)
bash scripts/generate_train_object.sh # object-relative positioning
bash scripts/generate_train_rotate.sh # rotation commands
bash scripts/generate_train_receptacle.sh # receptacle placement
bash scripts/generate_train_spatial_remove.sh # spatial removal
bash scripts/generate_train_agent_camera.sh # agent camera movement
# 3) Generate validation data
bash scripts/generate_val_agent_camera.sh
cd .. # back to repo root

Output: data/outputs/{train,val}/ with JSONL records + RGB/depth/segmentation images per view per command.
Timing: ~2–5 min per scene depending on GPU. Full 60 scenes: several hours.
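The "several hours" figure follows directly from the per-scene estimate. As a quick worked check, assuming scenes are processed one after another and that the 2 to 5 minute figure covers a single pass over a scene (whether it includes every additional command-type script is not stated):

```python
def total_hours(num_scenes, minutes_per_scene):
    """Sequential wall-clock estimate in hours."""
    return num_scenes * minutes_per_scene / 60


low = total_hours(60, 2)   # 60 scenes at 2 min each
high = total_hours(60, 5)  # 60 scenes at 5 min each
print(f"{low:.0f} to {high:.0f} hours")  # prints "2 to 5 hours"
```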
See robothor/README.md for details.
Environment:
conda create -n gsi-mesatask python=3.10 -y
conda activate gsi-mesatask
pip install -r mesatask/requirement.txt
# For inference (optional): pip install torch torchvision
# For rendering (optional): download Blender 4.3+ from https://www.blender.org/download/
# For physical optimization (optional): conda install -c conda-forge drake

Download the MesaTask-10K dataset:
cd mesatask
git lfs install
git clone https://huggingface.co/datasets/InternRobotics/MesaTask-10K MesaTask-10K
# Prepare asset library (from dataset archives)
cd MesaTask-10K/Assets_library_archive
cat Assets_library_backup.tar.gz.* > Assets_library_merged.tar.gz
tar -xzvf Assets_library_merged.tar.gz -C ../Assets_library/
cd ../../.. # back to repo root (the next block starts with `cd mesatask`)

Generate data:
cd mesatask
# 1) Generate atomic transforms (move, rotate, scale)
python generate_atomic_transforms.py \
--input-dir MesaTask-10K/Layout_info \
--asset-annotation MesaTask-10K/Asset_annotation.json \
--output-dir transformed_layouts \
--num-variants 10 --seed 42
# 2) Render all layouts (requires Blender)
python dataset/vis_batch.py transformed_layouts \
--output_dir dataset/vis_final --parallel 4
# 3) Assemble image-editing dataset
python organize_image_editing_dataset.py \
--transformed-dir transformed_layouts \
--vis-dir dataset/vis_final \
--output-dir dataset/image_editing_dataset
cd .. # back to repo root

Timing: Step 1 takes ~10 min for 10K scenes. Step 2 (rendering) depends on machine and parallelism.
See mesatask/README.md for details.
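For readers unfamiliar with the term, the three atomic transforms named in step 1 (move, rotate, scale) and the role of `--num-variants` and `--seed` can be illustrated with a toy example. This is not the logic of `generate_atomic_transforms.py`: the `Pose` class, the function names, and the sampling ranges are all hypothetical and chosen only for illustration. The real script operates on the MesaTask-10K layout files, whose schema is documented in mesatask/README.md.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pose:
    """Hypothetical stand-in for one object's entry in a layout."""
    position: tuple  # (x, y, z)
    yaw_deg: float   # rotation about the vertical axis, in degrees
    scale: float     # uniform scale factor


def move(pose, dx, dy, dz=0.0):
    x, y, z = pose.position
    return replace(pose, position=(x + dx, y + dy, z + dz))


def rotate(pose, delta_deg):
    return replace(pose, yaw_deg=(pose.yaw_deg + delta_deg) % 360.0)


def rescale(pose, factor):
    if factor <= 0:
        raise ValueError("scale factor must be positive")
    return replace(pose, scale=pose.scale * factor)


def sample_variants(pose, num_variants, seed):
    """Apply one randomly chosen atomic transform per variant.

    The numeric ranges below are arbitrary illustration values.
    """
    rng = random.Random(seed)  # fixed seed makes the variants reproducible
    ops = [
        lambda p: move(p, rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)),
        lambda p: rotate(p, rng.choice([90.0, 180.0, 270.0])),
        lambda p: rescale(p, rng.choice([0.8, 1.25])),
    ]
    return [rng.choice(ops)(pose) for _ in range(num_variants)]
```

With a fixed seed, `sample_variants(pose, 10, 42)` returns the same ten variants on every run, which is the property a `--seed` option is typically meant to provide.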
git clone <this-repo-url> GSI-Bench && cd GSI-Bench
# Run tests (no GPU or data needed)
pip install pytest
python -m pytest tests/ -v # 43 tests should pass

| Component | Python | GPU | Conda Env |
|---|---|---|---|
| tests/ | 3.8+ | No | any |
| evaluation/ | 3.10 | NVIDIA (DetAny3D) | gsi-eval |
| robothor/ | 3.10 | NVIDIA (CloudRendering) | gsi-robothor |
| mesatask/ | 3.10 | Optional | gsi-mesatask |
@article{zhu2026exploring,
title={Exploring Spatial Intelligence from a Generative Perspective},
author={Zhu, Muzhi and Jiang, Shunyao and Zheng, Huanyi and Luo, Zekai and Zhong, Hao and Li, Anzhou and Wang, Kaijun and Rong, Jintao and Liu, Yang and Chen, Hao and Lin, Tao and Shen, Chunhua},
journal={arXiv preprint arXiv:2604.20570},
year={2026}
}

GSI-Bench is released under the MIT License; see LICENSE.
Subdirectories containing code derived from third-party projects retain their own licenses: