GSI-Bench: Exploring Spatial Intelligence from a Generative Perspective

Language: English | Chinese: see README_zh.md

🎉 Accepted to CVPR 2026

Official implementation of the paper:

Exploring Spatial Intelligence from a Generative Perspective CVPR 2026 [Paper] [arXiv] [Project Page]

GSI-Bench evaluates the ability of generative models to understand and manipulate 3D spatial relationships in indoor scenes.

Metric	Full Name	What It Measures
IC	Instruction Compliance	Does the output actually perform the requested spatial operation?
SA	Spatial Accuracy	Is the 3D displacement, rotation, or scale close to the ground-truth geometry?
AC	Appearance Consistency	Are object identity, category, and appearance preserved after editing?
EL	Edit Locality	Is the rest of the scene left untouched outside the intended region?

Quick Navigation

If you only want to evaluate your model on GSI-Bench, go directly to Evaluation.

The two data generation pipelines (RoboTHOR and MesaTask) document how we constructed the benchmark data. They are open-sourced for transparency and reproducibility, but are not required for running evaluations.

GSI-Bench/
├── evaluation/     # Evaluation framework (IC / SA / EL / AC)  ← start here
├── robothor/       # [Optional] Data generation pipeline 1: RoboTHOR indoor scenes
├── mesatask/       # [Optional] Data generation pipeline 2: MesaTask tabletop scenes
├── paper/          # Paper PDF
└── tests/          # Unit & integration tests

Evaluation

What you can reproduce:

Goal	Needs	Doc
Run IC/SA/EL/AC on your edited images	Eval datasets + weights + `eval/<model_name>/` layout	This section
Reproduce the paper's BAGEL scores on the fine dataset	Download `bagel_example/` (~265 MB, not in Git)	`REPRODUCE_BAGEL_RESULTS.en.md`
Regenerate BAGEL images	External BAGEL repo (the full generation stack is not included here)	Same doc, Section 8

1. Environment Setup

conda create -n gsi-eval python=3.10 -y
conda activate gsi-eval

cd evaluation

# Install PyTorch matching your CUDA version (example: CUDA 11.8)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Install mmcv with C++ ops
pip install -U openmim && mim install mmcv

# Install remaining dependencies
pip install -r requirements.txt

# Optional: build GroundingDINO for text-prompt detection
pip install -e ./src/groundingdino --no-build-isolation

2. Download Model Weights

Install the Hugging Face CLI first if needed:

pip install -U huggingface_hub

Weight	Size	Source
`other_exp_ckpt.pth` (DetAny3D)	~500MB	OpenDriveLab/DetAny3D
`sam_vit_h_4b8939.pth` (SAM ViT-H)	~2.4GB	Meta AI
`dinov2_vitl14_pretrain.pth` (DINOv2)	~1.1GB	Meta AI
`groundingdino_swinb_cogcoor.pth` (optional)	~690MB	IDEA-Research
`step_10000_nocfg/ema.safetensors` (BAGEL fine-tune, optional generation checkpoint)	~28GB	GSI-Bench/bagel_finetune_step_10000_nocfg

Once the checkpoint file has been published to the model repo, download it with:

hf download GSI-Bench/bagel_finetune_step_10000_nocfg \
  step_10000_nocfg/ema.safetensors \
  --local-dir bagel_finetune_step_10000_nocfg

Place all weights in one directory, then run:

bash prepare_weights.sh <path_to_weight_directory>
# Creates symlinks under checkpoints/ and GroundingDINO/weights/
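
Before running prepare_weights.sh, it can help to confirm that the required files are actually in the directory. The following is a minimal sketch, not part of the repository: the file names are copied from the table above, the optional weights are left out, and the helper name `missing_weights` is hypothetical. It assumes the files sit directly in the directory rather than in subfolders.

```python
from pathlib import Path

# File names copied from the weights table; optional entries are omitted.
REQUIRED_WEIGHTS = [
    "other_exp_ckpt.pth",          # DetAny3D
    "sam_vit_h_4b8939.pth",        # SAM ViT-H
    "dinov2_vitl14_pretrain.pth",  # DINOv2
]


def missing_weights(weight_dir, required=REQUIRED_WEIGHTS):
    """Return the required weight file names not found directly in weight_dir."""
    root = Path(weight_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_weights("<path_to_weight_directory>")` returns an empty list when all three required files are present.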

3. Download Evaluation Datasets

Download the four GSI-Bench evaluation dataset archives from Hugging Face:

hf download GSI-Bench/GSI-Bench \
  fine_dataset.zip mesatask_dataset.zip bathroom_dataset.zip robothor_dataset.zip \
  --repo-type dataset --local-dir GSI-Bench

Then prepare the dataset links:

cd evaluation
bash prepare_datasets.sh ../GSI-Bench
# Creates symlinks: fine_dataset/  mesatask_dataset/  bathroom_dataset/  robothor_dataset/

The dataset page also includes a small preview split for the Hugging Face Dataset Viewer: https://huggingface.co/datasets/GSI-Bench/GSI-Bench.

4. Generate Edited Images with Your Model

Your model should produce edited images following the naming convention:

eval/<model_name>/generated_images_fine/<img_id>_edit_<query_id>.png
eval/<model_name>/generated_images_mesatask/<img_id>_edit_<query_id>.png
eval/<model_name>/generated_images_bathroom/<img_id>_edit_<query_id>.png
eval/<model_name>/generated_images_robothor/<img_id>_edit_<query_id>.png
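
The naming convention above can be sketched as a pair of helpers for building and parsing output paths. This is an illustrative sketch only: `output_path` and `parse_output_name` are hypothetical names, not functions from the evaluation code, and the format of `img_id` and `query_id` is assumed to be whatever strings the datasets provide.

```python
from pathlib import Path

# Dataset suffixes taken from the four generated_images_* folders above.
DATASETS = ("fine", "mesatask", "bathroom", "robothor")


def output_path(model_name, dataset, img_id, query_id, root="eval"):
    """Build the path where an edited image is expected for one query."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    folder = Path(root) / model_name / f"generated_images_{dataset}"
    return folder / f"{img_id}_edit_{query_id}.png"


def parse_output_name(filename):
    """Split '<img_id>_edit_<query_id>.png' back into (img_id, query_id)."""
    stem = Path(filename).stem
    # Assumption: split on the last '_edit_' so an img_id containing it survives.
    img_id, sep, query_id = stem.rpartition("_edit_")
    if not sep:
        raise ValueError(f"not an edited-image name: {filename!r}")
    return img_id, query_id
```

For example, `output_path("my_model", "fine", "0001", "3")` yields `eval/my_model/generated_images_fine/0001_edit_3.png`, where the model name and ids are made up for illustration.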

examples/inference.py is a reference skeleton for BAGEL-style outputs; full generation requires the BAGEL codebase. To reproduce our published BAGEL numbers without regenerating images, download bagel_example/ and follow evaluation/REPRODUCE_BAGEL_RESULTS.en.md.

5. Run Evaluation

cd evaluation
export PYTHONPATH=$PWD:$PYTHONPATH

# IC / SA / EL evaluation (iterates all models × all datasets)
bash eval.sh

# (Optional) MLLM-based AC scoring — requires vLLM + Qwen3-VL (see evaluation/requirements-mllm.txt)
cd mllm_eval
pip install -r ../requirements-mllm.txt   # plus vllm for your CUDA build
bash eval_infer.sh <qwen3_vl_model_path> default <port>
# Writes predictions to mllm_eval/infer_results/
cd ..

# Aggregate all metrics into a final report
python -m eval.aggregate \
  --root-dir ./eval \
  --output-dir ./eval_results \
  --mllm-eval-dir ./mllm_eval/infer_results

cd ..   # back to repo root

Output: eval_results/ with per-model, per-dataset JSON files containing IC/SA/EL/AC scores.
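
To inspect the reports programmatically, a small loader along these lines may be enough. It is a sketch under assumptions: the JSON schema is not documented in this section, so the code only collects each file's parsed content without interpreting any keys, and `load_reports` is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_reports(results_dir="eval_results"):
    """Load every JSON file under results_dir, keyed by its relative path."""
    root = Path(results_dir)
    reports = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            reports[path.relative_to(root).as_posix()] = json.load(handle)
    return reports
```

Iterating over `load_reports().items()` then gives one (relative path, parsed JSON) pair per report file.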

See evaluation/eval/README.md for detailed input format and troubleshooting.

Data Generation Pipelines (Optional)

The following two pipelines document how we constructed the GSI-Bench data. They are not needed for evaluation — the evaluation datasets are provided as downloads above.

Pipeline 1: RoboTHOR Indoor Scenes

Environment:

conda create -n gsi-robothor python=3.10 -y
conda activate gsi-robothor
pip install -r robothor/requirements.txt
# Dependencies: ai2thor>=5.0.0, numpy, Pillow, matplotlib
# AI2-THOR downloads scene assets automatically on first run (~2GB)
# Requires: NVIDIA GPU + CloudRendering (headless) or X server (display)

Generate data:

cd robothor

# 1) Generate base views + camera-relative commands for ALL 60 training scenes
#    Output: data/outputs/train/with_physics/
bash scripts/generate_train.sh

# 2) Generate additional command types (requires pregenerated views from step 1)
bash scripts/generate_train_object.sh          # object-relative positioning
bash scripts/generate_train_rotate.sh           # rotation commands
bash scripts/generate_train_receptacle.sh       # receptacle placement
bash scripts/generate_train_spatial_remove.sh    # spatial removal
bash scripts/generate_train_agent_camera.sh      # agent camera movement

# 3) Generate validation data
bash scripts/generate_val_agent_camera.sh

cd ..   # back to repo root

Output: data/outputs/{train,val}/ with JSONL records + RGB/depth/segmentation images per view per command.
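
A quick way to sanity-check the generated records is to count them per file. The sketch below assumes the records are stored in files with a `.jsonl` extension, one JSON object per line; the record fields are not described here, so the code does not read any of them. `read_jsonl` and `count_records` are hypothetical helper names.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(output_dir):
    """Count records in every .jsonl file found under output_dir."""
    root = Path(output_dir)
    return {
        path.relative_to(root).as_posix(): sum(1 for _ in read_jsonl(path))
        for path in sorted(root.rglob("*.jsonl"))
    }
```

For instance, `count_records("data/outputs/train")` returns a dictionary mapping each JSONL file to its number of records.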

Timing: ~2–5 min per scene depending on GPU. Full 60 scenes: roughly 2–5 hours at that rate.

See robothor/README.md for details.

Pipeline 2: MesaTask Tabletop Scenes

Environment:

conda create -n gsi-mesatask python=3.10 -y
conda activate gsi-mesatask
pip install -r mesatask/requirement.txt
# For inference (optional): pip install torch torchvision
# For rendering (optional): download Blender 4.3+ from https://www.blender.org/download/
# For physical optimization (optional): conda install -c conda-forge drake

Download MesaTask-10K dataset:

cd mesatask
git lfs install
git clone https://huggingface.co/datasets/InternRobotics/MesaTask-10K MesaTask-10K

# Prepare asset library (from dataset archives)
cd MesaTask-10K/Assets_library_archive
cat Assets_library_backup.tar.gz.* > Assets_library_merged.tar.gz
mkdir -p ../Assets_library   # tar -C needs the target directory to exist
tar -xzvf Assets_library_merged.tar.gz -C ../Assets_library/
cd ../../..   # back to repo root

Generate data:

cd mesatask

# 1) Generate atomic transforms (move, rotate, scale)
python generate_atomic_transforms.py \
  --input-dir MesaTask-10K/Layout_info \
  --asset-annotation MesaTask-10K/Asset_annotation.json \
  --output-dir transformed_layouts \
  --num-variants 10 --seed 42

# 2) Render all layouts (requires Blender)
python dataset/vis_batch.py transformed_layouts \
  --output_dir dataset/vis_final --parallel 4

# 3) Assemble image-editing dataset
python organize_image_editing_dataset.py \
  --transformed-dir transformed_layouts \
  --vis-dir dataset/vis_final \
  --output-dir dataset/image_editing_dataset

cd ..   # back to repo root

Timing: Step 1 takes ~10 min for 10K scenes. Step 2 (rendering) depends on machine and parallelism.
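
To make the three atomic transforms in step 1 concrete, here is a toy sketch of move, rotate, and scale applied to a single object. It is not the logic of generate_atomic_transforms.py: the `TableObject` fields and the three function names are hypothetical, and the real MesaTask layout schema may differ (for example, it may use full 3D poses).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TableObject:
    """Hypothetical layout entry; the real MesaTask schema may differ."""
    name: str
    x: float
    y: float
    yaw_deg: float = 0.0
    scale: float = 1.0


def move_object(obj, dx, dy):
    """Return a copy translated by (dx, dy) on the table plane."""
    return replace(obj, x=obj.x + dx, y=obj.y + dy)


def rotate_object(obj, delta_deg):
    """Return a copy rotated about the vertical axis, wrapped to [0, 360)."""
    return replace(obj, yaw_deg=(obj.yaw_deg + delta_deg) % 360.0)


def scale_object(obj, factor):
    """Return a copy uniformly scaled by a positive factor."""
    if factor <= 0:
        raise ValueError("scale factor must be positive")
    return replace(obj, scale=obj.scale * factor)
```

Each function returns a new object and leaves the input untouched, which keeps the original layout available as the "before" state of an editing pair.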

See mesatask/README.md for details.

Verify the Repo

git clone <this-repo-url> GSI-Bench && cd GSI-Bench

# Run tests (no GPU or data needed)
pip install pytest
python -m pytest tests/ -v    # 43 tests should pass

Environment Requirements Summary

Component	Python	GPU	Conda Env
tests/	3.8+	No	any
evaluation/	3.10	NVIDIA (DetAny3D)	`gsi-eval`
robothor/	3.10	NVIDIA (CloudRendering)	`gsi-robothor`
mesatask/	3.10	Optional	`gsi-mesatask`

Citation

@article{zhu2026exploring,
  title={Exploring Spatial Intelligence from a Generative Perspective},
  author={Zhu, Muzhi and Jiang, Shunyao and Zheng, Huanyi and Luo, Zekai and Zhong, Hao and Li, Anzhou and Wang, Kaijun and Rong, Jintao and Liu, Yang and Chen, Hao and Lin, Tao and Shen, Chunhua},
  journal={arXiv preprint arXiv:2604.20570},
  year={2026}
}

License

GSI-Bench is released under the MIT License — see LICENSE.

Subdirectories containing code derived from third-party projects retain their own licenses.
