Official code release for Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs.
Fangrui Zhu*, Hanhui Wang*, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, Huaizu Jiang
*Equal Contribution
📑 Paper (arXiv)
Dataset and Models
- We propose a perception-guided 2D prompting strategy, Struct2D Prompting, and conduct a detailed zero-shot analysis that reveals MLLMs' ability to perform 3D spatial reasoning from structured 2D inputs alone (see the illustrative sketch after this list).
- We introduce Struct2D-Set, a large-scale instructional tuning dataset with automatically generated, fine-grained QA pairs covering eight spatial reasoning categories grounded in 3D scenes.
- We fine-tune an open-source MLLM to achieve competitive performance across several spatial reasoning benchmarks, validating the real-world applicability of our framework.
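
To make the prompting idea concrete, the sketch below assembles a structured 2D prompt from a bird's-eye-view (BEV) render plus per-object metadata (mark id, category, 3D center) and packages it as a multimodal chat message. This is a minimal illustration only: the helper names, message schema, and scene data are hypothetical and do not reflect the repository's actual API; see the paper and code for the real pipeline.

```python
# Illustrative sketch of Struct2D-style prompting (hypothetical helpers, not the repo's API).
import base64
from pathlib import Path


def encode_image(path: str) -> str:
    """Return a data URI for the image, or the raw path if the file is absent."""
    p = Path(path)
    if not p.exists():
        return path  # placeholder so the sketch runs without a real BEV render
    return "data:image/png;base64," + base64.b64encode(p.read_bytes()).decode("utf-8")


def build_struct2d_prompt(bev_image_path: str, objects: list[dict], question: str) -> list[dict]:
    """Compose a chat-style message: BEV image + object-centric text cues + the question."""
    object_lines = [
        f"[{o['mark_id']}] {o['category']} at "
        f"(x={o['center'][0]:.2f}, y={o['center'][1]:.2f}, z={o['center'][2]:.2f})"
        for o in objects
    ]
    text = (
        "The image is a bird's-eye-view render of the scene with numbered object marks.\n"
        "Object metadata (mark id, category, 3D center in meters):\n"
        + "\n".join(object_lines)
        + f"\n\nQuestion: {question}"
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": encode_image(bev_image_path)}},
        ],
    }]


# Example with made-up scene data; the resulting messages can be sent to any
# chat-completion-style MLLM endpoint that accepts interleaved text and images.
messages = build_struct2d_prompt(
    "scene0000_bev.png",
    objects=[
        {"mark_id": 1, "category": "sofa", "center": (1.2, 0.4, 0.3)},
        {"mark_id": 2, "category": "table", "center": (2.0, 1.1, 0.4)},
    ],
    question="Is the table to the left of the sofa from the camera's viewpoint?",
)
print(messages[0]["content"][0]["text"])
```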
```bash
conda create -n struct2d python=3.10 -y
conda activate struct2d
git clone git@github.com:neu-vi/struct2d.git
cd struct2d
pip install -e ".[torch,metrics]" --no-build-isolation
```

If you find Struct2D helpful in your research, please consider citing:
```bibtex
@article{zhu2025struct2d,
  title={Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs},
  author={Zhu, Fangrui and Wang, Hanhui and Xie, Yiming and Gu, Jing and Ding, Tianye and Yang, Jianwei and Jiang, Huaizu},
  journal={arXiv preprint arXiv:2506.04220},
  year={2025}
}
```

We thank the authors of GPT4Scene and LLaMA-Factory for inspiring discussions and for open-sourcing their codebases.
