Fang Liu* Yuhao Liu* Ke Xu† Gerhard Petrus Hancke Rynson W.H. Lau†
City University of Hong Kong
*Equal Contribution †Corresponding Authors
Unlike previous methods, which either achieve cross-scene generalization by being bound to a predefined vocabulary or handle free-form language by overfitting to individual scenes, GenSplat is robust to free-form language queries and generalizes across 3DGS scene representations. Our key insight is to formulate a structured learning process that progressively aligns linguistic concepts with 3D Gaussians. The method makes two novel technical contributions: (1) a Progressive Language Grounding Curriculum that guides the model from category-level semantics to instance-level concepts and then to free-form language, preventing overfitting by building a generalizable language feature space; and (2) an MLLM-guided Reasoning Module that leverages the semantic and spatial priors of Multi-modal Large Language Models to enhance 3D localization and reasoning. Extensive cross-task evaluations, including 3D referring segmentation, 3D visual question answering, and 3D open-vocabulary understanding, demonstrate state-of-the-art performance and strong generalization capability.
- Release inference code
- Release training code
- Release pre-trained model weights
- Release evaluation scripts
Code will be released soon. Stay tuned!
If you find this work useful, please consider citing: