✨ VGGDrive: Empowering Vision-Language Models ✨
with Cross-View Geometric Grounding for Autonomous Driving
- [2026/03/09] 🌐 Project page is live: demo
- [2026/02/26] 🚀 Released VGGDrive NAVSIM v1 weights and inference code.
- [2026/02/24] 👉 We released our paper on arXiv.
- [2026/02/21] 🎉🎉🎉 Accepted to CVPR 2026.
🧩 Conventional VLMs in autonomous driving “understand language but lack geometric insight.” Even when augmented with constructed Q&A data for auxiliary training, such approaches provide only superficial improvements and fail to address the core limitation in cross-view 3D spatial understanding.
💡 VGGDrive moves beyond data-level fixes and instead upgrades the model's capabilities directly. It introduces a mature 3D foundation model as a geometric backbone for VLMs, establishing a technical paradigm that gives Vision-Language Agents (VLAs) 3D modeling capability and offers a scalable, sustainable path for improving autonomous driving systems.
🛠️ The core innovation lies in the design of a plug-and-play Cross-View Geometric Enabler (CVGE). Through a hierarchical adaptive injection mechanism, VGGDrive achieves deep coupling between a frozen 3D foundation model and a VLM without altering the original VLM architecture. This mechanism efficiently injects 3D geometric features into the model, enabling genuine cross-view 3D geometric modeling capability for autonomous driving VLAs.
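To make the idea of injecting frozen 3D features into a VLM more concrete, here is a minimal PyTorch sketch of a generic gated cross-attention adapter. It is not the CVGE implementation from this repository: the class name `GatedGeometryInjector`, the helper `build_injectors`, the layer indices, and all dimensions are hypothetical and chosen only for illustration. The sketch assumes that VLM hidden states attend to 3D features through cross-attention, and that a zero-initialized gate leaves the VLM unchanged at the start of training.

```python
import torch
import torch.nn as nn


class GatedGeometryInjector(nn.Module):
    """Hypothetical adapter: VLM tokens attend to frozen 3D geometric features."""

    def __init__(self, hidden_dim: int, geo_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        # Queries come from the VLM; keys and values come from the 3D model.
        self.attn = nn.MultiheadAttention(
            hidden_dim, num_heads, kdim=geo_dim, vdim=geo_dim, batch_first=True
        )
        # Zero-initialized gate: tanh(0) = 0, so the adapter starts as identity.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, n_tokens, hidden_dim); geo: (batch, n_geo, geo_dim)
        injected, _ = self.attn(self.norm(hidden), geo, geo, need_weights=False)
        return hidden + torch.tanh(self.gate) * injected


def build_injectors(layer_ids, hidden_dim: int, geo_dim: int) -> nn.ModuleDict:
    """One injector per selected VLM layer (the layer choice is an assumption)."""
    return nn.ModuleDict(
        {str(i): GatedGeometryInjector(hidden_dim, geo_dim) for i in layer_ids}
    )


if __name__ == "__main__":
    injectors = build_injectors([4, 8, 12], hidden_dim=64, geo_dim=32)
    hidden = torch.randn(2, 10, 64)        # stand-in for VLM hidden states
    geo = torch.randn(2, 50, 32).detach()  # stand-in for frozen 3D features
    out = injectors["8"](hidden, geo)
    print(out.shape)
```

Because the injectors sit beside the VLM layers rather than inside them, the original VLM architecture is untouched, which is the plug-and-play property described above. Which layers receive injection and how the gate is parameterized are design choices this sketch does not take from the paper.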
| Model | Dataset | Download | Qwen_json |
|---|---|---|---|
| VGGDrive | NAVSIM | ckpt | train & test |
| VGGDrive | NuInstruct | | train & test |
| VGGDrive | DriveLM | submission.json | train & test |
| VGGDrive | OmniDrive | | train & test |
| VGGDrive | NuScenes | | train & test |
⚠️ Prerequisite: Please download the pretrained VGGT model weights (model.pt) from vggt and place them in the ./vggt folder.
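A quick way to confirm the prerequisite before launching anything is a small check like the one below. This is only a sketch: the function name `check_vggt_weights` is hypothetical and not part of this repository, and it assumes the weights live at `vggt/model.pt` relative to the repository root, as described above.

```python
from pathlib import Path


def check_vggt_weights(repo_root: str = ".") -> Path:
    """Return the path to vggt/model.pt, or raise if it is missing or empty."""
    weights = Path(repo_root) / "vggt" / "model.pt"
    if not weights.is_file():
        raise FileNotFoundError(
            f"Expected VGGT weights at {weights}; download model.pt first."
        )
    if weights.stat().st_size == 0:
        raise ValueError(f"{weights} is empty; the download may have failed.")
    return weights


if __name__ == "__main__":
    try:
        print(f"Found VGGT weights: {check_vggt_weights()}")
    except (FileNotFoundError, ValueError) as err:
        print(f"Prerequisite not met: {err}")
```

Run it from the repository root; it only inspects the file system and does not load the checkpoint.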
We recommend using Python 3.10+ and CUDA 12.x. The core dependencies used in this project include:
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install transformers==4.49.0 accelerate==1.3.0 datasets==3.2.0
pip install vllm==0.7.1 flash-attn==2.7.4.post1 xformers==0.0.28.post3
pip install timm==1.0.14 peft==0.14.0 bitsandbytes==0.45.2
pip install opencv-python==4.11.0.86 pillow==11.1.0
pip install numpy==1.26.4 pandas==2.2.3 scipy==1.15.2 scikit-learn==1.6.1
pip install nuscenes-devkit==1.1.11 pyquaternion==0.9.9 shapely==1.8.5.post1
Alternatively, install all dependencies from the provided environment file:
pip install -r requirements.txt
Run the NAVSIM inference script to generate prediction results:
bash run_scripts/inference_navsim.sh
Please follow the official NAVSIM v1.1 evaluation protocol (NAVSIM-v1.1). NAVSIM evaluates end-to-end driving performance with simulation-based metrics such as progress and time-to-collision under a non-reactive simulation setting. For detailed setup, dataset preparation, and metric computation, please refer to the official NAVSIM repository.
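For orientation, the sketch below shows how the NAVSIM v1 sub-metrics are commonly combined into a single PDM score (PDMS) for one scenario. It assumes the aggregation described in the NAVSIM paper: no-collision (NC) and drivable-area compliance (DAC) act as multiplicative penalties, while ego progress (EP), time-to-collision (TTC), and comfort (C) are averaged with weights 5, 5, and 2. The function name `pdm_score` is hypothetical, and the official NAVSIM repository remains the authoritative source for the metric.

```python
def pdm_score(nc: float, dac: float, ep: float, ttc: float, comfort: float) -> float:
    """Combine per-scenario sub-scores (each in [0, 1]) into one PDM score.

    Assumed formula: PDMS = NC * DAC * (5*EP + 5*TTC + 2*C) / 12.
    """
    scores = {"nc": nc, "dac": dac, "ep": ep, "ttc": ttc, "comfort": comfort}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Weighted average of the soft terms; weights sum to 12.
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    # Hard penalties: a collision or leaving the drivable area scales the score.
    return nc * dac * weighted


if __name__ == "__main__":
    # Example scenario: no violations, 80% progress relative to the reference.
    print(round(pdm_score(nc=1.0, dac=1.0, ep=0.8, ttc=1.0, comfort=1.0), 4))
```

The benchmark reports the average of this per-scenario score over all evaluation scenarios; use the official evaluation scripts to produce numbers that are comparable with published results.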
@article{wang2026vggdrive,
title={VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving},
author={Wang, Jie and Li, Guang and Huang, Zhijian and Dang, Chenxu and Ye, Hangjun and Han, Yahong and Chen, Long},
journal={arXiv preprint arXiv:2602.20794},
year={2026}
}
