GitHub - WJ-CV/VGGDrive: [CVPR 2026] VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving

✨ VGGDrive: Empowering Vision-Language Models ✨
with Cross-View Geometric Grounding for Autonomous Driving

📢 News

[2026/03/09] 🌐 Project page is live: demo
[2026/02/26] 🚀 Released VGGDrive NAVSIM v1 weights and inference code.
[2026/02/24] 👉 We released our paper on arXiv.
[2026/02/21] 🎉🎉🎉 Accepted to CVPR 2026.

🔬 Project Overview

🧩 Conventional VLMs in autonomous driving “understand language but lack geometric insight.” Even when augmented with constructed Q&A data for auxiliary training, such approaches provide only superficial improvements and fail to address the core limitation in cross-view 3D spatial understanding.

💡 VGGDrive moves beyond data-level fixes and charts a new course by upgrading the capability structure itself. It introduces a mature 3D foundation model as a geometric backbone for VLMs, establishing a new technical paradigm that empowers Vision-Language Agents (VLAs) with 3D modeling capability and provides a scalable, sustainable pathway for enhancing autonomous driving systems.

🛠️ The core innovation lies in the design of a plug-and-play Cross-View Geometric Enabler (CVGE). Through a hierarchical adaptive injection mechanism, VGGDrive achieves deep coupling between a frozen 3D foundation model and a VLM without altering the original VLM architecture. This mechanism efficiently injects 3D geometric features into the model, enabling genuine cross-view 3D geometric modeling capability for autonomous driving VLAs.
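To make the idea of injecting features without changing the host model concrete, here is a minimal sketch of one common pattern: a zero-initialized, tanh-gated residual applied at several layers. This is an illustrative assumption, not the CVGE implementation. The names `inject` and `inject_hierarchically`, the per-layer gate, and the use of plain Python lists in place of tensors are all hypothetical.

```python
import math

def inject(hidden, geometry, gate_logit):
    """Add gated geometry features to one layer's hidden state (residual form)."""
    if len(hidden) != len(geometry):
        raise ValueError("hidden and geometry must have the same width")
    gate = math.tanh(gate_logit)  # a gate logit of 0.0 leaves the hidden state untouched
    return [h + gate * g for h, g in zip(hidden, geometry)]

def inject_hierarchically(layer_states, geometry, gate_logits):
    """Apply one gate per layer, so each depth can take a different injection strength."""
    return [inject(h, geometry, a) for h, a in zip(layer_states, gate_logits)]

# Toy hidden states for two layers and one shared geometry feature vector.
layer_states = [[1.0, 2.0], [0.5, -0.5]]
geometry = [0.2, 0.4]

# With all gates at zero the host model's activations pass through unchanged.
unchanged = inject_hierarchically(layer_states, geometry, [0.0, 0.0])
```

With every gate at zero the output equals the input, which is why this kind of residual gate can be attached to a pretrained model without disturbing it before training.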

📈 Importantly, VGGDrive is not limited to single-task optimization. It consistently improves performance across five mainstream autonomous driving benchmarks, covering cross-view risk perception, scene understanding, motion and state prediction, and trajectory planning, thereby enhancing the full pipeline from perception to decision-making.

🏗️ Framework

🏛️ Model Zoo

| Model | Dataset | Download | Qwen_json |
|---|---|---|---|
| VGGDrive | NAVSIM | ckpt | train & test |
| VGGDrive | NuInstruct | | train & test |
| VGGDrive | DriveLM | submission.json | train & test |
| VGGDrive | OmniDrive | | train & test |
| VGGDrive | NuScenes | | train & test |

⚠️ Prerequisite:

Please download the pretrained VGGT model weights (model.pt) from vggt and place the file in the ./vggt folder.
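A small pre-flight check can catch a missing weights file before inference starts. The sketch below only assumes the `./vggt/model.pt` location stated above. The helper name `find_vggt_weights` is hypothetical and not part of this repository.

```python
from pathlib import Path

def find_vggt_weights(repo_root="."):
    """Return the path to vggt/model.pt under repo_root, or raise if it is absent."""
    weights = Path(repo_root) / "vggt" / "model.pt"
    if not weights.is_file():
        raise FileNotFoundError(
            f"VGGT weights not found at {weights}; download model.pt first"
        )
    return weights
```

Calling `find_vggt_weights()` from the repository root returns the path when the file is in place and raises `FileNotFoundError` otherwise.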

🏁 Quick Start

1. Environment

We recommend using Python 3.10+ and CUDA 12.x. The core dependencies used in this project include:

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install transformers==4.49.0 accelerate==1.3.0 datasets==3.2.0
pip install vllm==0.7.1 flash-attn==2.7.4.post1 xformers==0.0.28.post3
pip install timm==1.0.14 peft==0.14.0 bitsandbytes==0.45.2
pip install opencv-python==4.11.0.86 pillow==11.1.0
pip install numpy==1.26.4 pandas==2.2.3 scipy==1.15.2 scikit-learn==1.6.1
pip install nuscenes-devkit==1.1.11 pyquaternion==0.9.9 shapely==1.8.5.post1

Alternatively, install all dependencies from the provided environment file:

pip install -r requirements.txt
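After installation, it can help to confirm that the installed versions match the pins above. The following is a minimal sketch using the standard-library `importlib.metadata`. The helper `check_pins` and the `PINNED` subset are hypothetical; extend the dictionary with the remaining packages as needed.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pins listed above; add the others in the same way.
PINNED = {
    "torch": "2.5.1",
    "transformers": "4.49.0",
    "vllm": "0.7.1",
    "flash-attn": "2.7.4.post1",
    "xformers": "0.0.28.post3",
}

def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for each missing or mismatched package."""
    problems = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        # Builds may carry a local suffix such as "+cu124"; compare the public part only.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems

if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed package is installed at the pinned version.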

2. Run Inference

Run the NAVSIM inference script to generate prediction results:

bash run_scripts/inference_navsim.sh

3. Evaluation

Please follow the official NAVSIM v1.1 evaluation protocol (NAVSIM-v1.1). NAVSIM evaluates end-to-end driving performance with simulation-based metrics, such as progress and time-to-collision, under a non-reactive simulation setting. For detailed setup, dataset preparation, and metric computation, please refer to the official NAVSIM repository.
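For orientation, the sub-metrics are combined into a single PDM score. The sketch below assumes the aggregation described in the NAVSIM v1 paper: no-collision (NC) and drivable-area compliance (DAC) act as multiplicative penalties, while ego progress (EP), time-to-collision (TTC), and comfort are averaged with weights 5, 5, and 2. The function name `pdm_score` is hypothetical, and the official NAVSIM repository remains the authority for the actual computation.

```python
def pdm_score(nc, dac, ep, ttc, comfort):
    """Combine NAVSIM v1 sub-metrics, each in [0, 1], into one PDM score."""
    metrics = {"nc": nc, "dac": dac, "ep": ep, "ttc": ttc, "comfort": comfort}
    for name, value in metrics.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Weighted average of the soft terms (weights 5, 5, 2 sum to 12).
    weighted = (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0
    # Hard penalties multiply the average, so a collision zeroes the score.
    return nc * dac * weighted
```

Because NC and DAC multiply the weighted average, a collision or leaving the drivable area drives the score to zero regardless of progress.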

📌 Citation

@article{wang2026vggdrive,
  title={VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving},
  author={Wang, Jie and Li, Guang and Huang, Zhijian and Dang, Chenxu and Ye, Hangjun and Han, Yahong and Chen, Long},
  journal={arXiv preprint arXiv:2602.20794},
  year={2026}
}
