WJ-CV/VGGDrive

VGGDrive

✨ VGGDrive: Empowering Vision-Language Models ✨
with Cross-View Geometric Grounding for Autonomous Driving

📢 News

  • [2026/03/09] 🌐 Project page is live: demo
  • [2026/02/26] 🚀 Released VGGDrive NAVSIM v1 weights and inference code.
  • [2026/02/24] 👉 We released our paper on arXiv.
  • [2026/02/21] 🎉🎉🎉 Accepted to CVPR 2026.

🔬 Project Overview

🧩 Conventional VLMs in autonomous driving “understand language but lack geometric insight.” Even when augmented with constructed Q&A data for auxiliary training, such approaches provide only superficial improvements and fail to address the core limitation in cross-view 3D spatial understanding.

💡 VGGDrive moves beyond data-level fixes and charts a new course by upgrading the capability structure itself. It introduces a mature 3D foundation model as a geometric backbone for VLMs, establishing a new technical paradigm that empowers Vision-Language Agents (VLAs) with 3D modeling capability and provides a scalable, sustainable pathway for enhancing autonomous driving systems.

🛠️ The core innovation lies in the design of a plug-and-play Cross-View Geometric Enabler (CVGE). Through a hierarchical adaptive injection mechanism, VGGDrive achieves deep coupling between a frozen 3D foundation model and a VLM without altering the original VLM architecture. This mechanism efficiently injects 3D geometric features into the model, enabling genuine cross-view 3D geometric modeling capability for autonomous driving VLAs.
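To make the idea of injecting frozen geometric features without changing the VLM concrete, here is a minimal, dependency-free sketch of one common pattern: a gated residual injection. This is an illustration only, not the CVGE implementation from this repository. The function names (`linear`, `gated_inject`), the scalar tanh gate, the per-layer gate list, and the assumption that geometry tokens are already aligned one-to-one with VLM tokens are all hypothetical choices made for the example.

```python
import math
import random


def linear(x, weight):
    """Project each row of x (tokens x d_in) with weight (d_in x d_out)."""
    return [[sum(xi * w for xi, w in zip(row, col)) for col in zip(*weight)] for row in x]


def gated_inject(hidden, geo_feats, proj_weight, gate):
    """Add projected geometry features to VLM hidden states through a tanh gate.

    hidden:      tokens x d_model, hidden states at one VLM layer
    geo_feats:   tokens x d_geo, features from a frozen 3D model (assumed token-aligned)
    proj_weight: d_geo x d_model, the only trainable projection in this sketch
    gate:        scalar; tanh(0) = 0, so a zero gate leaves the VLM unchanged
    """
    projected = linear(geo_feats, proj_weight)
    g = math.tanh(gate)
    return [[h + g * p for h, p in zip(h_row, p_row)] for h_row, p_row in zip(hidden, projected)]


random.seed(0)
tokens, d_model, d_geo = 4, 8, 6
hidden = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(tokens)]
geo = [[random.gauss(0, 1) for _ in range(d_geo)] for _ in range(tokens)]
weight = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_geo)]

# One gate per injected layer: a hypothetical stand-in for "hierarchical" injection.
layer_gates = [0.0, 0.5, 1.0]
outputs = [gated_inject(hidden, geo, weight, gate) for gate in layer_gates]
```

With a gate of zero the hidden states pass through untouched, which is why this kind of design can be attached to a pretrained model without disturbing it at initialization. The actual CVGE may differ substantially; see the paper and source code for the real mechanism.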

📈 Importantly, VGGDrive is not limited to single-task optimization. It consistently improves performance across five mainstream autonomous driving benchmarks, covering cross-view risk perception, scene understanding, motion and state prediction, and trajectory planning, thereby enhancing the full pipeline from perception to decision-making.


🏗️ Framework

[Figure: VGGDrive framework overview (fig3_2)]

🏛️ Model Zoo

| Model | Dataset | Download | Qwen_json |
|---|---|---|---|
| VGGDrive | NAVSIM | ckpt | train & test |
| VGGDrive | NuInstruct | | train & test |
| VGGDrive | DriveLM | submission.json | train & test |
| VGGDrive | OmniDrive | | train & test |
| VGGDrive | NuScenes | | train & test |

⚠️ Prerequisite:

Please download the pretrained VGGT model weights (model.pt) from vggt and place them in the ./vggt folder.
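A quick way to confirm the prerequisite is met before launching anything is to check for the file from Python. The helper below is a small sketch; the function name `find_vggt_weights` is hypothetical and not part of this repository. It only assumes the `./vggt/model.pt` location stated above.

```python
from pathlib import Path


def find_vggt_weights(repo_root="."):
    """Return the path to vggt/model.pt under repo_root, or None if it is missing."""
    weights = Path(repo_root) / "vggt" / "model.pt"
    return weights if weights.is_file() else None


if __name__ == "__main__":
    path = find_vggt_weights()
    if path is None:
        print("Missing ./vggt/model.pt: download the VGGT weights first.")
    else:
        print(f"Found VGGT weights at {path} ({path.stat().st_size / 1e6:.1f} MB)")
```

Run it from the repository root so that the relative path resolves correctly.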

🏁 Quick Start

1. Environment

We recommend using Python 3.10+ and CUDA 12.x. The core dependencies used in this project include:

pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1
pip install transformers==4.49.0 accelerate==1.3.0 datasets==3.2.0
pip install vllm==0.7.1 flash-attn==2.7.4.post1 xformers==0.0.28.post3
pip install timm==1.0.14 peft==0.14.0 bitsandbytes==0.45.2
pip install opencv-python==4.11.0.86 pillow==11.1.0
pip install numpy==1.26.4 pandas==2.2.3 scipy==1.15.2 scikit-learn==1.6.1
pip install nuscenes-devkit==1.1.11 pyquaternion==0.9.9 shapely==1.8.5.post1

Alternatively, install all dependencies from the provided requirements file:

pip install -r requirements.txt
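Because several of these packages are tightly coupled (for example, vllm and xformers builds target a specific torch release), it can help to verify the installed versions after installation. The sketch below uses only the standard library. The `PINNED` dictionary lists a handful of pins copied from the commands above, and `check_pins` is a hypothetical helper, not a script shipped with this repository.

```python
from importlib import metadata

# A few pins copied from the install commands above; extend as needed.
PINNED = {
    "torch": "2.5.1",
    "transformers": "4.49.0",
    "vllm": "0.7.1",
    "numpy": "1.26.4",
}


def check_pins(pins):
    """Return {package: (expected, installed)} for every mismatch or missing package."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local builds such as "2.5.1+cu124" still count as a match.
        if installed is None or installed.split("+")[0] != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package matches its pin; anything printed points at a package to reinstall.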

2. Run Inference

Run the NAVSIM inference script to generate prediction results:

bash run_scripts/inference_navsim.sh

3. Evaluation

Please follow the official NAVSIM v1.1 evaluation protocol (NAVSIM-v1.1). NAVSIM evaluates end-to-end driving performance with simulation-based metrics, such as progress and time-to-collision, under a non-reactive simulation setting. For detailed setup, dataset preparation, and metric computation, please refer to the official NAVSIM repository.
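For orientation, NAVSIM v1 aggregates its sub-metrics into a single PDM score: no at-fault collision (NC) and drivable area compliance (DAC) act as multiplicative penalties, while ego progress (EP), time-to-collision (TTC), and comfort (C) are combined with weights 5, 5, and 2. The function below is a sketch of that aggregation as we understand it from the NAVSIM paper; the name `pdm_score` and its argument names are our own, and the official NAVSIM code remains the reference for actual evaluation.

```python
def pdm_score(nc, dac, ep, ttc, comfort):
    """Combine NAVSIM v1 sub-scores (each in [0, 1]) into a single PDM score."""
    scores = {"nc": nc, "dac": dac, "ep": ep, "ttc": ttc, "comfort": comfort}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Weighted average of the soft metrics, gated by the two hard penalties.
    weighted = (5 * ep + 5 * ttc + 2 * comfort) / 12
    return nc * dac * weighted
```

The multiplicative form means a collision or leaving the drivable area can zero out a scenario's score regardless of how much progress the planner made.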

📌 Citation

@article{wang2026vggdrive,
  title={VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving},
  author={Wang, Jie and Li, Guang and Huang, Zhijian and Dang, Chenxu and Ye, Hangjun and Han, Yahong and Chen, Long},
  journal={arXiv preprint arXiv:2602.20794},
  year={2026}
}

About

[CVPR 2026] VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
