
🌠 DriveGEN

This is the official project repository for DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation (CVPR 2025)

Abstract

In autonomous driving, vision-centric 3D detection aims to identify 3D objects from images. However, high data collection costs and diverse real-world scenarios limit the scale of training data. Once distribution shifts occur between training and test data, existing methods often suffer from performance degradation, known as Out-of-Distribution (OOD) problems. To address this, controllable Text-to-Image (T2I) diffusion offers a potential solution for training data enhancement, which is required to generate diverse OOD scenarios with precise 3D object geometry. Nevertheless, existing controllable T2I approaches are restricted by the limited scale of training data or struggle to preserve all annotated 3D objects. In this paper, we present DriveGEN, a method designed to improve the robustness of 3D detectors in Driving via Training-Free Controllable Text-to-Image Diffusion Generation. Without extra diffusion model training, DriveGEN consistently preserves objects with precise 3D geometry across diverse OOD generations, consisting of 2 stages: 1) Self-Prototype Extraction: We empirically find that self-attention features are semantic-aware but require accurate region selection for 3D objects. Thus, we extract precise object features via layouts to capture 3D object geometry, termed self-prototypes. 2) Prototype-Guided Diffusion: To preserve objects across various OOD scenarios, we perform semantic-aware feature alignment and shallow feature alignment during denoising. Extensive experiments demonstrate the effectiveness of DriveGEN in improving 3D detection.
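The two stages above can be sketched in miniature: a self-prototype is the average of semantic features inside an object's layout mask, and prototype-guided alignment penalizes drift of those features during denoising. This is a toy illustration only, not the repository's implementation; the function names, the mean-pooled prototype, and the cosine-similarity loss are simplifying assumptions.

```python
import numpy as np

def extract_self_prototype(features, mask):
    """Stage 1 (toy): average semantic features inside the object layout mask.
    features: (H, W, C) feature map; mask: (H, W) boolean object region."""
    region = features[mask]        # (N, C) features of the object region
    return region.mean(axis=0)     # (C,) self-prototype

def alignment_loss(features, mask, prototype):
    """Stage 2 (toy): 1 - cosine similarity between current features in the
    masked region and the stored prototype, averaged over the region."""
    region = features[mask]
    sims = region @ prototype / (
        np.linalg.norm(region, axis=1) * np.linalg.norm(prototype) + 1e-8)
    return float(1.0 - sims.mean())

# Toy check on random features with a small square "object" mask.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 4))
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 2:5] = True
proto = extract_self_prototype(feats, mask)
loss_same = alignment_loss(feats, mask, proto)
```

If the features exactly match the prototype everywhere in the mask, the loss vanishes; as generation drifts away from the object's geometry, the loss grows, which is the signal that guidance would push against.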

Data Preparation

Monocular 3D object detection

  • 1️⃣ Download the KITTI dataset from the official website
  • 2️⃣ Download the splits (the ImageSets folder) from MonoTTA

Then,

```shell
mkdir data && cd data
ln -s /your_path_KITTI .          # symlink the KITTI root into ./data
mv ./ImageSets ./your_path_KITTI  # place the downloaded splits inside it
```
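After these steps, the data directory should roughly follow the standard KITTI layout used by most monocular 3D detection codebases. The exact subfolders below are an assumption; check the repository's configuration files:

```
data/
└── your_path_KITTI/    # symlink to the KITTI root
    ├── ImageSets/      # train/val/test splits from MonoTTA
    ├── training/
    │   ├── calib/
    │   ├── image_2/
    │   └── label_2/
    └── testing/
        ├── calib/
        └── image_2/
```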

Multi-view 3D object detection

  • 1️⃣ Download the nuScenes dataset from the official website
  • 2️⃣ (Optional) Download the nuScenes-C dataset from the Robo3D benchmark

You can also download all generated images on Hugging Face 🤗

Installation

Build the conda environment via

```shell
conda env create -f environment.yml
conda activate driveGEN
pip install -r requirements.txt
```

Usage

Monocular 3D object detection

```shell
# For self-prototype extraction
python KITTI_s1_self_prototype_extraction.py

# For prototype-guided image generation
python KITTI_s2_image_generation.py
```

Multi-view 3D object detection

```shell
# For self-prototype extraction
python nus_s1_self_prototype_extraction.py

# For prototype-guided image generation
python nus_s2_image_generation.py
```

Citation

If our DriveGEN method is helpful in your research, please consider citing our paper:

```bibtex
@inproceedings{lin2025drivegen,
  title={{DriveGEN}: Generalized and Robust {3D} Detection in Driving via Controllable Text-to-Image Diffusion Generation},
  author={Lin, Hongbin and Guo, Zilu and Zhang, Yifan and Niu, Shuaicheng and Li, Yafeng and Zhang, Ruimao and Cui, Shuguang and Li, Zhen},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={27497--27507},
  year={2025}
}
```

Acknowledgment

The code builds heavily on FreeControl.

Correspondence

Please contact Hongbin Lin at linhongbinanthem@gmail.com if you have any questions. 📬
