AVD2: Accident Video Diffusion for Accident Video Description

2025 IEEE International Conference on Robotics & Automation (ICRA2025)

The First Work to Generate Accident Videos:

This repository is an official implementation of AVD2: Accident Video Diffusion for Accident Video Description.

Created by:
Cheng Li^[1,2,*], Keyuan Zhou^[1,3,*], Tong Liu^[1,4,*], Yu Wang^[1,5,*], Mingqiao Zhuang^[6],
Huan-ang Gao^[1], Bu Jin^[1], and Hao Zhao^[1,7,8,†]

* Indicates equal contribution.
† The corresponding author.

Affiliations:

1. Institute for AI Industry Research (AIR), Tsinghua University.
2. Academy of Interdisciplinary Studies, the Hong Kong University of Science and Technology.
3. College of Communication Engineering, Jilin University.
4. School of Cyber Science and Engineering, Nanjing University of Science and Technology.
5. School of Automation, Beijing Institute of Technology.
6. College of Foreign Language and Literature, Fudan University.
7. Beijing Academy of Artificial Intelligence (BAAI).
8. Lightwheel AI.

Our System Framework:

Our AVD2 Project Video is available at:

AVD2 Project Video: https://youtu.be/iGdSIofB_k8

Introduction

We propose a novel framework, AVD2 (Accident Video Diffusion for Accident Video Description), which enhances transparency and explainability in autonomous driving systems by providing detailed natural language narrations and reasoning for accident scenarios. AVD2 jointly tackles both the accident description and prevention tasks, offering actionable insights through a shared video representation. This repository includes (or will soon include) the full implementation of AVD2, along with the training and evaluation setups, the generated accident dataset (EMM-AU), and the conda environment.

Note

We have uploaded the required environment for our AVD2 system.
We have released the complete raw EMM-AU dataset (including the raw MM-AU dataset and the raw generated videos).
We have released the complete processed EMM-AU dataset.
We have released the instructions and code for data augmentation (including the super-resolution code and the instructions for Open-Sora fine-tuning).
We have released the checkpoint file of our fine-tuned, improved Open-Sora 1.2 model.
We have released the data preprocessing code ("/root/src/prepro/") and the model evaluation code ("/root/src/evalcap/" and "/root/evaluation/") of the project.

Getting Started Environment

Create conda environment:

conda create --name AVD2 python=3.8

Install torch:

pip install torch==1.13.1+cu117 torchaudio==0.13.1+cu117 torchvision==0.14.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html

Install apex:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ..
rm -rf apex

Install mpi4py:

conda install -c conda-forge mpi4py openmpi

Install other dependencies and packages:

pip install -r requirements.txt

More Details for our System

Our AVD2 framework is based on the Action-aware Driving Caption Transformer (ADAPT) and Self-Critical Sequence Training (SCST).
The code and more information about ADAPT and SCST can be found here:
ADAPT: https://arxiv.org/pdf/2302.00673
ADAPT code: https://github.com/jxbbb/ADAPT/tree/main?tab=MIT-1-ov-file
SCST: https://arxiv.org/abs/1612.00563
SCST code: https://github.com/ruotianluo/self-critical.pytorch

Dataset

This part covers the dataset preprocessing code, the raw dataset (including the complete EMM-AU dataset), the code and steps for data augmentation, and the processed dataset.

Dataset Preprocessing

Before running the script, change the names and locations of the train/val/test datasets to match your setup.

cd src
cd prepro
sh preprocess.sh

Raw Dataset Download

EMM-AU (Enhanced MM-AU Dataset) contains the "Raw MM-AU Dataset" and the "Enhanced Generated Videos".

Parts	Download
Raw MM-AU Dataset	Official Github Page
Our Enhanced Generated Videos	HuggingFace

Data Augmentation

We used Open-Sora 1.2 to generate the "Enhanced Part" of EMM-AU. Refer to the Open-Sora official GitHub page for installation.

Fine-tuning for Open-Sora 1.2

Before fine-tuning, you need to prepare a CSV file (HERE IS A METHOD).
An example that is ready for training:

path, text, num_frames, width, height, aspect_ratio
/absolute/path/to/image1.jpg, caption, 1, 720, 1280, 0.5625
/absolute/path/to/video1.mp4, caption, 120, 720, 1280, 0.5625
/absolute/path/to/video2.mp4, caption, 20, 256, 256, 1
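
As a rough illustration of how such a file could be produced, here is a minimal Python sketch. The helper names (`make_row`, `write_training_csv`) are hypothetical and not part of this repository or of Open-Sora. It assumes that `aspect_ratio` is the `width` column divided by the `height` column, which is what the example rows above suggest (720 / 1280 = 0.5625); please check Open-Sora's own data tools for the authoritative definition.

```python
import csv
import io

# Column order copied from the example above.
COLUMNS = ["path", "text", "num_frames", "width", "height", "aspect_ratio"]


def make_row(path, caption, num_frames, width, height):
    """Build one CSV row.

    Assumption: aspect_ratio = width / height, inferred from the example rows.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    ratio = round(width / height, 4)
    # The "g" format drops trailing zeros, so 1.0 is written as "1".
    return [path, caption, num_frames, width, height, f"{ratio:g}"]


def write_training_csv(samples, out_file):
    """Write a header plus one row per (path, caption, num_frames, width, height)."""
    writer = csv.writer(out_file)  # quotes captions that contain commas
    writer.writerow(COLUMNS)
    for sample in samples:
        writer.writerow(make_row(*sample))


if __name__ == "__main__":
    # Hypothetical samples that mirror the example rows above.
    samples = [
        ("/absolute/path/to/video1.mp4", "caption", 120, 720, 1280),
        ("/absolute/path/to/video2.mp4", "caption", 20, 256, 256),
    ]
    buffer = io.StringIO()
    write_training_csv(samples, buffer)
    print(buffer.getvalue())
```

To write to disk instead of printing, pass a file opened with `open(YOUR_CSV_PATH, "w", newline="")` as `out_file`.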

Then use the following commands to train a new model or fine-tune an existing one (starting from YOUR_PRETRAINED_CKPT).
You can also change the training config in "configs/opensora-v1-2/train/stage3.py".

# one node
torchrun --standalone --nproc_per_node 8 scripts/train.py \
    configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
# multiple nodes
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
    configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

Inference with Open-Sora 1.2

You can download our pretrained model for accident video generation.

# text to video
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --prompt "a beautiful waterfall"

# batch generation (needs a txt file with one prompt per line)
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
  --num-frames 4s --resolution 720p --aspect-ratio 9:16 \
  --num-sampling-steps 30 --flow 5 --aes 6.5 \
  --prompt-path YOUR_PROMPT.TXT \
  --batch-size 1 \
  --loop 1 \
  --save-dir YOUR_SAVE_DIR \
  --ckpt-path YOUR_CHECKPOINT
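
The batch command above expects a text file with one prompt per line. A small Python sketch of how such a file might be written is shown below. The function name `write_prompt_file` and the sample prompts are illustrative assumptions, not part of the repository. Internal line breaks are collapsed so that each prompt stays on a single line.

```python
from pathlib import Path


def write_prompt_file(prompts, out_path):
    """Write one prompt per line and return the number of prompts written.

    Whitespace (including newlines) inside a prompt is collapsed to single
    spaces, and empty prompts are skipped, so the file keeps the
    one-prompt-per-line layout.
    """
    cleaned = [" ".join(prompt.split()) for prompt in prompts]
    cleaned = [prompt for prompt in cleaned if prompt]
    text = "".join(prompt + "\n" for prompt in cleaned)
    Path(out_path).write_text(text, encoding="utf-8")
    return len(cleaned)


if __name__ == "__main__":
    # Hypothetical accident-style prompts, for illustration only.
    example_prompts = [
        "A vehicle changes lanes in front of the ego-car on a wet road",
        "Lead vehicle stops suddenly at an intersection",
    ]
    count = write_prompt_file(example_prompts, "YOUR_PROMPT.TXT")
    print(f"wrote {count} prompts")
```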

RRDBNet Super-Resolution

The conda environment for the super-resolution part can be installed as:

conda create --name S_R python=3.8
source activate S_R
cd src/Super_resolution
pip install -r requirements.txt

You may also need to install these two codebases.
The first one:

pip install git+https://github.com/XPixelGroup/BasicSR.git

The second one:

pip install git+https://github.com/xinntao/Real-ESRGAN.git

Then run the RRDBNet model code within the Real-ESRGAN framework to apply super-resolution to the dataset.

python Super_Resolution.py

Processed Dataset Download

You can download the Processed_EMM-AU_Dataset from our HuggingFace page.
The caption (annotation) file for all 2000 generated videos has been released at "root/Process_Dataset/generated_2000videos_captions.json".

Download Our Fine-tuned Open-Sora 1.2 model for Video Generation

You can download the checkpoint of the pretrained_model_for_video_generation from our HuggingFace page. This is our improved Open-Sora 1.2 model, obtained by two-step fine-tuning from the original official pretrained Open-Sora.

Train the Basic Model

conda activate AVD2
sh scripts/BDDX_multitask.sh

Testing/Evaluation

You can download the output from "/root/output/checkpoint".
To evaluate the output, you first need to convert the data format:

cd evaluation
python tsv2coco.py
python json2coco.py
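
For readers unfamiliar with the target format, the sketch below shows the general COCO caption layout (an `images` list plus an `annotations` list) that conversion scripts of this kind usually produce. It is not the repository's `tsv2coco.py`. The function name `tsv_to_coco` and the assumed input layout (one `video_id<TAB>caption` pair per line) are hypothetical, so treat the repository scripts as authoritative.

```python
import csv
import io
import json


def tsv_to_coco(tsv_text):
    """Convert 'video_id<TAB>caption' lines into a COCO-caption-style dict.

    Assumption: the first column is a video id and the second is a caption.
    """
    images, annotations = [], []
    seen_ids = set()
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        video_id, caption = row[0], row[1].strip()
        if video_id not in seen_ids:
            seen_ids.add(video_id)
            images.append({"id": video_id})
        annotations.append(
            {"image_id": video_id, "id": len(annotations), "caption": caption}
        )
    return {"images": images, "annotations": annotations}


if __name__ == "__main__":
    # Hypothetical input, for illustration only.
    sample = "vid1\tLead vehicle stops\nvid2\tVehicles meet on the road\n"
    print(json.dumps(tsv_to_coco(sample), indent=2))
```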

We also provide data already converted to the right format ("/root/evaluation/ground_truth_captions1", "/root/evaluation/ground_truth_captions2", "/root/evaluation/generated_captions1", "/root/evaluation/generated_captions2").
Then you can run the testing/evaluation code:

pip install pycocoevalcap -i https://pypi.tuna.tsinghua.edu.cn/simple
# or
pip install pycocoevalcap
python pycocoevaluationmetric.py

Visualization

These are random examples of the generated accident frames in our EMM-AU dataset:

This is a visualization of the understanding ability of our AVD2 system (compared with ChatGPT-4o and the ground truth):

Accident example 1:

AVD2 Prediction
Description: A vehicle changes lanes with the same direction to ego-car; Vehicles don't give way to normal driving vehicles when turning or changing lanes.
Avoidance: Before turning or changing lanes, vehicles should turn on the turn signal in advance, observe the surrounding vehicles and control the speed. When driving, vehicles should abide by traffic rules, and give the way for the normal running vehicles. Vehicles that will enter the main road should give way to the vehicles which drive on the main road or leave the main road. Vehicles that drive on the auxiliary road should give way to the vehicles which drive off the main road.

ChatGPT-4o Prediction
Description: A vehicle approaches a busy intersection and fails to notice another car coming from the side; The vehicle abruptly brakes to avoid a collision, but the close proximity creates a dangerous situation.
Avoidance: Drivers should always reduce speed when approaching intersections and remain alert to traffic from all directions. Maintaining a safe distance and carefully observing other vehicles is essential to prevent accidents at intersections.

GroundTruth
Description: Lead vehicle stops; Vehicles do not give way to normal driving vehicles when turning or changing lanes.
Avoidance: Before turning or changing lanes, vehicles should turn on the turn signal in advance, observe the surrounding vehicles and control the speed. When driving, vehicles should abide by traffic rules, and give the way for the normal running vehicles. Vehicles that will enter the main road should give way to the vehicles which drive on the main road or leave the main road. Vehicles that drive on the auxiliary road should give way to the vehicles which drive off the main road.

Accident example 2:

AVD2 Prediction
Description: A vehicle changes lanes with the same direction to ego-car; Vehicles don't give way to normal driving vehicles when turning or changing lanes.
Avoidance: Ego-cars should not exceed the speed limit during driving, slow down when passing intersections or crosswalks, especially for areas with many pedestrians.

ChatGPT-4o Prediction
Description: A vehicle makes a sharp turn at an intersection without signaling; The vehicle behind is forced to brake abruptly due to insufficient reaction time.
Avoidance: Drivers should signal well in advance before making turns at intersections. Maintaining a safe distance from other vehicles and anticipating sudden turns can help prevent accidents.

GroundTruth
Description: Vehicles meet on the road; Vehicles drive too fast with short braking distance.
Avoidance: Vehicles should not exceed the speed limit during driving, especially in areas with many pedestrians. Vehicles should slow down when passing intersections or crosswalks, and observe the traffic carefully.

Acknowledgements

We are grateful for the support of the Institute for AI Industry Research (AIR) at Tsinghua University and Lightwheel AI, for Kairui Ding's help on our project, and to the LOTVS-MMAU (Multi-Modal Accident video Understanding) team for open-sourcing and sharing the MM-AU dataset.
