This repository is an official implementation of AVD2: Accident Video Diffusion for Accident Video Description.
Created by:
Cheng Li[1,2,*], Keyuan Zhou[1,3,*], Tong Liu[1,4,*], Yu Wang[1,5,*], Mingqiao Zhuang[6],
Huan-ang Gao[1], Bu Jin[1], and Hao Zhao[1,7,8,†]
* Indicates equal contribution.
† The corresponding author.
Affiliations:
1. Institute for AI Industry Research (AIR), Tsinghua University.
2. Academy of Interdisciplinary Studies, the Hong Kong University of Science and Technology.
3. College of Communication Engineering, Jilin University.
4. School of Cyber Science and Engineering, Nanjing University of Science and Technology.
5. School of Automation, Beijing Institute of Technology.
6. College of Foreign Language and Literature, Fudan University.
7. Beijing Academy of Artificial Intelligence (BAAI).
8. Lightwheel AI.
AVD2 Project Video: https://youtu.be/iGdSIofB_k8
We propose a novel framework, AVD2 (Accident Video Diffusion for Accident Video Description), which enhances transparency and explainability in autonomous driving systems by providing detailed natural language narrations and reasoning for accident scenarios. AVD2 jointly tackles the accident description and prevention tasks, offering actionable insights through a shared video representation. This repository includes (or will soon include) the full implementation of AVD2, along with the training and evaluation setups, the generated EMM-AU accident dataset, and the conda environment.
- We have uploaded the required environment for our AVD2 system.
- We have released the whole raw EMM-AU dataset (including the raw MM-AU dataset and the raw generated videos).
- We have released the whole processed EMM-AU dataset.
- We have released the instructions and code for data augmentation (including the super-resolution code and the instructions for Open-Sora fine-tuning).
- We have released the checkpoint file of our fine-tuned, improved Open-Sora 1.2 model.
- We have released the data preprocessing code ("/root/src/prepro/") and the model evaluation code ("/root/src/evalcap/" and "/root/evaluation/") of the project.
Create conda environment:
conda create --name AVD2 python=3.8
Install torch:
pip install torch==1.13.1+cu117 torchaudio==0.13.1+cu117 torchvision==0.14.1+cu117 -f https://download.pytorch.org/whl/torch_stable.html
Install apex:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--deprecated_fused_adam" --global-option="--xentropy" --global-option="--fast_multihead_attn" ./
cd ..
rm -rf apex
Install mpi4py:
conda install -c conda-forge mpi4py openmpi
Install other dependencies and packages:
pip install -r requirements.txt
Our AVD2 framework is based on the Action-aware Driving Caption Transformer (ADAPT) and Self-Critical Sequence Training (SCST).
The code and more information about ADAPT and SCST can be found here:
ADAPT: https://arxiv.org/pdf/2302.00673
ADAPT codes: https://github.com/jxbbb/ADAPT/tree/main?tab=MIT-1-ov-file
SCST: https://arxiv.org/abs/1612.00563
SCST codes: https://github.com/ruotianluo/self-critical.pytorch
This part covers the dataset preprocessing code, the raw dataset (including the whole EMM-AU dataset), the code and steps for data augmentation, and the processed dataset.
Before running the preprocessing, change the names and locations of the train/val/test datasets to match your setup.
cd src
cd prepro
sh preprocess.sh
EMM-AU (Enhanced MM-AU Dataset) contains the "Raw MM-AU Dataset" and the "Enhanced Generated Videos".
| Parts | Download |
|---|---|
| Raw MM-AU Dataset | Official Github Page |
| Our Enhanced Generated Videos | HuggingFace |
We used Open-Sora 1.2 to generate the "Enhanced Part" of EMM-AU. Refer to the official Open-Sora GitHub page for installation.
Before fine-tuning, you need to prepare a CSV file. Here is a method.
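Independently of the method referenced above, the following is a minimal sketch (not the authors' tooling) of how such a CSV could be written with the Python standard library. It uses the six-column layout of the example that follows. The helper names `make_row` and `write_training_csv` are hypothetical, and the rule `aspect_ratio = width / height` is an assumption inferred from the example rows (720 / 1280 = 0.5625, 256 / 256 = 1). Frame counts and resolutions are passed in by the caller rather than probed from the files.

```python
import csv

# Column order copied from the example CSV in this README.
COLUMNS = ["path", "text", "num_frames", "width", "height", "aspect_ratio"]


def make_row(path, caption, num_frames, width, height):
    """Build one CSV row as a dict.

    Assumption: aspect_ratio = width / height, which is consistent with
    the example rows in this README but is not confirmed by the authors.
    """
    return {
        "path": str(path),
        "text": caption,
        "num_frames": int(num_frames),
        "width": int(width),
        "height": int(height),
        "aspect_ratio": round(width / height, 4),
    }


def write_training_csv(samples, csv_path):
    """Write (path, caption, num_frames, width, height) tuples to csv_path."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        for sample in samples:
            writer.writerow(make_row(*sample))


if __name__ == "__main__":
    write_training_csv(
        [("/absolute/path/to/video1.mp4", "caption", 120, 720, 1280)],
        "train.csv",
    )
```

The `csv` module quotes captions that contain commas, so free-text captions do not shift the columns.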
An example ready for training:
path, text, num_frames, width, height, aspect_ratio
/absolute/path/to/image1.jpg, caption, 1, 720, 1280, 0.5625
/absolute/path/to/video1.mp4, caption, 120, 720, 1280, 0.5625
/absolute/path/to/video2.mp4, caption, 20, 256, 256, 1
Then use the bash commands to train a new model or fine-tune an existing one (based on YOUR_PRETRAINED_CKPT).
You can also change the training config in "configs/opensora-v1-2/train/stage3.py".
# one node
torchrun --standalone --nproc_per_node 8 scripts/train.py \
configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
# multiple nodes
colossalai run --nproc_per_node 8 --hostfile hostfile scripts/train.py \
configs/opensora-v1-2/train/stage3.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT
You can download our pretrained model for accident video generation.
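The batch-generation command among the inference examples that follow reads prompts from a text file with one prompt per line. As a minimal sketch of preparing such a file, the hypothetical helper `write_prompt_file` below collapses line breaks and repeated whitespace inside each prompt so that every prompt stays on a single line. It is an illustration, not part of the released code.

```python
def write_prompt_file(prompts, txt_path):
    """Write one prompt per line and return the number of prompts written.

    Whitespace inside a prompt (including newlines) is collapsed to single
    spaces; prompts that are empty after cleaning are skipped.
    """
    lines = [" ".join(p.split()) for p in prompts]
    lines = [line for line in lines if line]
    with open(txt_path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)


if __name__ == "__main__":
    n = write_prompt_file(
        ["A vehicle changes lanes\nwithout signaling", "Lead vehicle stops"],
        "prompts.txt",
    )
    print(f"wrote {n} prompts")
```

The resulting path can then be passed as YOUR_PROMPT.TXT to `--prompt-path`.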
# text to video
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
--num-frames 4s --resolution 720p --aspect-ratio 9:16 \
--prompt "a beautiful waterfall"
# batch generation (needs a txt file with one prompt per line)
python scripts/inference.py configs/opensora-v1-2/inference/sample.py \
--num-frames 4s --resolution 720p --aspect-ratio 9:16 \
--num-sampling-steps 30 --flow 5 --aes 6.5 \
--prompt-path YOUR_PROMPT.TXT \
--batch-size 1 \
--loop 1 \
--save-dir YOUR_SAVE_DIR \
--ckpt-path YOUR_CHECKPOINT
The conda environment for the super-resolution part can be installed as:
conda create --name S_R python=3.8
source activate S_R
cd src/Super_resolution
pip install -r requirements.txt
You may also need to install these two codebases.
The first one:
pip install git+https://github.com/XPixelGroup/BasicSR.git
The second one:
pip install git+https://github.com/xinntao/Real-ESRGAN.git
Then run the RRDBNet model code within the Real-ESRGAN framework to apply super-resolution to the dataset:
python Super_Resolution.py
You can download the Processed_EMM-AU_Dataset from our HuggingFace page.
The captions (annotations) for all 2000 generated videos have been released in "root/Process_Dataset/generated_2000videos_captions.json".
You can download the checkpoint of the pretrained_model_for_video_generation from our HuggingFace page. It is our improved Open-Sora 1.2 model, obtained by two-step fine-tuning of the original official pretrained Open-Sora model.
conda activate AVD2
sh scripts/BDDX_multitask.sh
You can download the output from "/root/output/checkpoint".
To evaluate the output, you first need to convert the data format:
cd evaluation
python tsv2coco.py
python json2coco.py
Here, we provide the data already transformed into the right format ("/root/evaluation/ground_truth_captions1", "/root/evaluation/ground_truth_captions2", "/root/evaluation/generated_captions1", "/root/evaluation/generated_captions1").
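For orientation, the sketch below shows the COCO-caption layout that pycocotools-style evaluation generally expects: a ground-truth file with "images" and "annotations" lists, and a results file that is a flat list of predictions. This illustrates the target format only. The function names `to_coco_ground_truth` and `to_coco_results` are hypothetical, and the repository's own tsv2coco.py and json2coco.py may organize the data differently.

```python
import json


def to_coco_ground_truth(captions):
    """captions: dict mapping a video id to a list of reference captions."""
    images, annotations = [], []
    ann_id = 0
    for video_id, references in captions.items():
        images.append({"id": video_id})
        for caption in references:
            annotations.append(
                {"image_id": video_id, "id": ann_id, "caption": caption}
            )
            ann_id += 1
    # "info" and "licenses" are left empty; some pycocotools versions
    # expect these keys to be present when loading results.
    return {
        "info": {},
        "licenses": [],
        "type": "captions",
        "images": images,
        "annotations": annotations,
    }


def to_coco_results(predictions):
    """predictions: dict mapping a video id to one generated caption."""
    return [
        {"image_id": video_id, "caption": caption}
        for video_id, caption in predictions.items()
    ]


if __name__ == "__main__":
    gt = to_coco_ground_truth({1: ["Lead vehicle stops."]})
    res = to_coco_results({1: "A vehicle changes lanes."})
    print(json.dumps(gt, indent=2))
    print(json.dumps(res, indent=2))
```

Files in this layout can typically be loaded with pycocotools (`COCO(annotation_file)` followed by `loadRes(result_file)`) and scored with `COCOEvalCap` from pycocoevalcap. Whether this matches the exact files produced by the repository scripts is not confirmed here.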
Then you can run the evaluation code:
pip install pycocoevalcap -i https://pypi.tuna.tsinghua.edu.cn/simple
# or
pip install pycocoevalcap
python pycocoevaluationmetric.py
This is a visualization of the understanding ability of our AVD2 system (compared with ChatGPT-4o and the ground truth):

AVD2 Prediction
Description:
A vehicle changes lanes with the same direction to ego-car; Vehicles don't give way to normal driving vehicles when turning or changing lanes.
Avoidance:
Before turning or changing lanes, vehicles should turn on the turn signal in advance, observe the surrounding vehicles and control the speed. When driving, vehicles should abide by traffic rules, and give the way for the normal running vehicles. Vehicles that will enter the main road should give way to the vehicles which drive on the main road or leave the main road. Vehicles that drive on the auxiliary road should give way to the vehicles which drive off the main road.
ChatGPT-4o Prediction
Description:
A vehicle approaches a busy intersection and fails to notice another car coming from the side; The vehicle abruptly brakes to avoid a collision, but the close proximity creates a dangerous situation.
Avoidance:
Drivers should always reduce speed when approaching intersections and remain alert to traffic from all directions. Maintaining a safe distance and carefully observing other vehicles is essential to prevent accidents at intersections.
GroundTruth
Description:
Lead vehicle stops; Vehicles do not give way to normal driving vehicles when turning or changing lanes.
Avoidance:
Before turning or changing lanes, vehicles should turn on the turn signal in advance, observe the surrounding vehicles and control the speed. When driving, vehicles should abide by traffic rules, and give the way for the normal running vehicles. Vehicles that will enter the main road should give way to the vehicles which drive on the main road or leave the main road. Vehicles that drive on the auxiliary road should give way to the vehicles which drive off the main road.

AVD2 Prediction
Description:
A vehicle changes lanes with the same direction to ego-car; Vehicles don't give way to normal driving vehicles when turning or changing lanes.
Avoidance:
Ego-cars should not exceed the speed limit during driving, slow down when passing intersections or crosswalks, especially for areas with many pedestrians.
ChatGPT-4o Prediction
Description:
A vehicle makes a sharp turn at an intersection without signaling; The vehicle behind is forced to brake abruptly due to insufficient reaction time.
Avoidance:
Drivers should signal well in advance before making turns at intersections. Maintaining a safe distance from other vehicles and anticipating sudden turns can help prevent accidents.
GroundTruth
Description:
Vehicles meet on the road; Vehicles drive too fast with short braking distance.
Avoidance:
Vehicles should not exceed the speed limit during driving, especially in areas with many pedestrians. Vehicles should slow down when passing intersections or crosswalks, and observe the traffic carefully.
We are grateful for the support of the Institute for AI Industry Research (AIR) at Tsinghua University and Lightwheel AI, for Kairui Ding's help with our project, and to the LOTVS-MMAU (Multi-Modal Accident video Understanding) team for open-sourcing and sharing the MM-AU dataset.


