The dataset is collected from YouTube; the ID of each video can be found in the annotation files. Due to data privacy policies, we are currently unable to release the original videos publicly.
We use VGGish to extract audio features, and ResNet18 and R(2+1)D-18 to extract visual features (a loading sketch follows the list below).
VGGish feature: Google Drive, Baidu Drive (pwd: lfav), (~212M).
ResNet18 feature: Google Drive, Baidu Drive (pwd: lfav), (~1.9G).
R(2+1)D-18 feature: Google Drive, Baidu Drive (pwd: lfav), (~1.9G).
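The internal layout of the feature archives is not documented here; below is a minimal loading sketch, assuming each archive unpacks to one .npy file per video named by its YouTube ID (the directory names and shapes in the comments are assumptions, not part of the release):

```python
import numpy as np

# Sketch only: paths and file naming are assumptions -- we suppose each archive
# unpacks to one .npy file per video, keyed by its YouTube video ID.
video_id = "XXXXXXXXXXX"  # placeholder ID taken from an annotation file

audio_feat = np.load(f"feats/vggish/{video_id}.npy")  # VGGish: 128-d embedding per audio segment
res18_feat = np.load(f"feats/res18/{video_id}.npy")   # ResNet18: 512-d feature per sampled frame
r21d_feat = np.load(f"feats/r21d/{video_id}.npy")     # R(2+1)D-18: 512-d feature per clip

print(audio_feat.shape, res18_feat.shape, r21d_feat.shape)
```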
Annotation files are in the folder LFAV_dataset.
# LFAV training set annotations
cd LFAV_dataset
cd ./train
train_audio_weakly.csv: video-level audio annotations of the training set
train_visual_weakly.csv: video-level visual annotations of the training set
train_weakly.csv: video-level annotations (union of the video-level audio and visual annotations) of the training set
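The exact CSV schema is easiest to confirm by opening a file; as a minimal sketch, assuming each row of a video-level file pairs a video ID with a delimited list of event labels (the column positions and the comma delimiter are assumptions):

```python
import pandas as pd

# Sketch only: we assume the first column holds the video ID and the second a
# comma-delimited list of event labels; check the real header before relying on this.
df = pd.read_csv("LFAV_dataset/train/train_weakly.csv")
id_col, label_col = df.columns[0], df.columns[1]
for _, row in df.head(3).iterrows():
    events = str(row[label_col]).split(",")
    print(row[id_col], events)
```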
# LFAV validation set annotations
cd LFAV_dataset
cd ./val
val_audio_weakly.csv: video-level audio annotations of the validation set
val_visual_weakly.csv: video-level visual annotations of the validation set
val_weakly_av.csv: video-level annotations (union of the video-level audio and visual annotations) of the validation set
val_audio.csv: event-level audio annotations of the validation set
val_visual.csv: event-level visual annotations of the validation set
# LFAV testing set annotations
cd LFAV_dataset
cd ./test
test_audio_weakly.csv: video-level audio annotations of the testing set
test_visual_weakly.csv: video-level visual annotations of the testing set
test_weakly_av.csv: video-level annotations (union of the video-level audio and visual annotations) of the testing set
test_audio.csv: event-level audio annotations of the testing set
test_visual.csv: event-level visual annotations of the testing set
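Unlike the video-level files, the event-level files also carry temporal extents. A minimal parsing sketch, assuming each row holds a video ID, an event label, and onset/offset times in seconds (the column order and units are assumptions):

```python
import csv
from collections import defaultdict

# Sketch only: rows are assumed to be (video_id, event, onset_sec, offset_sec);
# verify the order against the real header before use.
segments = defaultdict(list)
with open("LFAV_dataset/val/val_audio.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for video_id, event, onset, offset in reader:
        segments[video_id].append((event, float(onset), float(offset)))

# segments[video_id] is now a list of (event, onset, offset) audio annotations.
```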
The source code is in the folder src.
The script for training all three phases is:
src/scripts/train_s3.sh
To train only one or two phases, set the argument "num_stages" to 1 or 2.
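For reference, here is a hypothetical sketch of how such a flag is typically consumed; the real definition lives in the code under src, and the argparse setup below is an assumption, not the repository's actual code:

```python
import argparse

# Hypothetical illustration of the "num_stages" flag; the actual parser in src/
# may differ. Training fewer stages runs only the first one or two phases.
parser = argparse.ArgumentParser()
parser.add_argument("--num_stages", type=int, default=3, choices=[1, 2, 3],
                    help="number of phases to train (1, 2, or 3)")
args = parser.parse_args()
print(f"Training {args.num_stages} phase(s)")
```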
The script for testing all three phases is:
src/scripts/test_s3.sh
We also provide the trained weights of the complete three-phase method: Google Drive, Baidu Drive (pwd: lfav).
If you find our work useful in your research, please cite our paper:
@article{hou2024toward,
title={Toward Long Form Audio-Visual Video Understanding},
author={Hou, Wenxuan and Li, Guangyao and Tian, Yapeng and Hu, Di},
journal={ACM Transactions on Multimedia Computing, Communications and Applications},
volume={20},
number={9},
pages={1--26},
year={2024},
publisher={ACM New York, NY}
}
This research was supported by the National Natural Science Foundation of China (No. 62106272) and by the Public Computing Cloud, Renmin University of China.
The source code references AVVP-ECCV20.
This project is released under the CC BY-NC 4.0 License.