The dataset is collected from YouTube; the ID of each video can be found in the annotation files. Due to data privacy policies, we are currently unable to release the original videos publicly.
We use VGGish to extract audio features, and ResNet18 and R(2+1)D-18 to extract visual features (a loading sketch follows the list below).
VGGish feature: Google Drive, Baidu Drive (pwd: lfav), (~212M).
ResNet18 feature: Google Drive, Baidu Drive (pwd: lfav), (~1.9G).
R(2+1)D-18 feature: Google Drive, Baidu Drive (pwd: lfav), (~1.9G).
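The internal layout of the feature archives is not documented here; below is a minimal loading sketch, assuming each archive unpacks to one .npy file per video named by its YouTube ID (the directory names and shapes in the comments are assumptions, not part of the release):

```python
import numpy as np

# Sketch only: paths and file naming are assumptions -- we suppose each archive
# unpacks to one .npy file per video, keyed by its YouTube video ID.
video_id = "XXXXXXXXXXX"  # placeholder ID taken from an annotation file

audio_feat = np.load(f"feats/vggish/{video_id}.npy")  # VGGish: 128-d embedding per audio segment
res18_feat = np.load(f"feats/res18/{video_id}.npy")   # ResNet18: 512-d feature per sampled frame
r21d_feat = np.load(f"feats/r21d/{video_id}.npy")     # R(2+1)D-18: 512-d feature per clip

print(audio_feat.shape, res18_feat.shape, r21d_feat.shape)
```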
Annotation files are in the folder LFAV_dataset.
# LFAV training set annotations
cd LFAV_dataset
cd ./train
train_audio_weakly.csv: video-level audio annotations of the training set
train_visual_weakly.csv: video-level visual annotations of the training set
train_weakly.csv: video-level annotations (union of the video-level audio and visual annotations) of the training set
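The exact CSV schema is easiest to confirm by opening a file; as a minimal sketch, assuming each row of a video-level file pairs a video ID with a delimited list of event labels (the column positions and the comma delimiter are assumptions):

```python
import pandas as pd

# Sketch only: we assume the first column holds the video ID and the second a
# comma-delimited list of event labels; check the real header before relying on this.
df = pd.read_csv("LFAV_dataset/train/train_weakly.csv")
id_col, label_col = df.columns[0], df.columns[1]
for _, row in df.head(3).iterrows():
    events = str(row[label_col]).split(",")
    print(row[id_col], events)
```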
# LFAV validation set annotations
cd LFAV_dataset
cd ./val
val_audio_weakly.csv: video-level audio annotations of the validation set
val_visual_weakly.csv: video-level visual annotations of the validation set
val_weakly_av.csv: video-level annotations (union of the video-level audio and visual annotations) of the validation set
val_audio.csv: event-level audio annotations of the validation set
val_visual.csv: event-level visual annotations of the validation set
# LFAV testing set annotations
cd LFAV_dataset
cd ./test
test_audio_weakly.csv: video-level audio annotations of the testing set
test_visual_weakly.csv: video-level visual annotations of the testing set
test_weakly_av.csv: video-level annotations (union of the video-level audio and visual annotations) of the testing set
test_audio.csv: event-level audio annotations of the testing set
test_visual.csv: event-level visual annotations of the testing set
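Unlike the video-level files, the event-level files also carry temporal extents. A minimal parsing sketch, assuming each row holds a video ID, an event label, and onset/offset times in seconds (the column order and units are assumptions):

```python
import csv
from collections import defaultdict

# Sketch only: rows are assumed to be (video_id, event, onset_sec, offset_sec);
# verify the order against the real header before use.
segments = defaultdict(list)
with open("LFAV_dataset/val/val_audio.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for video_id, event, onset, offset in reader:
        segments[video_id].append((event, float(onset), float(offset)))

# segments[video_id] is now a list of (event, onset, offset) audio annotations.
```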
The source code is in the folder src.
The script for training all three phases is:
src/scripts/train_s3.sh
To train only one or two phases, set the argument "num_stages" to 1 or 2.
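For reference, here is a hypothetical sketch of how such a flag is typically consumed; the real definition lives in the code under src, and the argparse setup below is an assumption, not the repository's actual code:

```python
import argparse

# Hypothetical illustration of the "num_stages" flag; the actual parser in src/
# may differ. Training fewer stages runs only the first one or two phases.
parser = argparse.ArgumentParser()
parser.add_argument("--num_stages", type=int, default=3, choices=[1, 2, 3],
                    help="number of phases to train (1, 2, or 3)")
args = parser.parse_args()
print(f"Training {args.num_stages} phase(s)")
```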
The script for testing all three phases is:
src/scripts/test_s3.sh
We also provide the trained weights of the complete three-phase method: Google Drive, Baidu Drive (pwd: lfav).
If you find our work useful in your research, please cite our paper:
@article{hou2024toward,
title={Toward Long Form Audio-Visual Video Understanding},
author={Hou, Wenxuan and Li, Guangyao and Tian, Yapeng and Hu, Di},
journal={ACM Transactions on Multimedia Computing, Communications and Applications},
volume={20},
number={9},
pages={1--26},
year={2024},
publisher={ACM New York, NY}
}
This research was supported by the National Natural Science Foundation of China (No. 62106272) and by the Public Computing Cloud, Renmin University of China.
The source code references AVVP-ECCV20.
This project is released under the CC BY-NC 4.0 License.