This project provides the code for 'Audio-Visual Saliency Prediction with Multisensory Perception and Integration', Image and Vision Computing, 2024. Paper link.
You can download from the original links above or from STAViS's resources.
- SlowFast, X3D and MViTv2, facebookresearch/SlowFast
- Uniformer, Sense-X/UniFormer
- VideoSwin, SwinTransformer/Video-Swin-Transformer
- MorphMLP, MTLab/MorphMLP
- S3D, kylemin/S3D
- ResNet18-VGGSound, hche11/VGGSound
😊 Thanks to the above researchers for releasing their code and sharing their model weights!
The variants used in the paper have been tested. If you use other variants of the above models, change the corresponding .yaml files and the settings in config.py.
For PySlowFast installation, you can refer to this, but it might not be compatible with our code.
If you use the PySlowFast code in this repository, the parts of the model code that connect to Detectron2 have been cut, so you can skip the Detectron2 installation.
timm==0.6.12
torch==1.11.0
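To confirm that an environment matches the two pins above, a small check along the following lines may help. This is a minimal sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, and it assumes that a local build suffix such as `+cu113` should be ignored when comparing versions.

```python
from importlib import metadata

# Pins copied from the requirements listed above.
PINNED = {"timm": "0.6.12", "torch": "1.11.0"}


def find_mismatches(pinned):
    """Return {package: installed_version_or_None} for every unsatisfied pin."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Assumption: ignore local build tags such as "+cu113" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    for name, installed in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {installed}")
```

An empty result means both pinned packages are installed at the listed versions.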
The dataset directory structure should be:
dataset/
    video_frames/
        .../ (directories named after the datasets)
    video_audio/
        .../ (directories named after the datasets)
    annotations/
        .../ (directories named after the datasets)
    fold_lists/
        *.txt (lists of dataset splits)
You can download it from this. The model is first trained on SALICON and then finetuned on MIT1003.
Set the paths to the dataset, the pretrained weight files and the YAML files, then select the backbone and other options. The following settings are crucial:
_model_name
cfg.DATA.ROOT
_MOTION_WEIGHTS
cfg.MODEL.IMAGE_SALIENCY_ENCODER_WEIGHT
cfg.MODEL.AUDIO_ENCODER_WEIGHT
.PATH_CFG
Then run the code using
$ python train.py --session_name --split --num_workers --save_ckpt_freq
Clone this repository and download the three-split weights of our model from this link. Then run the code using
$ python inference.py --weight path/to/weight --path_data path/to/dataset --split split/of/dataset
The MATLAB code is used for evaluation.
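Because the released weights come in three splits, the inference command can be generated once per split. The sketch below only builds and prints the commands, using the three flags shown in the inference command. The split identifiers 1 to 3, the weight paths and the helper name `build_inference_commands` are assumptions for illustration, not values taken from the repository.

```python
import shlex
import sys


def build_inference_commands(weights_by_split, path_data):
    """Build one inference.py command (as an argument list) per split."""
    commands = []
    for split, weight in sorted(weights_by_split.items()):
        commands.append([
            sys.executable, "inference.py",
            "--weight", str(weight),
            "--path_data", str(path_data),
            "--split", str(split),
        ])
    return commands


if __name__ == "__main__":
    # Placeholder split identifiers and weight paths (assumptions).
    weights = {
        1: "path/to/weight_split1",
        2: "path/to/weight_split2",
        3: "path/to/weight_split3",
    }
    for cmd in build_inference_commands(weights, "path/to/dataset"):
        print(shlex.join(cmd))
```

Each printed line can be run in a shell, or the argument lists can be passed to `subprocess.run(cmd, check=True)` from the repository root.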
If you find this project helpful, please feel free to cite our paper:
@article{XIE2024104955,
title = {Audio-visual saliency prediction with multisensory perception and integration},
journal = {Image and Vision Computing},
pages = {104955},
year = {2024},
issn = {0262-8856},
doi = {https://doi.org/10.1016/j.imavis.2024.104955},
url = {https://www.sciencedirect.com/science/article/pii/S0262885624000581},
author = {Jiawei Xie and Zhi Liu and Gongyang Li and Yingjie Song}
}