This repo is the official PyTorch implementation for:
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer. Accepted by CVPR 2025.
In this paper, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling the queries to better distinguish different sounding objects in mixed audio and to accurately delineate their contours. We also introduce a Prototype Prompted Query Generation (PPQG) module within the VCT framework that generates vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation.
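As a rough illustration only (not the paper's actual implementation), the vision-centric decoding step can be sketched as vision-derived queries cross-attending first to audio tokens and then to visual features; all names and shapes below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: queries fetch information from keys_values."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values

def vct_decoder_step(queries, audio_feats, visual_feats):
    """One hypothetical VCT iteration: attend to audio cues, then visual context."""
    queries = queries + cross_attend(queries, audio_feats)   # fetch audio information
    queries = queries + cross_attend(queries, visual_feats)  # refine object contours
    return queries

# Toy shapes: 8 vision-derived queries, 4 audio tokens, 64 pixel features, dim 16.
rng = np.random.default_rng(0)
q = vct_decoder_step(rng.normal(size=(8, 16)),
                     rng.normal(size=(4, 16)),
                     rng.normal(size=(64, 16)))
```

In the actual model the step is repeated across decoder layers so queries progressively separate the different sounding objects.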
First set up a conda environment, then clone this repo and install the dependencies:
conda create -n vct_avs python==3.8 -y
conda activate vct_avs
git clone https://github.com/spyflying/VCT_AVS.git
cd VCT_AVS
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/facebookresearch/detectron2
cd detectron2
pip install -e .
cd ..
pip install -r requirements.txt
cd models/modeling/pixel_decoder/ops
bash make.sh
This repository is built upon the COMBO-AVS codebase. Please refer to the original COMBO-AVS repository for more detailed installation instructions.
Download the AVSBench dataset and organize the data folders as follows:
|--AVS_dataset
    |--AVSBench_semantic/
    |--AVSBench_object/Multi-sources/
    |--AVSBench_object/Single-source/
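Before preprocessing, a quick sanity check of the layout can save a failed run; `missing_dirs` is a hypothetical helper, not part of this repo:

```python
import os

# Expected AVSBench layout relative to the project root (see the tree above).
EXPECTED = [
    "AVS_dataset/AVSBench_semantic",
    "AVS_dataset/AVSBench_object/Multi-sources",
    "AVS_dataset/AVSBench_object/Single-source",
]

def missing_dirs(root="."):
    """Return the expected AVSBench folders that are absent under `root`."""
    return [p for p in EXPECTED if not os.path.isdir(os.path.join(root, p))]
```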
Process the dataset for 384x384 resolution by running:
python avs_tools/preprocess_avss_audio.py
python avs_tools/generate_data_384/ms3_process.py
python avs_tools/generate_data_384/s4_process.py
python avs_tools/generate_data_384/ss_process.py
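The scripts above perform the actual preprocessing. One detail worth noting: when resizing annotation masks to 384x384, nearest-neighbor sampling must be used so that label ids are not blended at object boundaries. A minimal sketch of that idea (hypothetical helper, not the repo's code):

```python
import numpy as np

def resize_mask_nearest(mask, size=384):
    """Resize a 2-D label mask to size x size with nearest-neighbor sampling.

    Nearest-neighbor keeps integer class ids intact; bilinear interpolation
    would invent fractional labels at object boundaries.
    """
    h, w = mask.shape
    rows = (np.arange(size) * h / size).astype(int)  # source row for each output row
    cols = (np.arange(size) * w / size).astype(int)  # source col for each output col
    return mask[np.ix_(rows, cols)]
```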
Download the Swin-Base-384 backbone pretrained on ImageNet-22K from download, then convert the original checkpoint:
cd avs_tools
python swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl
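The name of the conversion script is omitted in the command above. The usual recipe, following Detectron2's pickled-checkpoint convention (as in Mask2Former's convert-pretrained-swin-model-to-d2.py), looks roughly like this; `wrap_for_detectron2` is a hypothetical name:

```python
import pickle

def wrap_for_detectron2(state_dict):
    """Wrap a raw Swin state dict in the .pkl layout Detectron2 loads.

    "matching_heuristics" asks Detectron2 to fuzzy-match backbone keys
    when the naming scheme differs from its own.
    """
    return {"model": state_dict,
            "__author__": "third_party",
            "matching_heuristics": True}

if __name__ == "__main__":
    import sys
    import torch  # only needed here, to read the .pth file

    ckpt = torch.load(sys.argv[1], map_location="cpu")
    state = ckpt.get("model", ckpt)  # timm checkpoints nest weights under "model"
    with open(sys.argv[2], "wb") as f:
        pickle.dump(wrap_for_detectron2(state), f)
```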
Please refer to COMBO-AVS to download other pretrained models.
We provide SOTA checkpoints for the Swin-B-384 setting:
| Subset | M_J | M_F | HuggingFace Link |
|---|---|---|---|
| Single-Source | 86.2 | 93.4 | download |
| Multi-Source | 67.6 | 81.4 | download |
| Semantic | 52.5 | 56.9 | download |
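M_J and M_F above are the mean Jaccard index (mask IoU) and mean F-score used for AVS evaluation. A minimal per-mask version, assuming the standard F-measure with beta^2 = 0.3 (the constants and names here are illustrative, not the repo's evaluation code):

```python
import numpy as np

def jaccard(pred, gt):
    """M_J: intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def f_score(pred, gt, beta2=0.3):
    """M_F: F-measure with beta^2 = 0.3, weighting precision over recall."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0
```

The benchmark numbers in the table are these scores averaged over all frames in each subset.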
Run the following command to evaluate a given checkpoint, replacing $subset$ with the name of the corresponding subset script under scripts/:
sh scripts/$subset$_swinb_384_test.sh
Run the following command for training:
sh scripts/$subset$_swinb_384_train.sh
If you find this repo useful for your research, please cite:
@inproceedings{huang2025revisiting,
  title={Revisiting Audio-Visual Segmentation with Vision-Centric Transformer},
  author={Huang, Shaofei and Ling, Rui and Hui, Tianrui and Li, Hongyu and Zhou, Xu and Zhang, Shifeng and Liu, Si and Hong, Richang and Wang, Meng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
For questions about our paper or code, please contact Shaofei Huang (nowherespyfly@gmail.com).
This repo is mostly derived from the COMBO-AVS codebase. Thanks for their efforts.