This repo is the official PyTorch implementation for:
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer. Accepted by CVPR 2025.
In this paper, we propose a new Vision-Centric Transformer (VCT) framework that leverages vision-derived queries to iteratively fetch corresponding audio and visual information, enabling the queries to better distinguish different sounding objects in mixed audio and to accurately delineate their contours. We also introduce a Prototype Prompted Query Generation (PPQG) module within the VCT framework that generates vision-derived queries that are both semantically aware and visually rich through audio prototype prompting and pixel context grouping, facilitating audio-visual information aggregation.
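As a rough illustration only (not the paper's actual implementation), the vision-centric decoding step can be sketched as vision-derived queries cross-attending first to audio tokens and then to visual features; all names and shapes below are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: queries fetch information from keys_values."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))
    return attn @ keys_values

def vct_decoder_step(queries, audio_feats, visual_feats):
    """One hypothetical VCT iteration: attend to audio cues, then visual context."""
    queries = queries + cross_attend(queries, audio_feats)   # fetch audio information
    queries = queries + cross_attend(queries, visual_feats)  # refine object contours
    return queries

# Toy shapes: 8 vision-derived queries, 4 audio tokens, 64 pixel features, dim 16.
rng = np.random.default_rng(0)
q = vct_decoder_step(rng.normal(size=(8, 16)),
                     rng.normal(size=(4, 16)),
                     rng.normal(size=(64, 16)))
```

In the actual model the step is repeated across decoder layers so queries progressively separate the different sounding objects.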
First set up a conda environment, then clone this repo and install the dependencies:
conda create -n vct_avs python==3.8 -y
conda activate vct_avs
git clone https://github.com/spyflying/VCT_AVS.git
cd VCT_AVS
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
git clone https://github.com/facebookresearch/detectron2
cd detectron2
pip install -e .
cd ..
pip install -r requirements.txt
cd models/modeling/pixel_decoder/ops
bash make.sh
This repository is built upon the COMBO-AVS codebase. Please refer to the original COMBO-AVS repository for more detailed installation instructions.
Download the AVSBench dataset and organize the data folders as follows:
|--AVS_dataset
    |--AVSBench_semantic/
    |--AVSBench_object/Multi-sources/
    |--AVSBench_object/Single-source/
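Before preprocessing, a quick sanity check of the layout can save a failed run; `missing_dirs` is a hypothetical helper, not part of this repo:

```python
import os

# Expected AVSBench layout relative to the project root (see the tree above).
EXPECTED = [
    "AVS_dataset/AVSBench_semantic",
    "AVS_dataset/AVSBench_object/Multi-sources",
    "AVS_dataset/AVSBench_object/Single-source",
]

def missing_dirs(root="."):
    """Return the expected AVSBench folders that are absent under `root`."""
    return [p for p in EXPECTED if not os.path.isdir(os.path.join(root, p))]
```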
Process the dataset for 384x384 resolution by running:
python avs_tools/preprocess_avss_audio.py
python avs_tools/generate_data_384/ms3_process.py
python avs_tools/generate_data_384/s4_process.py
python avs_tools/generate_data_384/ss_process.py
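The scripts above perform the actual preprocessing. One detail worth noting: when resizing annotation masks to 384x384, nearest-neighbor sampling must be used so that label ids are not blended at object boundaries. A minimal sketch of that idea (hypothetical helper, not the repo's code):

```python
import numpy as np

def resize_mask_nearest(mask, size=384):
    """Resize a 2-D label mask to size x size with nearest-neighbor sampling.

    Nearest-neighbor keeps integer class ids intact; bilinear interpolation
    would invent fractional labels at object boundaries.
    """
    h, w = mask.shape
    rows = (np.arange(size) * h / size).astype(int)  # source row for each output row
    cols = (np.arange(size) * w / size).astype(int)  # source col for each output col
    return mask[np.ix_(rows, cols)]
```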
Download the Swin-Base-384 backbone pretrained on ImageNet-22K from download, then convert the original checkpoint:
cd avs_tools
python swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl
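The name of the conversion script is omitted in the command above. The usual recipe, following Detectron2's pickled-checkpoint convention (as in Mask2Former's convert-pretrained-swin-model-to-d2.py), looks roughly like this; `wrap_for_detectron2` is a hypothetical name:

```python
import pickle

def wrap_for_detectron2(state_dict):
    """Wrap a raw Swin state dict in the .pkl layout Detectron2 loads.

    "matching_heuristics" asks Detectron2 to fuzzy-match backbone keys
    when the naming scheme differs from its own.
    """
    return {"model": state_dict,
            "__author__": "third_party",
            "matching_heuristics": True}

if __name__ == "__main__":
    import sys
    import torch  # only needed here, to read the .pth file

    ckpt = torch.load(sys.argv[1], map_location="cpu")
    state = ckpt.get("model", ckpt)  # timm checkpoints nest weights under "model"
    with open(sys.argv[2], "wb") as f:
        pickle.dump(wrap_for_detectron2(state), f)
```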
Please refer to COMBO-AVS to download other pretrained models.
We provide SOTA checkpoints for the Swin-B-384 setting:
| Subset | M_J | M_F | HuggingFace Link |
|---|---|---|---|
| Single-Source | 86.2 | 93.4 | download |
| Multi-Source | 67.6 | 81.4 | download |
| Semantic | 52.5 | 56.9 | download |
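M_J and M_F above are the mean Jaccard index (mask IoU) and mean F-score used for AVS evaluation. A minimal per-mask version, assuming the standard F-measure with beta^2 = 0.3 (the constants and names here are illustrative, not the repo's evaluation code):

```python
import numpy as np

def jaccard(pred, gt):
    """M_J: intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def f_score(pred, gt, beta2=0.3):
    """M_F: F-measure with beta^2 = 0.3, weighting precision over recall."""
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0
```

The benchmark numbers in the table are these scores averaged over all frames in each subset.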
Run the following command to evaluate a given checkpoint, replacing $subset$ with the name of the corresponding subset script under scripts/:
sh scripts/$subset$_swinb_384_test.sh
Run the following command for training:
sh scripts/$subset$_swinb_384_train.sh
If you find this repo useful for your research, please cite:
@inproceedings{huang2025revisiting,
  title={Revisiting Audio-Visual Segmentation with Vision-Centric Transformer},
  author={Huang, Shaofei and Ling, Rui and Hui, Tianrui and Li, Hongyu and Zhou, Xu and Zhang, Shifeng and Liu, Si and Hong, Richang and Wang, Meng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}
For questions about our paper or code, please contact Shaofei Huang (nowherespyfly@gmail.com).
This repo is mostly derived from the COMBO-AVS codebase. Thanks for their efforts.