GitHub - YenanLiu/DDESeg: The code for DDESeg [CVPR25].

DDESeg

This repository provides the PyTorch implementation for the CVPR 2025 paper "Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics". [arXiv]

Approach

DDESeg reconstructs the semantic content of the mixed audio signal by enriching the distinct semantic information of each individual source, deriving representations that preserve the unique characteristics of each sound. To reduce the matching difficulty, we introduce a discriminative feature learning module, which enhances the semantic distinctiveness of generated audio representations. Considering that not all derived audio representations directly correspond to visual features (e.g., off-screen sounds), we propose a dynamic elimination module to filter out non-matching elements. This module facilitates targeted interaction between sounding regions and relevant audio semantics. By scoring the interacted features, we identify and filter out irrelevant audio information, ensuring accurate audio-visual alignment.
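
To make the elimination idea concrete, here is a toy sketch of "score, then filter". It is not the DDESeg module: it assumes a fixed cosine-similarity score and a hard threshold purely for illustration, and the names `eliminate_unmatched`, `audio_reprs`, `visual_feats`, and `threshold` are hypothetical. Refer to the paper for the actual design.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def eliminate_unmatched(audio_reprs, visual_feats, threshold=0.5):
    """Keep (index, score) for audio representations that match some visual region.

    The score is the best cosine similarity against any visual feature.
    This stand-in score is an assumption, not the scoring used in the paper.
    """
    kept = []
    for idx, audio in enumerate(audio_reprs):
        score = max(cosine(audio, v) for v in visual_feats)
        if score >= threshold:
            kept.append((idx, score))
    return kept


if __name__ == "__main__":
    # Three derived audio representations; only the first aligns with a visible region.
    audio = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    visual = [[1.0, 0.1], [0.9, 0.0]]
    print(eliminate_unmatched(audio, visual))
```

In this toy setting the second and third audio vectors play the role of off-screen sounds: no visual feature scores above the threshold for them, so they are dropped before any further audio-visual interaction.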

1. Environment Preparation

Instructions for preparing your environment to run the code:

   $ conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0  pytorch-cuda=11.8 -c pytorch -c nvidia
   $ conda env create -f ddeseg_conda.yml

2. Data Preparation

We reorganized the directory structures of AVSS and VPO to unify the data loading process. To get started quickly, you can use our reorganized dataset directly here. Alternatively, you can download the original datasets from the AVSS and VPO repositories and implement your own data loader.

To facilitate a quick start, we also provide newly generated audio memory bank and bounding box files (all as zip files) for the sounding objects; the bbox file is used to constrain data augmentation. The audio memory bank is constructed using K-Means (details are provided in our paper). You can also generate your own audio memory bank and adjust the number of clusters for better performance.
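
If you want to experiment with your own memory bank, the following is a minimal sketch of the general idea: cluster the audio feature vectors of each class with plain K-Means and store the centroids. It is an assumption-laden illustration, not the procedure from the paper; `kmeans`, `build_memory_bank`, the feature format (lists of floats keyed by class label), and the toy data are all hypothetical. In practice a library implementation such as scikit-learn's `KMeans` would be the more usual choice.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-Means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c, members in enumerate(clusters):
            if members:
                updated.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                updated.append(centroids[c])  # an empty cluster keeps its old centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


def build_memory_bank(features_by_class, num_clusters):
    """Map each class label to its list of cluster centroids (hypothetical bank layout)."""
    return {label: kmeans(feats, num_clusters) for label, feats in features_by_class.items()}


if __name__ == "__main__":
    toy = {"dog": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]}
    print(build_memory_bank(toy, num_clusters=2))
```

`num_clusters` is the knob referred to above: more clusters give a finer-grained bank per class at the cost of a larger bank.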

3. Pretrained Model Preparation

We provide the pretrained visual and audio backbones here. The visual backbone for DDESeg is trained on ImageNet, while the audio backbone is trained on AudioSet. Additionally, you can experiment with other versions and replace the backbones as needed.

4. Training Instructions

First [Necessary], replace every data file path in the code with the corresponding path under your own root directory.

Second [Optional], adjust the hyperparameters in configs/ddeseg.yaml and runner.py as needed.

Tip: We observed that a larger batch size can improve performance. A larger batch makes training more stable, which matters here because the AVS datasets contain completely silent cases.
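
For the first step, it can help to list where paths are hard-coded before editing them. The sketch below is a heuristic helper, not part of this repository: `find_hardcoded_paths` is a hypothetical name, and the regular expression only catches quoted POSIX-style absolute paths, so it may miss some paths and flag some unrelated strings.

```python
import re
from pathlib import Path

# Heuristic: a quoted string that starts with "/" and contains at least one more "/".
ABSOLUTE_PATH = re.compile(r"""["'](/[^"'\s]+/[^"'\s]*)["']""")


def find_hardcoded_paths(repo_root):
    """Return (file, line number, path literal) for every match in *.py files under repo_root."""
    hits = []
    for py_file in sorted(Path(repo_root).rglob("*.py")):
        text = py_file.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in ABSOLUTE_PATH.finditer(line):
                hits.append((str(py_file), lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    # Run from the repository root to list candidate paths to replace.
    for file_name, lineno, literal in find_hardcoded_paths("."):
        print(f"{file_name}:{lineno}: {literal}")
```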

AVSS Setting

python runner.py -ddp -wandb -wn "avss_train" -nc 71 --task 'v2' --cfg './configs/ddeseg.yaml' -train_b 50 -val_b 1

AVS-Object-V1s (s4) Setting

python runner.py -ddp -wandb -wn "avs_v1s_train" -nc 1 --task 'v1s' --cfg './configs/ddeseg.yaml' -train_b 50 -val_b 1

AVS-Object-V1m (MS3) Setting

python runner.py -ddp -wandb -wn "avs_v1m_train" -nc 1 --task 'v1m' --cfg './configs/ddeseg.yaml' -train_b 50 -val_b 1

VPO-SS Setting

python runner.py -ddp -wandb -wn "VPO-SS_train" -nc 22 --task 'VPO-SS' --cfg './configs/ddeseg.yaml' -train_b 50 -val_b 1

VPO-MS Setting

python runner.py -ddp -wandb -wn "VPO-MS_train" -nc 22 --task 'VPO-MS' --cfg './configs/ddeseg.yaml' -train_b 50 -val_b 1

VPO-MSMI Setting

python runner.py -ddp -wandb -wn "VPO-MSMI_train" -nc 22 --task 'VPO-MSMI' --cfg './configs/ddeseg.yaml' -train_b 50 -val_b 1
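
The six commands above differ only in the run name, the class count (`-nc`), and `--task`. As a convenience, the sketch below rebuilds them from a small table so the batch size can be changed in one place. The flags and values are copied from the commands above; the `SETTINGS` table and `build_command` helper are hypothetical and not part of the repository.

```python
import shlex

# Task identifier -> (wandb run name, number of classes), copied from the commands above.
SETTINGS = {
    "v2": ("avss_train", 71),
    "v1s": ("avs_v1s_train", 1),
    "v1m": ("avs_v1m_train", 1),
    "VPO-SS": ("VPO-SS_train", 22),
    "VPO-MS": ("VPO-MS_train", 22),
    "VPO-MSMI": ("VPO-MSMI_train", 22),
}


def build_command(task, train_batch=50, val_batch=1, cfg="./configs/ddeseg.yaml"):
    """Return the argument list for one training setting."""
    run_name, num_classes = SETTINGS[task]
    return [
        "python", "runner.py", "-ddp", "-wandb",
        "-wn", run_name,
        "-nc", str(num_classes),
        "--task", task,
        "--cfg", cfg,
        "-train_b", str(train_batch),
        "-val_b", str(val_batch),
    ]


if __name__ == "__main__":
    # Print every command; pass a list to subprocess.run(...) to launch one instead.
    for task in SETTINGS:
        print(shlex.join(build_command(task)))
```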

5. Citation

@inproceedings{liu2025dynamic,
  title={Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics},
  author={Liu, Chen and Yang, Liying and Li, Peike and Wang, Dadong and Li, Lincheng and Yu, Xin},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={3131--3141},
  year={2025}
}
@misc{liu2025dynamicderivationeliminationaudio,
      title={Dynamic Derivation and Elimination: Audio Visual Segmentation with Enhanced Audio Semantics}, 
      author={Chen Liu and Liying Yang and Peike Li and Dadong Wang and Lincheng Li and Xin Yu},
      year={2025},
      eprint={2503.12840},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.12840}, 
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
