
Sound2Scene (CVPR 2023) and Sound2Vision (arXiv 2024)

This repository contains a PyTorch implementation of the CVPR 2023 paper, Sound2Scene: Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment (V1), and its extended paper, Sound2Vision (V2). Sound2Scene and Sound2Vision are sound-to-image generative models trained solely on unlabeled videos to generate images from sound.


Sound2Scene (CVPR 2023)

Getting started

This code was developed on Ubuntu 18.04 with Python 3.8, CUDA 11.1 and PyTorch 1.8.0. Later versions should work, but have not been tested.

Installation

Create and activate a virtual environment to work in.

conda create --name sound2scene python=3.8.8
conda activate sound2scene

Install PyTorch. For CUDA 11.1, this would look like:

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
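
Optionally, you can confirm from Python that the intended build is active before continuing. This is only a quick sanity check; the expected versions correspond to the install command above.

# Optional sanity check for the Sound2Scene environment.
import torch
import torchvision

print(torch.__version__)          # expected: 1.8.0+cu111
print(torchvision.__version__)    # expected: 0.9.0+cu111
print(torch.cuda.is_available())  # expected: True on a machine with CUDA 11.1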

Install the remaining requirements with pip:

pip install -r requirements.txt

Download Models

To run Sound2Scene, you need to download the image encoder (SWAV), the image decoder (BigGAN), and the Sound2Scene model. Download Sound2Scene | SWAV | BigGAN.

After downloading the models, place them in ./checkpoints.

./checkpoints/icgan_biggan_imagenet_res128
./checkpoints/sound2scene.pth
./checkpoints/swav.pth.tar

Highly correlated audio-visual pair dataset

We provide the annotations of the highly correlated audio-visual pairs from the VGGSound dataset.

Download top1_boxes_top10_moments.json

The annotation file contains each video name with the corresponding top 10 audio-visually correlated frame numbers.

{'9fhhMaXTraI_44000_54000': [47, 46, 45, 23, 42, 9, 44, 56, 27, 17],
'G_JwMzRLRNo_252000_262000': [2, 1, 26, 29, 15, 16, 11, 3, 14, 23], ...}

# 9fhhMaXTraI_44000_54000: video name
# [47, 46, 45, 23, 42, 9, 44, 56, 27, 17]: frame numbers (e.g., the 47th frame, the 46th frame, ...)
# the 47th frame is the most highly audio-visually correlated frame

Please follow the steps below to select a highly correlated audio-visual pair dataset.

(Step 1) Download the training dataset from VGGSound.

(Step 2) Extract the frames of each video at 10 fps.

(Step 3) Select the frames listed in the annotation file.
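
As an illustration of Step 3, the sketch below copies the top-1 correlated frame of each video using the annotation file. The frame directory and the zero-padded file naming are assumptions about your frame extractor, not part of the released code; adjust the paths to match your setup.

# Sketch of Step 3: pick the most audio-visually correlated frame per video.
# Assumption: frames extracted at 10 fps are stored as frames/<video_name>/<idx>.jpg.
import json
import os
import shutil

with open("top1_boxes_top10_moments.json") as f:
    moments = json.load(f)

os.makedirs("selected_frames", exist_ok=True)
for video_name, frame_ids in moments.items():
    top1 = frame_ids[0]  # the first entry is the most correlated frame
    src = os.path.join("frames", video_name, f"{top1:06d}.jpg")
    dst = os.path.join("selected_frames", f"{video_name}.jpg")
    if os.path.exists(src):
        shutil.copy(src, dst)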

If you find this dataset helpful, please also consider citing: Less Can Be More: Sound Source Localization With a Classification Model.

The VEGAS dataset is available here.

Training Sound2Scene

Run the command below to train the model.

We provide sample image and audio pairs in ./samples/training.

The samples are only for verifying that the training code runs.

For the full dataset, please download the training dataset from VGGSound or VEGAS.

Although we provide the categories we used (category_list), no category information was used for training.

python train.py --data_path [path containing image and audio pairs] --save_path [path for saving the checkpoints]

# or

bash train.sh
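
The exact directory layout expected by train.py is defined by its data loader, so the easiest way to prepare your own data is to mirror ./samples/training. The snippet below merely prints the sample tree so you can copy its structure; it is not part of the training pipeline.

# Print the layout of the provided training samples for reference.
import os

sample_root = "./samples/training"
for root, dirs, files in os.walk(sample_root):
    depth = root.replace(sample_root, "").count(os.sep)
    print("  " * depth + os.path.basename(root) + "/")
    for name in sorted(files)[:5]:  # show a few files per directory
        print("  " * (depth + 1) + name)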

Inference

bash test.sh

Evaluating Sound2Scene

(1) We used the off-the-shelf CLIP model (ViT-B/32) to evaluate R@k performance.

(2) We trained an Inception model on VGGSound to measure FID and the Inception score.
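
The full evaluation protocol is described in the paper. The sketch below only illustrates one plausible way to compute a CLIP-based R@k: generated images are scored against text prompts built from category names (e.g., from category_list), and a hit is counted when the ground-truth category ranks in the top k. The function name recall_at_k and the prompt template are illustrative assumptions, not the official evaluation script.

# Illustrative CLIP-based R@k computation (ViT-B/32), not the official script.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

categories = ["playing violin", "dog barking", "sea waves"]  # e.g., from category_list
text_tokens = clip.tokenize([f"a photo of {c}" for c in categories]).to(device)

def recall_at_k(image_paths, gt_label_ids, k=1):
    """gt_label_ids[i] is the index into `categories` for image_paths[i]."""
    k = min(k, len(categories))
    hits = 0
    with torch.no_grad():
        text_feat = model.encode_text(text_tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        for path, gt in zip(image_paths, gt_label_ids):
            image = preprocess(Image.open(path)).unsqueeze(0).to(device)
            img_feat = model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            topk = (img_feat @ text_feat.T).squeeze(0).topk(k).indices.tolist()
            hits += int(gt in topk)
    return hits / len(image_paths)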

Sound2Vision (arXiv 2024)

Getting started

This code was developed on Ubuntu 18.04 with Python 3.8, CUDA 11.1 and PyTorch 1.9.1. Later versions should work, but have not been tested.

Installation

Create and activate a virtual environment to work in.

conda create --name sound2scene python=3.8.8
conda activate sound2scene

Install PyTorch. For CUDA 11.1, this would look like:

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

Install the remaining requirements with pip:

pip install -r requirements.txt
pip install openai-clip==1.0.1
pip install transformers==4.28.1
pip install diffusers==0.15.0
pip install git+https://github.com/wenet-e2e/wespeaker.git

Download Models

To run Sound2Vision, you need to download Sound2Vision_env and Sound2Vision_face. Sound2Vision_env is trained on the VGGSound dataset (environmental sounds), and Sound2Vision_face is trained on the CelebV-HQ dataset (face-speech).

Download Sound2Vision_env | Sound2Vision_face

After downloading the models, place them in ./checkpoints.

./checkpoints/sound2vision_env.pth
./checkpoints/sound2vision_face.pth

Training Sound2Vision

Run the command below to train the model.

We provide sample image and audio pairs in ./samples/training.

The samples are only for verifying that the training code runs.

python train_sound2vision.py --data_path [path containing image and audio pairs] --save_path [path for saving the checkpoints]

Inference

# for generating images from environmental sound
python test_sound2vision.py --ckpt_path ./checkpoints/sound2vision_env.pth --wav_path ./samples/inference --output_path ./samples/output --input_data env

# for generating human face images from speech
python test_sound2vision.py --ckpt_path ./checkpoints/sound2vision_face.pth --wav_path ./samples/inference_face --output_path ./samples/output --input_data face

Citation

If you find our code or paper helpful, please consider citing:

@inproceedings{sung2023sound,
  author    = {Sung-Bin, Kim and Senocak, Arda and Ha, Hyunwoo and Owens, Andrew and Oh, Tae-Hyun},
  title     = {Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment},
  booktitle   = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2023}
}
@article{sung2024sound2vision,
  title={Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment},
  author={Sung-Bin, Kim and Senocak, Arda and Ha, Hyunwoo and Oh, Tae-Hyun},
  journal={arXiv preprint arXiv:2412.06209},
  year={2024}
}

Acknowledgment

This work was supported by the IITP grant funded by the Korea government (MSIT) (No.2021-0-02068, Artificial Intelligence Innovation Hub; No.2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities). GPU resources were supported by the HPC Support Project, MSIT and NIPA.

The implementation of Sound2Scene borrows much of its codebase from the seminal prior works ICGAN and VGGSound. We thank the authors of these works for making their code public. If you find our work helpful, please consider citing them as well.
