This repository is the official implementation of "MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation (ICLR 2025)".
Akio Hayakawa¹, Masato Ishii¹, Takashi Shibuya¹, Yuki Mitsufuji¹,²
¹Sony AI and ²Sony Group Corporation
We recommend using a Miniforge environment.
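For example, you can create and activate a dedicated environment like this (the environment name and Python version are illustrative choices, not taken from the repository):
# Create and activate a fresh environment (name and Python version are examples only)
conda create -n mmdisco python=3.10
conda activate mmdisco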
1. Clone the repository
git clone https://github.com/SonyResearch/MMDisCo.git

2. Install prerequisites if needed
conda install -c conda-forge ffmpeg

3. Install required Python libraries
cd MMDisCo
pip install -e .
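As an optional sanity check before downloading any checkpoints, you can confirm that ffmpeg is on your PATH and that PyTorch sees a GPU (a sketch; it assumes PyTorch is pulled in as a dependency of this package):
# Optional environment check (assumes PyTorch is installed as a dependency)
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"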
You need to download the weights of MM-Diffusion and VideoCrafter2 if you want to use these models as the base models. The weights of the other models are downloaded automatically.

MM-Diffusion
Please follow the official instructions to download the pre-trained models.
Both the base model (64x64) and the super-resolution model (64x64 -> 256x256) weights are required; place the downloaded weights in the ./checkpoints/mmdiffusion directory.
VideoCrafter2
Please follow the official instructions to download the pre-trained Text-to-Video model.
The downloaded weights should be placed in the ./checkpoints/video_crafter directory.
The pre-trained MMDisCo models are available at https://huggingface.co/AkioHayakawa/MMDisCo.
Download all files and place them in the ./checkpoints/mmdisco directory.
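If you prefer to fetch everything from the command line, the huggingface_hub CLI can download the whole repository in one step (a sketch; it assumes the files sit at the top level of the Hugging Face repo, so compare the result with the directory tree below):
# Download all MMDisCo checkpoint files into the expected directory
pip install -U "huggingface_hub[cli]"
huggingface-cli download AkioHayakawa/MMDisCo --local-dir ./checkpoints/mmdisco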
The expected directory structure that includes all pre-trained weights is:
MMDisCo
└── checkpoints
├── mmdiffusion
│ ├── AIST++.pt
│ ├── AIST++_SR.pt
│ ├── landscape.pt
│ └── landscape_SR.pt
├── video_crafter
│ └── base_512_v2
│ └── model.ckpt
└── mmdisco
├── audioldm_animediff_vggsound.pt
├── auffusion_videocrafter2_vggsound.pt
├── mmdiffusion_aist++.pt
└── mmdiffusion_landscape.pt
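To verify that everything is in place, a short shell check like the following (run from the repository root) reports any file from the tree above that is still missing:
# Report any expected checkpoint file that is missing
for f in \
    checkpoints/mmdiffusion/AIST++.pt \
    checkpoints/mmdiffusion/AIST++_SR.pt \
    checkpoints/mmdiffusion/landscape.pt \
    checkpoints/mmdiffusion/landscape_SR.pt \
    checkpoints/video_crafter/base_512_v2/model.ckpt \
    checkpoints/mmdisco/audioldm_animediff_vggsound.pt \
    checkpoints/mmdisco/auffusion_videocrafter2_vggsound.pt \
    checkpoints/mmdisco/mmdiffusion_aist++.pt \
    checkpoints/mmdisco/mmdiffusion_landscape.pt
do
    [ -f "$f" ] || echo "missing: $f"
done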
Joint Audio and Video Generation Using Pre-trained Text-to-Audio and Text-to-Video Models with MMDisCo
We provide a demo script for joint audio and video generation using two pairs of base models: AudioLDM / AnimateDiff and Auffusion / VideoCrafter2.
The generation script can be run as follows:
cd scripts/
# Using AudioLDM / AnimateDiff as base models
python generate.py model=audioldm_animediff_vggsound
# Using Auffusion / VideoCrafter2 as base models
python generate.py model=auffusion_videocrafter2_vggsound

The output videos will be placed in the scripts/output/generate/ directory by default.
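If you want to confirm that a run produced playable clips, you can probe one of the outputs with ffprobe (a sketch; it assumes the clips are written as .mp4 files somewhere under scripts/output/generate/, so adjust the pattern if your run uses a different layout):
# List the outputs and probe the first .mp4 found (output layout is an assumption)
ls -R scripts/output/generate/
ffprobe -hide_banner "$(find scripts/output/generate/ -name '*.mp4' | head -n 1)"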
We also provide a demo script for generating outputs from a pre-trained joint generation model enhanced by MMDisCo's joint guidance. MM-Diffusion is supported as the base model, with MMDisCo checkpoints trained on the AIST++ (dance music video) and Landscape (natural scene video) datasets.
The generation script can be run as follows:
cd scripts/
# Using the model trained on AIST++
# (Use double quotes "" for the model argument depending on your shell environment.)
python generate_mmdiffusion.py "model=mmdiffusion_aist++"
# Using the model trained on Landscape
python generate_mmdiffusion.py model=mmdiffusion_landscape
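If you want to generate with both checkpoints in one go, a minimal sketch is to loop over the two model names, quoting the argument as noted above so that it is passed through unchanged in any shell environment:
# Run both MM-Diffusion configurations back to back (quotes keep the "++" argument intact)
for cfg in "mmdiffusion_aist++" "mmdiffusion_landscape"; do
    python generate_mmdiffusion.py "model=${cfg}"
done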
Citation

@inproceedings{hayakawa2025mmdisco,
  title={{MMD}is{C}o: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation},
  author={Akio Hayakawa and Masato Ishii and Takashi Shibuya and Yuki Mitsufuji},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}