This repository is the official implementation of "MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation (ICLR 2025)".
Akio Hayakawa¹, Masato Ishii¹, Takashi Shibuya¹, Yuki Mitsufuji¹,²
¹Sony AI and ²Sony Group Corporation
We recommend using a Miniforge environment.
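For example, you can create and activate a dedicated environment like this (the environment name and Python version are illustrative choices, not taken from the repository):
# Create and activate a fresh environment (name and Python version are examples only)
conda create -n mmdisco python=3.10
conda activate mmdisco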
1. Clone the repository
git clone https://github.com/SonyResearch/MMDisCo.git

2. Install prerequisites if needed
conda install -c conda-forge ffmpeg

3. Install required Python libraries
cd MMDisCo
pip install -e .
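As an optional sanity check before downloading any checkpoints, you can confirm that ffmpeg is on your PATH and that PyTorch sees a GPU (a sketch; it assumes PyTorch is pulled in as a dependency of this package):
# Optional environment check (assumes PyTorch is installed as a dependency)
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"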
You need to download the weights of MM-Diffusion and VideoCrafter2 if you want to use these models as the base models. The weights of the other models are downloaded automatically.

MM-Diffusion
Please follow the official instructions to download the pre-trained models.
Both the base model (64x64) and the super-resolution model (64x64 -> 256x256) weights are required; place the downloaded weights in the ./checkpoints/mmdiffusion directory.
VideoCrafter2
Please follow the official instructions to download the pre-trained Text-to-Video model.
The downloaded weights should be placed in the ./checkpoints/video_crafter directory.
The pre-trained MMDisCo models are available at https://huggingface.co/AkioHayakawa/MMDisCo.
Download all files and place them in the ./checkpoints/mmdisco directory.
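If you prefer to fetch everything from the command line, the huggingface_hub CLI can download the whole repository in one step (a sketch; it assumes the files sit at the top level of the Hugging Face repo, so compare the result with the directory tree below):
# Download all MMDisCo checkpoint files into the expected directory
pip install -U "huggingface_hub[cli]"
huggingface-cli download AkioHayakawa/MMDisCo --local-dir ./checkpoints/mmdisco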
The expected directory structure that includes all pre-trained weights is:
MMDisCo
└── checkpoints
├── mmdiffusion
│ ├── AIST++.pt
│ ├── AIST++_SR.pt
│ ├── landscape.pt
│ └── landscape_SR.pt
├── video_crafter
│ └── base_512_v2
│ └── model.ckpt
└── mmdisco
├── audioldm_animediff_vggsound.pt
├── auffusion_videocrafter2_vggsound.pt
├── mmdiffusion_aist++.pt
└── mmdiffusion_landscape.pt
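To verify that everything is in place, a short shell check like the following (run from the repository root) reports any file from the tree above that is still missing:
# Report any expected checkpoint file that is missing
for f in \
    checkpoints/mmdiffusion/AIST++.pt \
    checkpoints/mmdiffusion/AIST++_SR.pt \
    checkpoints/mmdiffusion/landscape.pt \
    checkpoints/mmdiffusion/landscape_SR.pt \
    checkpoints/video_crafter/base_512_v2/model.ckpt \
    checkpoints/mmdisco/audioldm_animediff_vggsound.pt \
    checkpoints/mmdisco/auffusion_videocrafter2_vggsound.pt \
    checkpoints/mmdisco/mmdiffusion_aist++.pt \
    checkpoints/mmdisco/mmdiffusion_landscape.pt
do
    [ -f "$f" ] || echo "missing: $f"
done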
Joint Audio and Video Generation Using Pre-trained Text-to-Audio and Text-to-Video Models with MMDisCo
We provide a demo script for joint audio and video generation using two pairs of base models: AudioLDM / AnimateDiff and Auffusion / VideoCrafter2.
The generation script can be run as follows:
cd scripts/
# Using AudioLDM / AnimateDiff as base models
python generate.py model=audioldm_animediff_vggsound
# Using Auffusion / VideoCrafter2 as base models
python generate.py model=auffusion_videocrafter2_vggsound

The output videos will be placed in the scripts/output/generate/ directory by default.
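If you want to confirm that a run produced playable clips, you can probe one of the outputs with ffprobe (a sketch; it assumes the clips are written as .mp4 files somewhere under scripts/output/generate/, so adjust the pattern if your run uses a different layout):
# List the outputs and probe the first .mp4 found (output layout is an assumption)
ls -R scripts/output/generate/
ffprobe -hide_banner "$(find scripts/output/generate/ -name '*.mp4' | head -n 1)"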
We also provide a demo script for generating outputs from a pre-trained joint generation model enhanced by MMDisCo's joint guidance. MM-Diffusion is supported as the base model, with MMDisCo checkpoints trained on the AIST++ (dance music video) and Landscape (natural scene video) datasets.
The generation script can be run as follows:
cd scripts/
# Using the model trained on AIST++
# (Use double quotes "" for the model argument depending on your shell environment.)
python generate_mmdiffusion.py "model=mmdiffusion_aist++"
# Using the model trained on Landscape
python generate_mmdiffusion.py model=mmdiffusion_landscape
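If you want to generate with both checkpoints in one go, a minimal sketch is to loop over the two model names, quoting the argument as noted above so that it is passed through unchanged in any shell environment:
# Run both MM-Diffusion configurations back to back (quotes keep the "++" argument intact)
for cfg in "mmdiffusion_aist++" "mmdiffusion_landscape"; do
    python generate_mmdiffusion.py "model=${cfg}"
done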
Citation

@inproceedings{hayakawa2025mmdisco,
  title={{MMD}is{C}o: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation},
  author={Akio Hayakawa and Masato Ishii and Takashi Shibuya and Yuki Mitsufuji},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}