Video Diffusion Models are Strong Video Inpainter

Minhyeok Lee ¹ Suhwan Cho ¹ Chajin Shin ¹ Jungho Lee ¹ Sunghun Yang ¹ Sangyoun Lee ¹

¹ Yonsei University

AAAI 2025

Abstract

Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.
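
To make the "first frame filling" idea in the abstract concrete, here is a deliberately simplified toy sketch. It is not the authors' implementation: it ignores motion entirely and copies values from the same spatial position in later frames, whereas the real model works on diffusion noise latents. The function name `fill_first_frame` and the nested-list data layout are assumptions made only for this illustration.

```python
def fill_first_frame(latents, masks):
    """Toy first-frame filling (hypothetical helper, not from the repository).

    latents: T x H x W nested lists of floats, one grid per frame.
    masks:   same shape, True where a value is missing (to be inpainted).
    Returns the filled first frame and the mask of positions still missing.
    """
    first = [row[:] for row in latents[0]]  # copy so the input stays untouched
    hole = [row[:] for row in masks[0]]
    for t in range(1, len(latents)):
        for y, row in enumerate(hole):
            for x, missing in enumerate(row):
                # Take the value from the earliest later frame in which
                # this position is visible (no motion alignment here).
                if missing and not masks[t][y][x]:
                    first[y][x] = latents[t][y][x]
                    hole[y][x] = False
    return first, hole


if __name__ == "__main__":
    latents = [[[0.0, 1.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
    masks = [[[True, True]], [[True, False]], [[False, False]]]
    print(fill_first_frame(latents, masks))
```

Positions that stay masked in every frame remain marked in the returned mask; in the approach the abstract describes, generating such content is left to the fine-tuned image-to-video diffusion model.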

Overview

Installation

To set up the repository locally, follow these steps:

Clone the repository and navigate to the project directory:

git clone https://github.com/Hydragon516/FFF-VDI.git
cd FFF-VDI

Create a new conda environment and activate it:

conda create -n fff-vdi python=3.10
conda activate fff-vdi

Install torch and other dependencies:

pip install torch torchvision
pip install -r requirements.txt
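
As an optional sanity check after installation, a short script like the one below can report which packages are not importable in the active environment. This is only a sketch: the helper name `missing_packages` is hypothetical, and the package list covers only the names mentioned in this README (torch, torchvision, accelerate), not the full contents of requirements.txt.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Only packages named in this README; extend the list to match
    # whatever requirements.txt contains in your checkout.
    missing = missing_packages(["torch", "torchvision", "accelerate"])
    print("missing:", ", ".join(missing) if missing else "none")
```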

Training relies on accelerate, so configure it for your hardware setup with the following command:

accelerate config

Datasets

We train the model on the YouTube-VOS training set. Because FFF-VDI generates random masks during training, it needs only the RGB frames. The complete dataset directory structure is as follows:

.
└── dataset root/
    └── youtube-vos/
        └── JPEGImages/
            ├── 00a23ccf53
            ├── 00ad5016a4
            └── ...
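
Before launching training, it can help to confirm that the layout above is in place. The sketch below counts frames per video folder; the helper name `count_frames` is hypothetical, and the `.jpg` extension is an assumption about how the frames are stored.

```python
from pathlib import Path


def count_frames(dataset_root):
    """Map each video folder under youtube-vos/JPEGImages to its .jpg frame count."""
    frames_dir = Path(dataset_root) / "youtube-vos" / "JPEGImages"
    if not frames_dir.is_dir():
        raise FileNotFoundError(f"expected a directory at {frames_dir}")
    return {
        video.name: sum(1 for _ in video.glob("*.jpg"))  # assumes .jpg frames
        for video in sorted(frames_dir.iterdir())
        if video.is_dir()
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_dataset.py "/path/to/dataset root"
    counts = count_frames(sys.argv[1])
    print(f"{len(counts)} videos, {sum(counts.values())} frames")
```

A video folder reporting zero frames usually points to an incomplete download or a different file extension.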

Training

Set the data path in config.yaml to your dataset root, then train the model with the following command:

accelerate launch train.py

We strongly recommend an environment with 8 or more GPUs with 80GB of VRAM for training.

TODO

Add training details
Add DNA module
Add long video inference code
