Propagation-based video inpainting using optical flow at the pixel or feature level has recently garnered significant attention. However, it has limitations such as the inaccuracy of optical flow prediction and the propagation of noise over time. These issues result in non-uniform noise and time consistency problems throughout the video, which are particularly pronounced when the removed area is large and involves substantial movement. To address these issues, we propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We design FFF-VDI inspired by the capabilities of pre-trained image-to-video diffusion models that can transform the first frame image into a highly natural video. To apply this to the video inpainting task, we propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. The proposed model addresses the limitations of existing methods that rely on optical flow quality, producing much more natural and temporally consistent videos. This proposed approach is the first to effectively integrate image-to-video diffusion models into video inpainting tasks. Through various comparative experiments, we demonstrate that the proposed model can robustly handle diverse inpainting types with high quality.
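To make the first-frame filling idea concrete, here is a deliberately naive sketch in plain Python. It is not the FFF-VDI implementation: it ignores motion and any learned alignment, treats each latent as a 2D grid of scalars, and simply copies, for every masked position in the first frame, the value from the earliest later frame in which that position is unmasked. The function name `fill_first_frame` and the mask convention (`True` means masked) are assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]      # one latent channel, H x W
MaskGrid = List[List[bool]]   # True = masked (hole), False = known


def fill_first_frame(latents: List[Grid], masks: List[MaskGrid]) -> Grid:
    """Fill masked positions of frame 0 from later frames (toy, no motion model).

    For each masked position in the first frame, copy the value from the
    earliest later frame where that position is not masked. Positions that
    are masked in every frame keep their original first-frame value.
    """
    if len(latents) != len(masks) or not latents:
        raise ValueError("latents and masks must be non-empty and equally long")

    filled = [row[:] for row in latents[0]]  # copy so the input is untouched
    for y, row in enumerate(masks[0]):
        for x, is_masked in enumerate(row):
            if not is_masked:
                continue
            # Search later frames in temporal order for a known value.
            for t in range(1, len(latents)):
                if not masks[t][y][x]:
                    filled[y][x] = latents[t][y][x]
                    break
    return filled


# Tiny example: 3 frames of 1 x 3 latents.
latents = [[[0.0, 0.0, 0.0]], [[1.0, 1.0, 1.0]], [[2.0, 2.0, 2.0]]]
masks = [[[False, True, True]], [[False, False, True]], [[False, False, True]]]
first_frame = fill_first_frame(latents, masks)
```

A real model has to account for object and camera motion when propagating information between frames, which this sketch does not attempt.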
To set up the repository locally, follow these steps:
- Clone the repository and navigate to the project directory:
    git clone https://github.com/Hydragon516/FFF-VDI.git
    cd FFF-VDI
- Create a new conda environment and activate it:
    conda create -n fff-vdi python=3.10
    conda activate fff-vdi
- Install torch and other dependencies:
    pip install torch torchvision
    pip install -r requirements.txt
Model training requires accelerate, which you should configure for your hardware setup. Use the following command to configure it:
    accelerate config

We use the YouTube-VOS train dataset for model training. Since FFF-VDI generates random masks during training, it only requires a set of RGB images. The complete dataset directory structure is as follows:
.
└── dataset root/
    └── youtube-vos/
        └── JPEGImages/
            ├── 00a23ccf53
            ├── 00ad5016a4
            └── ...
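Before training, it can help to confirm that the dataset matches this layout. The following is a minimal sketch using only the Python standard library. The helper name `check_dataset_layout` is hypothetical, and the assumption that frames are stored as `.jpg` files is mine, not something the repository states.

```python
from pathlib import Path
from typing import Dict


def check_dataset_layout(dataset_root: str) -> Dict[str, int]:
    """Return {video_id: frame_count} for <root>/youtube-vos/JPEGImages/.

    Raises FileNotFoundError if the expected directory is missing and
    ValueError if a video folder has no .jpg frames (assumed extension).
    """
    frames_dir = Path(dataset_root) / "youtube-vos" / "JPEGImages"
    if not frames_dir.is_dir():
        raise FileNotFoundError(f"expected directory not found: {frames_dir}")

    counts: Dict[str, int] = {}
    # Each subdirectory of JPEGImages is treated as one video clip.
    for video_dir in sorted(p for p in frames_dir.iterdir() if p.is_dir()):
        n_frames = sum(1 for f in video_dir.iterdir() if f.suffix.lower() == ".jpg")
        if n_frames == 0:
            raise ValueError(f"no .jpg frames in {video_dir}")
        counts[video_dir.name] = n_frames
    return counts
```

Calling `check_dataset_layout` with the path to your dataset root returns a mapping from each video folder name to its frame count, or raises an error that names the first directory that does not match.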
Modify the data path in config.yaml. You can then train the model with the following command:
    accelerate launch train.py

We strongly recommend an environment with 8 or more GPUs with 80GB of VRAM for training.
TODO:
- Add training details
- Add DNA module
- Add long video inference code
