Mingtao Guo1 Guanyu Xing2 Yanci Zhang3 Yanli Liu1,3
1 National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China
2 School of Cyber Science and Engineering, Sichuan University, Chengdu, China
3 College of Computer Science, Sichuan University, Chengdu, China
To replicate the main results (as shown in Fig. 2), please follow the steps below:
You may modify the source image and driving video paths in inference.py to test with your own inputs.
resources/source1.png--resources/driving1.mp4
resources/source2.png--resources/driving2.mp4
resources/source3.png--resources/driving3.mp4
resources/source4.png--resources/driving4.mp4
resources/source5.png--resources/driving5.mp4
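The five bundled pairs above follow a simple index pattern. The helper below is our own hypothetical sketch (`example_pairs` is not part of the repo) for enumerating them when swapping the paths in inference.py:

```python
def example_pairs(n=5):
    """Enumerate the bundled (source image, driving video) pairs.

    Mirrors the resources/ files listed above; indices 1..5 ship
    with the repository.
    """
    return [(f"resources/source{i}.png", f"resources/driving{i}.mp4")
            for i in range(1, n + 1)]

for src, drv in example_pairs():
    print(src, "->", drv)
```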
Hardware Requirements
- GPU: NVIDIA RTX 4090 or equivalent
- VRAM: At least 12 GB recommended
- Inference Time: Approximately 4 minutes per 100-frame video on an RTX 4090
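The quoted timing works out to roughly 2.4 s per frame. A rough estimator of ours (not repo code) for budgeting longer videos, assuming the throughput scales linearly:

```python
def estimated_inference_seconds(num_frames):
    """Rough RTX 4090 runtime estimate from the reported figure:
    ~4 minutes (240 s) per 100-frame video, i.e. ~2.4 s/frame.
    Actual speed varies with resolution and hardware.
    """
    sec_per_frame = 240 / 100  # 2.4 s per frame
    return num_frames * sec_per_frame

print(estimated_inference_seconds(100))  # 240.0
print(estimated_inference_seconds(250))  # 600.0
```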
We will make the following available:
- Model inference code
- Model checkpoint
- Training code
- Clone this repo locally:
git clone https://github.com/MingtaoGuo/Face-Reenactment-Video-Diffusion
cd Face-Reenactment-Video-Diffusion
- Install the dependencies:
sudo apt update
sudo apt install unzip
sudo apt install git-lfs
conda create -n frvd python=3.8
conda activate frvd
- Install packages for inference:
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
- Download the pretrained weights:
mkdir pretrained_weights
mkdir pretrained_weights/checkpoint-30000-14frames
mkdir pretrained_weights/facecropper
mkdir pretrained_weights/liveportrait
git-lfs install
git clone https://huggingface.co/MartinGuo/Face-Reenactment-Video-Diffusion
mv Face-Reenactment-Video-Diffusion/head_embedder.pth pretrained_weights/checkpoint-30000-14frames
mv Face-Reenactment-Video-Diffusion/warping_feature_mapper.pth pretrained_weights/checkpoint-30000-14frames
mv Face-Reenactment-Video-Diffusion/insightface pretrained_weights/facecropper
mv Face-Reenactment-Video-Diffusion/landmark.onnx pretrained_weights/facecropper
mv Face-Reenactment-Video-Diffusion/appearance_feature_extractor.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/motion_extractor.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/spade_generator.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/warping_module.pth pretrained_weights/liveportrait
git clone https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
mv stable-video-diffusion-img2vid-xt pretrained_weights
git clone https://huggingface.co/stabilityai/sd-vae-ft-mse
mv sd-vae-ft-mse pretrained_weights/stable-video-diffusion-img2vid-xt
The weights will be saved in the ./pretrained_weights directory. Please note that the download process may take a significant amount of time.
Once completed, the weights should be arranged in the following structure:
./pretrained_weights/
|-- checkpoint-30000-14frames
| |-- warping_feature_mapper.pth
| |-- head_embedder.pth
|-- facecropper
| |-- insightface
| |-- landmark.onnx
|-- liveportrait
| |-- appearance_feature_extractor.pth
| |-- motion_extractor.pth
| |-- spade_generator.pth
| |-- warping_module.pth
|-- stable-video-diffusion-img2vid-xt
| |-- sd-vae-ft-mse
| | |-- config.json
| | |-- diffusion_pytorch_model.bin
| |-- feature_extractor
| | |-- preprocessor_config.json
| |-- scheduler
| | |-- scheduler_config.json
| |-- model_index.json
| |-- unet
| | |-- config.json
| | |-- diffusion_pytorch_model.safetensors
| | |-- diffusion_pytorch_model.fp16.safetensors
| |-- image_encoder
| | |-- config.json
| | |-- model.safetensors
| | |-- model.fp16.safetensors
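Before running inference, it can help to confirm the weights landed in the layout shown above. This is our own sanity-check sketch; the `missing_weights` helper and the file list are assumptions drawn from the tree, not repo code:

```python
from pathlib import Path

# Key files from the directory tree above; extend as needed.
REQUIRED = [
    "checkpoint-30000-14frames/head_embedder.pth",
    "checkpoint-30000-14frames/warping_feature_mapper.pth",
    "facecropper/landmark.onnx",
    "liveportrait/appearance_feature_extractor.pth",
    "liveportrait/motion_extractor.pth",
    "liveportrait/spade_generator.pth",
    "liveportrait/warping_module.pth",
    "stable-video-diffusion-img2vid-xt/model_index.json",
    "stable-video-diffusion-img2vid-xt/sd-vae-ft-mse/config.json",
]

def missing_weights(root="pretrained_weights"):
    """Return the relative paths from REQUIRED that do not exist under root."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).exists()]

if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:", *missing, sep="\n  ")
    else:
        print("All expected weight files found.")
```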
python inference.py
After running inference.py, you'll obtain the reenactment results.
- Download the training datasets:
git-lfs install
git clone https://huggingface.co/datasets/MartinGuo/TalkingHeadVideo
cd TalkingHeadVideo
unzip CelebV-HQ-crop-liveportrait.zip
unzip VFHQ-video-liveportrait.zip
The datasets will be saved in the ./TalkingHeadVideo directory. Please note that the download process may take a significant amount of time.
Once completed, the datasets should be arranged in the following structure:
./TalkingHeadVideo/
|-- CelebV-HQ-crop-liveportrait
| |-- hk9jXpszz0I_2_0.mp4
| |-- _msjEt4-jZc_0.mp4
...
|-- VFHQ-video-liveportrait
| |-- Clip+HKb2I-q2k2E+P0+C1+F3658-3845_12001.mp4
| |-- Clip+HKb2I-q2k2E+P0+C0+F991-1129_7612.mp4
...
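A small hypothetical helper of ours (not repo code) to confirm both subsets were unzipped into the layout above and to count the available clips:

```python
from pathlib import Path

# The two subset directories produced by the unzip steps above.
SUBSETS = ("CelebV-HQ-crop-liveportrait", "VFHQ-video-liveportrait")

def count_clips(root="TalkingHeadVideo"):
    """Return a dict mapping each subset name to its number of .mp4 clips.

    A missing subset directory simply yields a count of 0.
    """
    base = Path(root)
    return {s: len(list((base / s).glob("*.mp4"))) for s in SUBSETS}
```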
train.py requires approximately 42 GB of GPU memory. The proposed method was trained on a single A6000 GPU for about six days.
python train.py
We first thank the contributors to the StableVideoDiffusion, SVD_Xtend, and MimicMotion repositories for their open research and exploration. Our repo also incorporates code from LivePortrait and InsightFace, and we extend our thanks to them as well.
This project is licensed under the MIT License. See the LICENSE file for details.


