Mingtao Guo1 Guanyu Xing2 Yanci Zhang3 Yanli Liu1,3
1 National Key Laboratory of Fundamental Science on Synthetic Vision, Sichuan University, Chengdu, China
2 School of Cyber Science and Engineering, Sichuan University, Chengdu, China
3 College of Computer Science, Sichuan University, Chengdu, China
To replicate the main results (as shown in Fig. 2), please follow the steps below:
You may modify the source image and driving video paths in inference.py to test with your own inputs.
resources/source1.png--resources/driving1.mp4
resources/source2.png--resources/driving2.mp4
resources/source3.png--resources/driving3.mp4
resources/source4.png--resources/driving4.mp4
resources/source5.png--resources/driving5.mp4
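The five bundled pairs above follow a simple index pattern. The helper below is our own hypothetical sketch (`example_pairs` is not part of the repo) for enumerating them when swapping the paths in inference.py:

```python
def example_pairs(n=5):
    """Enumerate the bundled (source image, driving video) pairs.

    Mirrors the resources/ files listed above; indices 1..5 ship
    with the repository.
    """
    return [(f"resources/source{i}.png", f"resources/driving{i}.mp4")
            for i in range(1, n + 1)]

for src, drv in example_pairs():
    print(src, "->", drv)
```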
Hardware Requirements
- GPU: NVIDIA RTX 4090 or equivalent
- VRAM: At least 12 GB recommended
- Inference Time: Approximately 4 minutes per 100-frame video on an RTX 4090
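The quoted timing works out to roughly 2.4 s per frame. A rough estimator of ours (not repo code) for budgeting longer videos, assuming the throughput scales linearly:

```python
def estimated_inference_seconds(num_frames):
    """Rough RTX 4090 runtime estimate from the reported figure:
    ~4 minutes (240 s) per 100-frame video, i.e. ~2.4 s/frame.
    Actual speed varies with resolution and hardware.
    """
    sec_per_frame = 240 / 100  # 2.4 s per frame
    return num_frames * sec_per_frame

print(estimated_inference_seconds(100))  # 240.0
print(estimated_inference_seconds(250))  # 600.0
```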
We will make the following available:
- Model inference code
- Model checkpoint
- Training code
- Clone this repo locally:
git clone https://github.com/MingtaoGuo/Face-Reenactment-Video-Diffusion
cd Face-Reenactment-Video-Diffusion
- Install the dependencies:
sudo apt update
sudo apt install unzip
sudo apt install git-lfs
conda create -n frvd python=3.8
conda activate frvd
- Install packages for inference:
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
- Download the pretrained weights:
mkdir pretrained_weights
mkdir pretrained_weights/checkpoint-30000-14frames
mkdir pretrained_weights/facecropper
mkdir pretrained_weights/liveportrait
git-lfs install
git clone https://huggingface.co/MartinGuo/Face-Reenactment-Video-Diffusion
mv Face-Reenactment-Video-Diffusion/head_embedder.pth pretrained_weights/checkpoint-30000-14frames
mv Face-Reenactment-Video-Diffusion/warping_feature_mapper.pth pretrained_weights/checkpoint-30000-14frames
mv Face-Reenactment-Video-Diffusion/insightface pretrained_weights/facecropper
mv Face-Reenactment-Video-Diffusion/landmark.onnx pretrained_weights/facecropper
mv Face-Reenactment-Video-Diffusion/appearance_feature_extractor.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/motion_extractor.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/spade_generator.pth pretrained_weights/liveportrait
mv Face-Reenactment-Video-Diffusion/warping_module.pth pretrained_weights/liveportrait
git clone https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt
mv stable-video-diffusion-img2vid-xt pretrained_weights
git clone https://huggingface.co/stabilityai/sd-vae-ft-mse
mv sd-vae-ft-mse pretrained_weights/stable-video-diffusion-img2vid-xt
The weights will be saved in the ./pretrained_weights directory. Please note that the download process may take a significant amount of time.
Once completed, the weights should be arranged in the following structure:
./pretrained_weights/
|-- checkpoint-30000-14frames
| |-- warping_feature_mapper.pth
| |-- head_embedder.pth
|-- facecropper
| |-- insightface
| |-- landmark.onnx
|-- liveportrait
| |-- appearance_feature_extractor.pth
| |-- motion_extractor.pth
| |-- spade_generator.pth
| |-- warping_module.pth
|-- stable-video-diffusion-img2vid-xt
| |-- sd-vae-ft-mse
| | |-- config.json
| | |-- diffusion_pytorch_model.bin
| |-- feature_extractor
| | |-- preprocessor_config.json
| |-- scheduler
| | |-- scheduler_config.json
| |-- model_index.json
| |-- unet
| | |-- config.json
| | |-- diffusion_pytorch_model.safetensors
| | |-- diffusion_pytorch_model.fp16.safetensors
| |-- image_encoder
| | |-- config.json
| | |-- model.safetensors
| | |-- model.fp16.safetensors
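Before running inference, it can help to confirm the weights landed in the layout shown above. This is our own sanity-check sketch; the `missing_weights` helper and the file list are assumptions drawn from the tree, not repo code:

```python
from pathlib import Path

# Key files from the directory tree above; extend as needed.
REQUIRED = [
    "checkpoint-30000-14frames/head_embedder.pth",
    "checkpoint-30000-14frames/warping_feature_mapper.pth",
    "facecropper/landmark.onnx",
    "liveportrait/appearance_feature_extractor.pth",
    "liveportrait/motion_extractor.pth",
    "liveportrait/spade_generator.pth",
    "liveportrait/warping_module.pth",
    "stable-video-diffusion-img2vid-xt/model_index.json",
    "stable-video-diffusion-img2vid-xt/sd-vae-ft-mse/config.json",
]

def missing_weights(root="pretrained_weights"):
    """Return the relative paths from REQUIRED that do not exist under root."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).exists()]

if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:", *missing, sep="\n  ")
    else:
        print("All expected weight files found.")
```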
python inference.py
After running inference.py, you'll obtain the reenactment results.
- Download the training datasets:
git-lfs install
git clone https://huggingface.co/datasets/MartinGuo/TalkingHeadVideo
cd TalkingHeadVideo
unzip CelebV-HQ-crop-liveportrait.zip
unzip VFHQ-video-liveportrait.zip
The datasets will be saved in the ./TalkingHeadVideo directory. Please note that the download process may take a significant amount of time.
Once completed, the datasets should be arranged in the following structure:
./TalkingHeadVideo/
|-- CelebV-HQ-crop-liveportrait
| |-- hk9jXpszz0I_2_0.mp4
| |-- _msjEt4-jZc_0.mp4
...
|-- VFHQ-video-liveportrait
| |-- Clip+HKb2I-q2k2E+P0+C1+F3658-3845_12001.mp4
| |-- Clip+HKb2I-q2k2E+P0+C0+F991-1129_7612.mp4
...
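A small hypothetical helper of ours (not repo code) to confirm both subsets were unzipped into the layout above and to count the available clips:

```python
from pathlib import Path

# The two subset directories produced by the unzip steps above.
SUBSETS = ("CelebV-HQ-crop-liveportrait", "VFHQ-video-liveportrait")

def count_clips(root="TalkingHeadVideo"):
    """Return a dict mapping each subset name to its number of .mp4 clips.

    A missing subset directory simply yields a count of 0.
    """
    base = Path(root)
    return {s: len(list((base / s).glob("*.mp4"))) for s in SUBSETS}
```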
train.py requires approximately 42 GB of GPU memory. The proposed method was trained on a single A6000 GPU for about six days.
python train.py
We first thank the contributors to the StableVideoDiffusion, SVD_Xtend, and MimicMotion repositories for their open research and exploration. Our repo also incorporates code from LivePortrait and InsightFace, and we extend our thanks to them as well.
This project is licensed under the MIT License. See the LICENSE file for details.


