GitHub - aim-uofa/GenDeF

GenDeF: Learning Generative Deformation Field for Video Generation

Wen Wang^1,2* Kecheng Zheng² Qiuyu Wang² Hao Chen^1† Zifan Shi^3,2* Ceyuan Yang⁴
Yujun Shen^2† Chunhua Shen¹

^*Intern at Ant Group ^†Corresponding Author

¹Zhejiang University ²Ant Group ³HKUST ⁴Shanghai Artificial Intellgence Laboratory

We offer a new perspective on approaching the task of video generation. Instead of directly synthesizing a sequence of frames, we propose to render a video by warping one static image with a generative deformation field (GenDeF). Such a pipeline enjoys three appealing advantages. First, we can sufficiently reuse a well-trained image generator to synthesize the static image (also called canonical image), alleviating the difficulty in producing a video and thereby resulting in better visual quality. Second, we can easily convert a deformation field to optical flows, making it possible to apply explicit structural regularizations for motion modeling, leading to temporally consistent results. Third, the disentanglement between content and motion allows users to process a synthesized video through processing its corresponding static image without any tuning, facilitating many applications like video editing, keypoint tracking, and video segmentation. Both qualitative and quantitative results on three common video generation benchmarks demonstrate the superiority of our GenDeF method.

Getting Started

Prerequisites

Python 3.8+
CUDA 11.3+
PyTorch 1.11+ and torchvision 0.12+

Installation

# Clone the repository
git clone https://github.com/aim-uofa/GenDeF.git
cd GenDeF

# Install PyTorch (adjust for your CUDA version, see https://pytorch.org/)
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113

# Install Python dependencies
pip install -r requirements.txt

# Install the project in editable mode
pip install -e .

Dataset Preparation

We support training on the following video datasets:

YouTube Driving (YTB): YouTube driving videos at 256×256 resolution
SkyTimelapse: Sky timelapse videos at 256×256 resolution
TaiChi-HD: Tai Chi videos at 256×256 resolution

Organize the dataset as a zip archive and place it in the data/ directory:

data/
  ytb_256.zip       # YouTube Driving dataset
  sky_256.zip       # SkyTimelapse dataset (optional)
  taichi_256.zip    # TaiChi-HD dataset (optional)

Each zip file should contain video frames organized as described in the StyleGAN-V dataset format.

Training

Training follows a two-stage pipeline. Below we use the TaiChi-HD dataset as an example.

Stage 1: Pretrain (Image Generation)

In this stage, we train a 2D image generator backbone with deformable convolutions. The model learns to generate single frames (i.e., num_frames_per_video=1), building a strong image generation foundation.

bash scripts/train_taichi_stage1_pretrain.sh

Key hyperparameters for Stage 1

Parameter	Value	Description
`sampling.num_frames_per_video`	1	Single-frame training
`model.generator.fmaps`	0.5	Generator feature map multiplier
`model.discriminator.fmaps`	0.5	Discriminator feature map multiplier
`model.generator.dcn`	true	Enable deformable convolution in generator
`model.discriminator.tsm`	false	Disable temporal shift module in discriminator
`model.loss_kwargs.r1_gamma`	0.5	R1 regularization weight
`model.generator.learnable_motion_mask`	false	Disable learnable motion mask
`model.generator.time_enc.min_period_len`	16	Minimum period length for time encoding
`training.aug`	ada	Adaptive augmentation
`training.batch_size`	64	Total batch size
`num_gpus`	8	Number of GPUs

Stage 2: Finetune (Video Generation with Deformation Field)

In this stage, we introduce the canonical image generation and deformation field prediction modules. The model learns to generate videos by warping a canonical image with a predicted deformation field. The pretrained checkpoint from Stage 1 is used as initialization.

bash scripts/train_taichi_stage2_finetune.sh

Key hyperparameters for Stage 2

Parameter	Value	Description
`sampling.num_frames_per_video`	3	Multi-frame training
`model.generator.fmaps`	0.5	Generator feature map multiplier
`model.discriminator.fmaps`	0.5	Discriminator feature map multiplier
`model.discriminator.tsm`	true	Enable temporal shift module
`model.loss_kwargs.r1_gamma`	8	R1 regularization weight (increased)
`model.generator.with_canonical`	true	Enable canonical image generation
`model.generator.canonical_cond`	concat	Canonical conditioning method
`model.generator.canonical_cond_dim`	64	Canonical conditioning dimension
`model.generator.canonical_feat`	L13_256_64	Feature level for canonical image
`model.generator.deform_dcn`	true	Enable DCN for deformation prediction
`model.generator.deform_dcn_min_res`	4	Min resolution for deform DCN
`model.generator.deform_dcn_max_res`	64	Max resolution for deform DCN
`model.generator.deform_dcn_torgb`	true	Enable DCN for toRGB layers
`training.resume`	Stage 1 ckpt	Resume from Stage 1 pretrained model

Key Differences between Stage 1 and Stage 2

Aspect	Stage 1 (Pretrain)	Stage 2 (Finetune)
Frames per video	1 (image-only)	3 (video)
Temporal modeling	Disabled (`tsm=false`)	Enabled (`tsm=true`)
Canonical image	Not used	Enabled
Deformation field	Not used	Enabled with DCN
R1 gamma	0.5	8.0
Learnable motion mask	false	true

Generation (Sampling)

After training, generate videos using:

bash scripts/generate_videos.sh

You can customize the generation by editing the script or passing arguments directly:

python src/scripts/generate_ours.py \
    --network_pkl output/taichi_finetune/output/best.pkl \
    --num_videos 100 \
    --save_as_mp4 true \
    --fps 25 \
    --video_len 128 \
    --batch_size 25 \
    --outdir sample/taichi \
    --truncation_psi 0.9 \
    --seed 42

Argument	Description
`--network_pkl`	Path to the trained model checkpoint (`.pkl`)
`--num_videos`	Number of videos to generate
`--video_len`	Number of frames per video
`--fps`	Frames per second for saved mp4
`--truncation_psi`	Truncation (lower = higher quality, less diversity)
`--save_as_mp4`	Save as mp4 video files
`--seed`	Random seed for reproducibility

Main Results

Applications

Video Editing

Point Tracking

Video Segmentation

Diverse Motion Generation

Acknowledgements

This codebase is built on top of StyleGAN-V. We thank the authors for their excellent work.

Citing

If you find our work useful, please consider citing:

@misc{wang2023gendef,
    title={GenDeF: Learning Generative Deformation Field for Video Generation},
    author={Wen Wang and Kecheng Zheng and Qiuyu Wang and Hao Chen and Zifan Shi and Ceyuan Yang and Yujun Shen and Chunhua Shen},
    year={2023},
    eprint={2312.04561},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
configs		configs
docs		docs
scripts		scripts
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GenDeF: Learning Generative Deformation Field for Video Generation

Getting Started

Prerequisites

Installation

Dataset Preparation

Training

Stage 1: Pretrain (Image Generation)

Stage 2: Finetune (Video Generation with Deformation Field)

Key Differences between Stage 1 and Stage 2

Generation (Sampling)

Main Results

Applications

Video Editing

Point Tracking

Video Segmentation

Diverse Motion Generation

Acknowledgements

Citing

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

GenDeF: Learning Generative Deformation Field for Video Generation

Getting Started

Prerequisites

Installation

Dataset Preparation

Training

Stage 1: Pretrain (Image Generation)

Stage 2: Finetune (Video Generation with Deformation Field)

Key Differences between Stage 1 and Stage 2

Generation (Sampling)

Main Results

Applications

Video Editing

Point Tracking

Video Segmentation

Diverse Motion Generation

Acknowledgements

Citing

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages