Note: this branch supports HunyuanVideo for in-context learning.
TL;DR: We explore in-context capabilities of video diffusion transformers, which can be activated with minimal tuning.
Abstract: Following In-Context LoRA, we directly concatenate condition and target videos into a single composite video along the spatial or temporal dimension, while using natural language to define the task. This serves as a general framework for controllable video generation with task-specific fine-tuning. More encouragingly, it can create a consistent multi-scene video longer than 30 seconds without any additional computational burden.
For more details, please read our technical report.
This is a research project; for production use, we recommend trying more advanced products:
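The composite construction described above can be sketched with plain array operations. A minimal illustration, assuming videos are stored as `(frames, height, width, channels)` arrays (the variable names are ours, not the repo's):

```python
import numpy as np

# Toy stand-ins for a condition video and a target video:
# shape = (frames, height, width, channels).
cond = np.zeros((16, 64, 64, 3), dtype=np.float32)
target = np.ones((16, 64, 64, 3), dtype=np.float32)

# Temporal composite: the condition clip plays first, then the target clip.
temporal = np.concatenate([cond, target], axis=0)  # -> (32, 64, 64, 3)

# Spatial composite: condition and target appear side by side in each frame.
spatial = np.concatenate([cond, target], axis=2)   # -> (16, 64, 128, 3)

print(temporal.shape, spatial.shape)
```

A single natural-language prompt then describes both halves of the composite, which is what turns concatenation into a task definition.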
Our environment is identical to FastVideo's; you can install it with:
./env_setup.sh fastvideo
Download the LoRA checkpoint from Hugging Face and set the model path variable to its location.
We provide scene and human LoRAs, which generate the example cases with the different prompt types shown in the technical report.
After setting the LoRA path, you can run the minimal code below, or refer to infer.py, which generates the example cases.
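The repo's actual loading logic lives in infer.py; as a rough sketch only, this is how a LoRA might be attached to HunyuanVideo through diffusers. The model ID, LoRA path, scene-prompt format, and frame count are placeholders of ours, not values from the repo:

```python
def make_composite_prompt(task: str, scenes: list[str]) -> str:
    """Describe the whole multi-scene composite in one natural-language
    prompt. The "[Scene i]" layout is a hypothetical format for
    illustration, not the repo's actual prompt template."""
    parts = [f"[Scene {i + 1}] {s}" for i, s in enumerate(scenes)]
    return task + " " + " ".join(parts)

if __name__ == "__main__":
    # Heavy imports kept here: only needed for actual generation.
    import torch
    from diffusers import HunyuanVideoPipeline

    # Placeholder paths: substitute your model path variable and the
    # LoRA checkpoint downloaded from Hugging Face.
    MODEL_PATH = "hunyuanvideo-community/HunyuanVideo"
    LORA_PATH = "path/to/lora"

    pipe = HunyuanVideoPipeline.from_pretrained(
        MODEL_PATH, torch_dtype=torch.bfloat16
    )
    pipe.load_lora_weights(LORA_PATH)
    pipe.to("cuda")

    prompt = make_composite_prompt(
        "A consistent two-scene video of the same person.",
        ["walking on a beach at sunset", "sitting in a quiet cafe"],
    )
    video = pipe(prompt=prompt, num_frames=61).frames[0]
```

For the exact prompt templates and sampling settings used for the reported cases, defer to infer.py.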
- Dataset preparation: we use the in-context dataset for preprocessing
bash scripts/preprocess/preprocess_hunyuan_data.sh
- Training
bash scripts/finetune/finetune_hunyuan.sh
- Inference with the fine-tuned weights
bash scripts/inference/inference_hunyuan.sh
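Since inference produces a single composite video, the individual scenes can be recovered by reversing the concatenation. A minimal sketch, where the equal scene lengths and the temporal layout are illustrative assumptions:

```python
import numpy as np

# A toy composite of two equal-length scenes stacked along time:
# shape = (frames, height, width, channels).
composite = np.concatenate(
    [np.zeros((16, 64, 64, 3)), np.ones((16, 64, 64, 3))], axis=0
)

# Split along the frame axis to recover the individual scenes.
scene_a, scene_b = np.split(composite, 2, axis=0)

print(scene_a.shape, scene_b.shape)
```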
Finally, we provide inference speed numbers for reference.
The codebase is built on the awesome IC-LoRA, HunyuanVideo, FastVideo, and diffusers repos.


