Note: this branch supports HunyuanVideo for in-context learning.
TL;DR: We explore in-context capabilities of video diffusion transformers, which can be activated with minimal tuning.
Abstract: Following In-Context LoRA, we directly concatenate condition and target videos into a single composite video along the spatial or temporal dimension, while using natural language to define the task. This serves as a general framework for controllable video generation with task-specific fine-tuning. More encouragingly, it can create a consistent multi-scene video longer than 30 seconds without any additional computational burden.
For more details, please read our technical report.
This is a research project; for production use, we recommend trying more advanced products:
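The composite construction described above can be sketched with plain array operations. A minimal illustration, assuming videos are stored as `(frames, height, width, channels)` arrays (the variable names are ours, not the repo's):

```python
import numpy as np

# Toy stand-ins for a condition video and a target video:
# shape = (frames, height, width, channels).
cond = np.zeros((16, 64, 64, 3), dtype=np.float32)
target = np.ones((16, 64, 64, 3), dtype=np.float32)

# Temporal composite: the condition clip plays first, then the target clip.
temporal = np.concatenate([cond, target], axis=0)  # -> (32, 64, 64, 3)

# Spatial composite: condition and target appear side by side in each frame.
spatial = np.concatenate([cond, target], axis=2)   # -> (16, 64, 128, 3)

print(temporal.shape, spatial.shape)
```

A single natural-language prompt then describes both halves of the composite, which is what turns concatenation into a task definition.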
Our environment is identical to FastVideo's; you can install it with:
./env_setup.sh fastvideo
Download the LoRA checkpoint from Hugging Face and set the model path variable to its location.
We provide scene and human LoRAs, which generate the example cases with the different prompt types shown in the technical report.
After setting the LoRA path, you can run the minimal code below, or refer to infer.py, which generates the example cases.
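The repo's actual loading logic lives in infer.py; as a rough sketch only, this is how a LoRA might be attached to HunyuanVideo through diffusers. The model ID, LoRA path, scene-prompt format, and frame count are placeholders of ours, not values from the repo:

```python
def make_composite_prompt(task: str, scenes: list[str]) -> str:
    """Describe the whole multi-scene composite in one natural-language
    prompt. The "[Scene i]" layout is a hypothetical format for
    illustration, not the repo's actual prompt template."""
    parts = [f"[Scene {i + 1}] {s}" for i, s in enumerate(scenes)]
    return task + " " + " ".join(parts)

if __name__ == "__main__":
    # Heavy imports kept here: only needed for actual generation.
    import torch
    from diffusers import HunyuanVideoPipeline

    # Placeholder paths: substitute your model path variable and the
    # LoRA checkpoint downloaded from Hugging Face.
    MODEL_PATH = "hunyuanvideo-community/HunyuanVideo"
    LORA_PATH = "path/to/lora"

    pipe = HunyuanVideoPipeline.from_pretrained(
        MODEL_PATH, torch_dtype=torch.bfloat16
    )
    pipe.load_lora_weights(LORA_PATH)
    pipe.to("cuda")

    prompt = make_composite_prompt(
        "A consistent two-scene video of the same person.",
        ["walking on a beach at sunset", "sitting in a quiet cafe"],
    )
    video = pipe(prompt=prompt, num_frames=61).frames[0]
```

For the exact prompt templates and sampling settings used for the reported cases, defer to infer.py.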
- Dataset preparation: we use the in-context dataset for preprocessing
bash scripts/preprocess/preprocess_hunyuan_data.sh
- Training
bash scripts/finetune/finetune_hunyuan.sh
- Inference with the fine-tuned weights
bash scripts/inference/inference_hunyuan.sh
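Since inference produces a single composite video, the individual scenes can be recovered by reversing the concatenation. A minimal sketch, where the equal scene lengths and the temporal layout are illustrative assumptions:

```python
import numpy as np

# A toy composite of two equal-length scenes stacked along time:
# shape = (frames, height, width, channels).
composite = np.concatenate(
    [np.zeros((16, 64, 64, 3)), np.ones((16, 64, 64, 3))], axis=0
)

# Split along the frame axis to recover the individual scenes.
scene_a, scene_b = np.split(composite, 2, axis=0)

print(scene_a.shape, scene_b.shape)
```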
Finally, we provide inference speed numbers for reference.
The codebase is built on the awesome IC-LoRA, HunyuanVideo, FastVideo, and diffusers repos.


