YaoYang Liu¹ · Yuechen Zhang² · Wenbo Li³ · Yufei Zhao⁴ · Rui Liu⁵ · Long Chen¹,*
¹HKUST · ²CUHK · ³Joy Future Academy · ⁴HKU · ⁵HUAWEI Research
* Corresponding author
2K (2560×1408) image-to-video generation — 81 frames in ~111s on a single H800, and runnable on a single RTX 4090 (24 GB).
- [2026-05] 📄 arXiv preprint released: arXiv:2605.06356
- [2026-05] 🌐 Project page online: hkust-longgroup.github.io/SwiftI2V
- Code and checkpoints will be released in this repository — stay tuned!
- 🚀 202× less GPU-time than end-to-end 2K I2V baselines
- 🖼️ Native 2K (2560×1408) generation at 81 frames
- 💻 Single consumer RTX 4090 (24 GB) is enough — no data-center GPU required
- 🧩 Conditional Segment-wise Generation (CSG) with bidirectional contextual interaction — memory-bounded regardless of video length
- 🎯 Stage-transition training closes the train–test gap between Stage I and Stage II
Image-to-video (I2V) generation has made rapid progress, yet scaling to high resolution (e.g., 2K) is bottlenecked by the efficiency–fidelity dilemma: end-to-end high-resolution generators deliver strong quality but require tens of thousands of GPU-seconds per clip, while low-resolution generation followed by video super-resolution (VSR) loses the input-image condition and hallucinates details inconsistent with the reference.
SwiftI2V is an efficient two-stage framework that resolves this dilemma:
- Stage I produces a low-resolution motion reference with a large backbone under few-step sampling.
- Stage II refines it to 2K, conditioned on both the input image and the Stage I output (see the inference sketch after this list).
- Conditional Segment-wise Generation (CSG) divides the temporal axis into bounded segments augmented by neighboring contexts, keeping peak memory roughly constant regardless of total length (see the windowing sketch after the CSG figure below).
- Stage-transition training simulates Stage I-style artifacts during Stage II training to close the train–test gap (a toy augmentation sketch follows the benchmark note below).
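To make the two-stage flow concrete, here is a minimal inference sketch. The official code is not yet released, so every name below (`swift_i2v`, `stage1_sample`, `stage2_refine`) is a hypothetical placeholder, not the actual API:

```python
# Hypothetical sketch of the SwiftI2V inference flow; placeholder names only.
from typing import Any, Callable

def swift_i2v(
    image: Any,                         # conditioning input image
    prompt: str,
    stage1_sample: Callable[..., Any],  # large backbone, few-step sampler
    stage2_refine: Callable[..., Any],  # 2K refiner using CSG
) -> Any:
    # Stage I: low-resolution motion reference under few-step sampling.
    motion_ref = stage1_sample(image=image, prompt=prompt)
    # Stage II: refine to 2K (2560x1408), conditioned on BOTH the input
    # image (appearance anchor) and the Stage I output (motion anchor),
    # so details stay consistent with the reference instead of being
    # hallucinated as in a plain VSR post-process.
    return stage2_refine(image=image, motion_reference=motion_ref)
```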
On VBench-I2V at 2K, SwiftI2V matches strong end-to-end baselines on key I2V metrics while reducing total GPU-time by 202×.
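The stage-transition bullet can likewise be read as a training-time augmentation: occasionally corrupt the clean low-resolution reference during Stage II training so the refiner also sees Stage I-like imperfect inputs. The paper's actual degradation procedure is not public; the Gaussian-noise corruption below is purely a stand-in:

```python
import torch

def stage_transition_augment(lowres_clip: torch.Tensor,
                             p: float = 0.5,
                             noise_scale: float = 0.1) -> torch.Tensor:
    """Toy stage-transition augmentation (a stand-in, not the paper's
    recipe): with probability p, corrupt the clean low-res training
    reference so Stage II also learns to refine Stage I-like imperfect
    inputs, closing the train-test gap between the two stages."""
    if torch.rand(()).item() < p:
        corrupted = lowres_clip + noise_scale * torch.randn_like(lowres_clip)
        return corrupted.clamp(-1.0, 1.0)  # assumes values in [-1, 1]
    return lowres_clip
```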
Overall two-stage pipeline of SwiftI2V.
Conditional Segment-wise Generation (CSG) with bidirectional contextual interaction.
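As an illustration of the figure above, the sketch below shows one plausible CSG-style windowing: bounded segments, each attending to a few neighboring frames on both sides. `seg_len` and `context` are hypothetical knobs, not values from the paper:

```python
def csg_windows(num_frames: int, seg_len: int, context: int):
    """Illustrative CSG-style temporal windowing (our sketch; the released
    implementation may differ). The video is split into bounded segments;
    each segment additionally attends to `context` neighboring frames on
    both sides (bidirectional contextual interaction). Peak memory scales
    with seg_len + 2 * context, not with num_frames."""
    windows = []
    for start in range(0, num_frames, seg_len):
        end = min(start + seg_len, num_frames)
        windows.append({
            "refine": (start, end),  # frames this segment is responsible for
            "attend": (max(0, start - context),          # left context
                       min(num_frames, end + context)),  # right context
        })
    return windows

# e.g. the paper's 81-frame setting with hypothetical seg_len/context values
for w in csg_windows(81, seg_len=16, context=4):
    print(w)
```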
For the full gallery, qualitative comparisons, ablations, and RTX 4090 results, please visit our Project Page.
- ✅️ Release project page
- ✅️ Release arXiv paper
- ⬜ Release inference code
- ⬜ Release Stage I / Stage II checkpoints
If you find SwiftI2V useful in your research, please consider citing:
@misc{liu2026swifti2vefficienthighresolutionimagetovideo,
      title={SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation},
      author={YaoYang Liu and Yuechen Zhang and Wenbo Li and Yufei Zhao and Rui Liu and Long Chen},
      year={2026},
      eprint={2605.06356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.06356}
}

We gratefully acknowledge the following open-source projects that made this work possible:
- Wan — the video generation foundation model we build upon.
- DiffSynth-Studio — for its flexible diffusion training and inference framework.
- LightX2V — for its efficient inference utilities.
The project page design is adapted from the Nerfies academic template (CC BY-SA 4.0).

