HKUST-LongGroup/SwiftI2V

SwiftI2V

Efficient High-Resolution Image-to-Video Generation
via Conditional Segment-wise Generation

YaoYang Liu1 · Yuechen Zhang2 · Wenbo Li3 · Yufei Zhao4 · Rui Liu5 · Long Chen1,*

1HKUST  ·  2CUHK  ·  3Joy Future Academy  ·  4HKU  ·  5HUAWEI Research
* Corresponding author

arXiv · Project Page

SwiftI2V teaser

2K (2560×1408) image-to-video generation — 81 frames in ~111s on a single H800, and runnable on a single RTX 4090 (24 GB).


📢 News

✨ Highlights

  • 🚀 202× less GPU-time than end-to-end 2K I2V baselines
  • 🖼️ Native 2K (2560×1408) generation at 81 frames
  • 💻 Single consumer RTX 4090 (24 GB) is enough — no data-center GPU required
  • 🧩 Conditional Segment-wise Generation (CSG) with bidirectional contextual interaction — memory-bounded regardless of video length
  • 🎯 Stage-transition training closes the train–test gap between Stage I and Stage II

📖 Overview

Image-to-video (I2V) generation has made rapid progress, yet scaling to high resolution (e.g., 2K) is bottlenecked by the efficiency–fidelity dilemma: end-to-end high-resolution generators deliver strong quality but require tens of thousands of GPU-seconds per clip, while low-resolution generation followed by video super-resolution (VSR) loses the input-image condition and hallucinates details inconsistent with the reference.

SwiftI2V is an efficient two-stage framework that resolves this dilemma:

  • Stage I produces a low-resolution motion reference with a large backbone under few-step sampling.
  • Stage II refines it to 2K, conditioned on both the input image and the Stage I output.
  • Conditional Segment-wise Generation (CSG) divides the temporal axis into bounded segments augmented by neighboring contexts, keeping peak memory roughly constant regardless of total length.
  • Stage-transition training simulates Stage I-style artifacts during Stage II training to close the train–test gap.
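The bounded-memory property of CSG can be illustrated with a small scheduling sketch. This is an assumption-laden illustration, not the released implementation: the function name, segment length, context length, and exact windowing below are all hypothetical choices made for clarity.

```python
def segment_schedule(num_frames: int, seg_len: int, ctx_len: int):
    """Illustrative scheduler for conditional segment-wise generation.

    Splits the temporal axis into bounded segments and attaches up to
    `ctx_len` context frames from each neighboring segment, so the
    per-segment working set never exceeds seg_len + 2 * ctx_len frames,
    regardless of total video length. All names and the windowing rule
    are assumptions for illustration only.
    """
    segments = []
    for start in range(0, num_frames, seg_len):
        end = min(start + seg_len, num_frames)
        # Bidirectional context: a few frames from the previous and
        # next segments condition the current one.
        ctx_lo = max(0, start - ctx_len)
        ctx_hi = min(num_frames, end + ctx_len)
        segments.append({"core": (start, end), "window": (ctx_lo, ctx_hi)})
    return segments

# Example: 81 frames split into segments of 27 with 4-frame contexts.
plan = segment_schedule(81, 27, 4)
# Each window spans at most 27 + 2 * 4 = 35 frames, so peak memory is
# bounded by the segment size, not by the full 81-frame video.
```

The key design point the sketch conveys: adding more total frames only adds more segments, never larger ones, so memory stays roughly constant as video length grows.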

On VBench-I2V at 2K, SwiftI2V matches strong end-to-end baselines on key I2V metrics while reducing total GPU-time by 202×.

🏗️ Method

SwiftI2V two-stage pipeline

Overall two-stage pipeline of SwiftI2V.

Conditional Segment-wise Generation

Conditional Segment-wise Generation (CSG) with bidirectional contextual interaction.

🎥 Demo

For the full gallery, qualitative comparisons, ablations, and RTX 4090 results, please visit our Project Page.

📋 TODO

  • ✅️ Release project page
  • ✅️ Release arXiv paper
  • Release inference code
  • Release Stage I / Stage II checkpoints

📝 Citation

If you find SwiftI2V useful in your research, please consider citing:

@misc{liu2026swifti2vefficienthighresolutionimagetovideo,
      title={SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation},
      author={YaoYang Liu and Yuechen Zhang and Wenbo Li and Yufei Zhao and Rui Liu and Long Chen},
      year={2026},
      eprint={2605.06356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.06356}
}

🙏 Acknowledgements

We gratefully acknowledge the following open-source projects that made this work possible:

  • Wan — the video generation foundation model we build upon.
  • DiffSynth-Studio — for its flexible diffusion training and inference framework.
  • LightX2V — for its efficient inference utilities.

The project page design is adapted from the Nerfies academic template (CC BY-SA 4.0).

About

Project page for paper "SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation"
