🌐 Project Page | 📑 Paper | 🤗 Models(🧨) | 🤗 Datasets | 🤗 Demo
Kiwi-Edit is a versatile video editing framework built on an MLLM encoder and a video DiT for:
- instruction video editing
- reference image + instruction video editing
Remove
"Remove the person wearing a light blue shirt and dark pants from the entire video sequence."
Background Reference
"Replace the background with a Chinese ink painting, featuring a large golden mountain peak rising above swirling clouds."
- [2026-03-05] Hugging Face demo released. Check it out here!
- [2026-03-03] Code and model released.
Environment Requirements
- Python 3.10 + CUDA 12.8 environment
- PyTorch==2.7, Accelerate
- For training: DeepSpeed, FlashAttention
1) Prepare environment and base weights:
```shell
# Create conda environment
conda create -n diffsynth python=3.10 -y
conda activate diffsynth

# Install PyTorch 2.7 with CUDA 12.8
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .
conda install mpi4py -y
pip install deepspeed
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install transformers==4.57.0 huggingface-hub==0.34 wandb

# Prepare the pretrained video DiT
mkdir -p models/Wan-AI/
# Please log in to Hugging Face first
hf download Wan-AI/Wan2.2-TI2V-5B --local-dir ./models/Wan-AI/Wan2.2-TI2V-5B
```

or

```shell
bash install_full_env.sh
```

2) Run a quick test on the demo video:
```shell
python demo.py \
  --ckpt_path path_to_checkpoint \
  --video_path ./demo_data/video/source/0005e4ad9f49814db1d3f2296b911abf.mp4 \
  --prompt "Remove the monkey." \
  --save_path ./output/demo_output.mp4
```

1) Prepare environment:
```shell
# Create conda environment
conda create -n diffusers python=3.10 -y
conda activate diffusers

# Install PyTorch 2.7 with CUDA 12.8
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install diffusers decord einops accelerate transformers==4.57.0 opencv-python av
```

or

```shell
bash install_diffusers_env.sh
```

Diffusers model zoo:
| Model | Type | Hugging Face |
|---|---|---|
| kiwi-edit-5b-instruct-only-diffusers | Finetuned on instruction data only | linyq/kiwi-edit-5b-instruct-only-diffusers |
| kiwi-edit-5b-reference-only-diffusers | Finetuned on reference data only | linyq/kiwi-edit-5b-reference-only-diffusers |
| kiwi-edit-5b-instruct-reference-diffusers | Finetuned on instruction and reference data | linyq/kiwi-edit-5b-instruct-reference-diffusers |
2) Run a quick test on demo video:
```shell
python diffusers_demo.py \
  --video_path ./demo_data/video/source/0005e4ad9f49814db1d3f2296b911abf.mp4 \
  --prompt "Remove the monkey." \
  --save_path output.mp4 \
  --model_path linyq/kiwi-edit-5b-instruct-only-diffusers
```

All training metadata uses CSV; we provide demo data in the repo:
- Image stage:
  `src_video,tgt_video,prompt`
  Example: `demo_data/image_demo_training_set.csv`
- Video stage:
  `src_video,tgt_video,prompt`
  Example: `demo_data/video_demo_training_set.csv`
- Reference-video stage:
  `src_video,tgt_video,ref_image,prompt`
  Example: `demo_data/video_ref_demo_training_set.csv`
For full data training, please refer to DATASET.md.
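To build your own metadata file, a minimal sketch using Python's standard `csv` module is shown below; the filename and media paths are placeholders, but the column names and order follow the reference-video stage schema above (`src_video,tgt_video,ref_image,prompt`).

```python
import csv

# Column order must match the reference-video stage schema.
FIELDS = ["src_video", "tgt_video", "ref_image", "prompt"]

# Placeholder rows; substitute your own clip and reference paths.
rows = [{
    "src_video": "data/source/clip_0001.mp4",
    "tgt_video": "data/target/clip_0001.mp4",
    "ref_image": "data/reference/clip_0001.png",
    "prompt": "Replace the background with a snowy mountain range.",
}]

with open("my_ref_training_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to sanity-check the header before launching training.
with open("my_ref_training_set.csv", newline="") as f:
    reader = csv.DictReader(f)
    assert reader.fieldnames == FIELDS
    loaded = list(reader)
```

The read-back check catches a common failure mode (a missing or reordered header) before a long training job starts.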
Use the provided scripts in scripts/. Example:
```shell
bash scripts/run_wan2.2_ti2v_5b_qwen25vl_3b_stage3_img_vid_refvid_720x1280_81f.sh
```

Training scripts, key parameters, and placeholder links:
| Script | Model Size | Training Stage (Data) | Max Pixels | Frames | LR | Steps | Model Weights |
|---|---|---|---|---|---|---|---|
| link | Qwen2.5-VL-3B+Wan2.2-TI2V-5B | Stage 1 (Image) | 1024x1024 | 1 | 1e-5 | 30K | link |
| link | Qwen2.5-VL-3B+Wan2.2-TI2V-5B | Stage 2 (Image + Video) | 600x600 | 81 | 1e-5 | 20K | link |
| link | Qwen2.5-VL-3B+Wan2.2-TI2V-5B | Stage 2 (Image + Video) | 720x1280 | 81 | 1e-5 | 20K | link |
| link | Qwen2.5-VL-3B+Wan2.2-TI2V-5B | Stage 3 (Image + Video + Ref-Video) | 720x1280 | 81 | 5e-6 | 15K | link |
| link | Qwen2.5-VL-3B+Wan2.2-TI2V-5B | Stage 3 (Ref-Video) | 720x1280 | 81 | 5e-6 | 30K | link |
Benchmark inference example:

```shell
# --bench selects the benchmark: openve or refvie
python infer.py \
  --ckpt_path path_to_ckpt \
  --bench openve \
  --max_frame 81 \
  --max_pixels 921600 \
  --save_dir ./infer_results/exp_name/
```

For score evaluation, see `eval_openve_gemini.py` and `eval_refvie_gemini.py`.
Example:

```shell
python eval_openve_gemini.py --video_paths path_to_videos
```

- Review and secure API key handling before running the Gemini-based evaluation scripts.
- For Diffusers conversion, see `utils/convert_diffusers/README.md`.
- Default benchmark paths in inference scripts assume datasets are under `./benchmark/...`.
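When pointing the evaluation scripts at generated results, it helps to collect clips deterministically. A small sketch, assuming the eval scripts accept a directory of `.mp4` files (check their argparse options; the directory names below are placeholders):

```python
from pathlib import Path

def collect_videos(result_dir):
    """Gather generated .mp4 clips under an inference output
    directory, sorted so evaluation order is reproducible."""
    return sorted(str(p) for p in Path(result_dir).rglob("*.mp4"))

# Demo with a throwaway directory (placeholder layout mirroring ./infer_results/exp_name/).
demo = Path("infer_results_demo/exp_name")
demo.mkdir(parents=True, exist_ok=True)
(demo / "b.mp4").touch()
(demo / "a.mp4").touch()
videos = collect_videos("infer_results_demo")
```

Sorting matters because `rglob` order is filesystem-dependent, and a stable order makes per-video scores easy to diff across runs.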
Kiwi-Edit builds on the training framework ModelScope DiffSynth-Studio; the open-source datasets Ditto-1M, OpenVE-3M, ReCo, GPT-Image-Edit-1.5M, and NHR-Edit; the reward model EditScore; and the image generation model Qwen-Image-Edit.
If you use our code in your work, please cite our paper:
```bibtex
@misc{kiwiedit,
  title={Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance},
  author={Yiqi Lin and Guoqiang Liang and Ziyun Zeng and Zechen Bai and Yanzhe Chen and Mike Zheng Shou},
  year={2026},
  eprint={2603.02175},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.02175},
}
```