showlab/Kiwi-Edit

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

🌐 Project Page  | 📑 Paper  | 🤗 Models(🧨) | 🤗 Datasets | 🤗 Demo

Kiwi-Edit is a versatile video editing framework built on an MLLM encoder and a video DiT, supporting:

  • instruction-guided video editing
  • reference image + instruction video editing

Visualization Demos

Style

Style example gif

"Apply the dynamic aesthetic of abstract art to this video."

Replace

Replace example gif

"Replace the sofa with a classic brown leather sofa with visible stitching."

Add

Add example gif

"Add a classic brown fedora hat to the boy's head."

Remove

Remove example gif

"Remove the person wearing a light blue shirt and dark pants from the entire video sequence."

Background Replace

Background replace example gif

"Replace the background with a lively urban garden scene during winter."

Subject Reference
Reference image Subject reference example gif

"Add a pair of iconic red heart-shaped sunglasses to the girl's face."

Background Reference
Reference image Background reference example gif

"Replace the background with a Chinese ink painting, featuring a large golden mountain peak rising above swirling clouds."

News

  • [2026-03-05] Hugging Face demo released. Try it here!
  • [2026-03-03] Code and models released.

Quick Start

Environment Requirements

  • Python 3.10 + CUDA 12.8 environment
  • PyTorch==2.7, Accelerate
  • For training: DeepSpeed, FlashAttention

Full Environment Installation

1) Prepare environment and base weights:

# Create conda environment
conda create -n diffsynth python=3.10 -y
conda activate diffsynth
# Install PyTorch 2.7 with CUDA
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install -e .
conda install mpi4py -y
pip install deepspeed
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu12torch2.7cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install transformers==4.57.0 huggingface-hub==0.34 wandb
# Prepare the pretrained video dit
mkdir -p models/Wan-AI/
# Log in to Hugging Face first
hf download Wan-AI/Wan2.2-TI2V-5B --local-dir ./models/Wan-AI/Wan2.2-TI2V-5B

or

bash install_full_env.sh

2) Run a quick test on demo video:

python demo.py \
  --ckpt_path path_to_checkpoint \
  --video_path ./demo_data/video/source/0005e4ad9f49814db1d3f2296b911abf.mp4 \
  --prompt "Remove the monkey." \
  --save_path ./output/demo_output.mp4
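
To apply the same instruction to many clips, the demo.py invocation above can be wrapped in a small batch script. A minimal sketch, assuming the flag names shown above and a hypothetical folder of .mp4 files:

```python
import subprocess
from pathlib import Path

def build_demo_cmd(ckpt_path, video_path, prompt, save_path):
    """Assemble the demo.py invocation shown above as an argument list."""
    return [
        "python", "demo.py",
        "--ckpt_path", str(ckpt_path),
        "--video_path", str(video_path),
        "--prompt", prompt,
        "--save_path", str(save_path),
    ]

def run_batch(ckpt_path, video_dir, prompt, out_dir):
    """Run demo.py once per .mp4 in video_dir (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for video in sorted(Path(video_dir).glob("*.mp4")):
        cmd = build_demo_cmd(ckpt_path, video, prompt, out / video.name)
        subprocess.run(cmd, check=True)
```
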

Diffusers Inference Environment Installation

1) Prepare environment:

# Create conda environment
conda create -n diffusers python=3.10 -y
conda activate diffusers
# Install PyTorch 2.7 with CUDA
pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/cu128
pip install diffusers decord einops accelerate transformers==4.57.0 opencv-python av

or

bash install_diffusers_env.sh

Diffusers model zoo:

Model | Training Data | Hugging Face
kiwi-edit-5b-instruct-only-diffusers | Fine-tuned on instruction data only | linyq/kiwi-edit-5b-instruct-only-diffusers
kiwi-edit-5b-reference-only-diffusers | Fine-tuned on reference data only | linyq/kiwi-edit-5b-reference-only-diffusers
kiwi-edit-5b-instruct-reference-diffusers | Fine-tuned on instruction and reference data | linyq/kiwi-edit-5b-instruct-reference-diffusers

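When scripting inference, the checkpoint to load depends on whether a reference image is supplied. A minimal helper over the repo ids in the model zoo table above; the selection logic itself is our own convention, not part of the release:

```python
# Repo ids taken from the Diffusers model zoo table above.
MODEL_ZOO = {
    "instruct": "linyq/kiwi-edit-5b-instruct-only-diffusers",
    "reference": "linyq/kiwi-edit-5b-reference-only-diffusers",
    "instruct+reference": "linyq/kiwi-edit-5b-instruct-reference-diffusers",
}

def pick_model(uses_reference: bool, uses_instruction: bool = True) -> str:
    """Pick a checkpoint id; we assume the combined model covers both modes."""
    if uses_reference and uses_instruction:
        return MODEL_ZOO["instruct+reference"]
    if uses_reference:
        return MODEL_ZOO["reference"]
    return MODEL_ZOO["instruct"]
```
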
2) Run a quick test on demo video:

python diffusers_demo.py \
    --video_path ./demo_data/video/source/0005e4ad9f49814db1d3f2296b911abf.mp4 \
    --prompt "Remove the monkey." \
    --save_path output.mp4 --model_path linyq/kiwi-edit-5b-instruct-only-diffusers

Training and Evaluation

Dataset Format

All training metadata uses CSV; we provide demo data in the repo:

  • Image stage: src_video, tgt_video, prompt
    Example: demo_data/image_demo_training_set.csv
  • Video stage: src_video, tgt_video, prompt
    Example: demo_data/video_demo_training_set.csv
  • Reference-video stage: src_video, tgt_video, ref_image, prompt
    Example: demo_data/video_ref_demo_training_set.csv

For full data training, please refer to DATASET.md.
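
The CSV layout above can be generated programmatically. A short sketch that writes and re-reads a reference-video metadata file (the file and media paths are hypothetical):

```python
import csv

# One metadata row per training sample, matching the reference-video stage columns.
rows = [
    {
        "src_video": "videos/source/0001.mp4",
        "tgt_video": "videos/target/0001.mp4",
        "ref_image": "images/ref/0001.png",
        "prompt": "Add a classic brown fedora hat to the boy's head.",
    },
]

with open("video_ref_training_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["src_video", "tgt_video", "ref_image", "prompt"]
    )
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the column layout.
with open("video_ref_training_set.csv", newline="") as f:
    loaded = list(csv.DictReader(f))
```

For the image and video stages, drop the ref_image column.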

Training

Use the provided scripts in scripts/. Example:

bash scripts/run_wan2.2_ti2v_5b_qwen25vl_3b_stage3_img_vid_refvid_720x1280_81f.sh

Training scripts, key parameters, and placeholder links:

Script | Model Size | Training Stage (Data) | Max Pixels | Frames | LR | Steps | Model Weights
link | Qwen2.5-VL-3B + Wan2.2-TI2V-5B | Stage 1 (Image) | 1024x1024 | 1 | 1e-5 | 30K | link
link | Qwen2.5-VL-3B + Wan2.2-TI2V-5B | Stage 2 (Image + Video) | 600x600 | 81 | 1e-5 | 20K | link
link | Qwen2.5-VL-3B + Wan2.2-TI2V-5B | Stage 2 (Image + Video) | 720x1280 | 81 | 1e-5 | 20K | link
link | Qwen2.5-VL-3B + Wan2.2-TI2V-5B | Stage 3 (Image + Video + Ref-Video) | 720x1280 | 81 | 5e-6 | 15K | link
link | Qwen2.5-VL-3B + Wan2.2-TI2V-5B | Stage 3 (Ref-Video) | 720x1280 | 81 | 5e-6 | 30K | link

Evaluation

Example benchmark inference command:

python infer.py \
  --ckpt_path path_to_ckpt \
  --bench openve \
  --max_frame 81 \
  --max_pixels 921600 \
  --save_dir ./infer_results/exp_name/

Use --bench refvie for the reference-based benchmark.
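
Note that 921600 equals 720x1280, the resolution used in the later training stages in the table above; presumably --max_pixels caps the sampled frame area. A quick sanity check, plus a hypothetical helper that scales a frame to fit the budget (the real script's resizing policy, e.g. rounding to multiples, may differ):

```python
def fit_to_pixel_budget(width, height, max_pixels=921600):
    """Scale (width, height) down so width * height <= max_pixels, keeping aspect ratio."""
    if width * height <= max_pixels:
        return width, height
    scale = (max_pixels / (width * height)) ** 0.5
    return int(width * scale), int(height * scale)

assert 720 * 1280 == 921600  # the default budget equals 720x1280
```
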

For score evaluation, see eval_openve_gemini.py and eval_refvie_gemini.py.

Example:

python eval_openve_gemini.py --video_paths path_to_videos

Additional Notes

  • Review and secure API key handling before running Gemini-based evaluation scripts.
  • For Diffusers conversion, see utils/convert_diffusers/README.md.
  • Default benchmark paths in inference scripts assume datasets are under ./benchmark/....

Acknowledgements

Kiwi-Edit builds on the ModelScope DiffSynth-Studio training framework; the open-source datasets Ditto-1M, OpenVE-3M, ReCo, GPT-Image-Edit-1.5M, and NHR-Edit; the reward model EditScore; and the image generation model Qwen-Image-Edit.

Citation

If you use our code in your work, please cite our paper:

@misc{kiwiedit,
      title={Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance}, 
      author={Yiqi Lin and Guoqiang Liang and Ziyun Zeng and Zechen Bai and Yanzhe Chen and Mike Zheng Shou},
      year={2026},
      eprint={2603.02175},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.02175}, 
}
