🔥🔥🔥 Welcome! Check out our full paper!
teaser1.mp4
teaser2.mp4
- [2026.02.21]: Paper accepted by CVPR 2026!
- [2026.01.14]: Paper released on arXiv!
- [2025.10.22]: Special thanks to @kijai for adding MoCha to the custom ComfyUI node WanVideoWrapper!
- [2025.10.21]: Try our work with ComfyUI workflow!
- [2025.10.21]: Release the inference code.
- [2025.10.20]: Release the project page.
Controllably replacing a video character with a user-provided one remains a challenging problem due to the lack of qualified paired video data. Prior works have predominantly adopted a reconstruction-based paradigm reliant on per-frame masks and explicit structural guidance (e.g., pose, depth). This reliance, however, renders them fragile in complex scenarios involving occlusions, rare poses, character-object interactions, or complex illumination, often resulting in visual artifacts and temporal discontinuities. In this paper, we propose MoCha, a novel framework that bypasses these limitations: it requires only a single first-frame mask and re-renders the character by unifying the different conditions into a single token stream. Further, MoCha adopts a condition-aware RoPE to support multiple reference images and variable-length video generation. To overcome the data bottleneck, we construct a comprehensive data synthesis pipeline to collect qualified paired training videos. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches.
Step 1: Clone this repository
git clone https://github.com/Orange-3DV-Team/MoCha.git
cd MoCha

Step 2: Set up the environment
# 1. Create conda environment
conda create -n MoCha python=3.10
# 2. Activate the environment
conda activate MoCha
# 3. Install pip dependencies
pip install -r requirements.txt

Step 3: Download the pretrained checkpoints
- Download the pre-trained Wan2.1 models from Hugging Face
- Download the pre-trained MoCha checkpoint

Please download them from Hugging Face and place them in ./checkpoints.
Step 4: Test the example videos
python inference_mocha.py

To start your own character replacement with MoCha, the following three inputs are required:
- Source Video: The original video with the character to be replaced.
- Designation Mask for the First Frame: A mask marking the source character to be replaced in the first frame of the Source Video.
- Reference Images: Images of the new character for replacement, with a clean background. We recommend uploading at least one high-quality, front-facing facial close-up.
start_MoCha.mp4
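Before launching inference, it can help to sanity-check that all three inputs actually exist on disk. Below is a minimal sketch of such a check; `check_inputs` and the example paths are our own illustration, not part of the MoCha codebase:

```python
from pathlib import Path


def check_inputs(source_video, source_mask, references):
    """Verify that the MoCha input files exist before running inference.

    `references` may contain "None" or empty entries, which are skipped
    (mirroring the optional second reference image).
    """
    paths = [Path(source_video), Path(source_mask)]
    paths += [Path(r) for r in references if r and r != "None"]
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing input files: {missing}")
    return True
```

Running this once per row of your test CSV catches broken paths early, instead of failing midway through generation.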
Then organize your test data following the structure of the ./data/test_data.csv.
- source_video: Path to the Source Video.
- source_mask: Path to the Designation Mask.
- reference_1: Path to the first Reference Image.
- reference_2: Path to the second Reference Image. This image should be a high-quality, front-facing facial close-up. (You can even crop and zoom in on your first reference image!) If you really cannot provide this image, leave it as None.
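The CSV can also be generated programmatically. A minimal sketch using the column names described above (all file paths below are placeholders, to be replaced with your own data):

```python
import csv

# Placeholder paths -- replace with your own files.
rows = [
    {
        "source_video": "data/videos/clip_001.mp4",
        "source_mask": "data/masks/clip_001_first_frame.png",
        "reference_1": "data/refs/new_char_full.png",
        "reference_2": "data/refs/new_char_face.png",  # front-facing close-up
    },
    {
        "source_video": "data/videos/clip_002.mp4",
        "source_mask": "data/masks/clip_002_first_frame.png",
        "reference_1": "data/refs/another_char.png",
        "reference_2": "None",  # no close-up available
    },
]

with open("my_test_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["source_video", "source_mask", "reference_1", "reference_2"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file can then be passed to the inference script via `--data_path`.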
Finally, test your videos by:
python inference_mocha.py --data_path path/to/your/data.csv

Have more ideas? Scan the QR code and join the WeChat group for in-depth discussion!
Please leave us a star 🌟 and cite our repo if you find our work helpful.
@article{orange2025mocha,
  title={MoCha: End-to-End Video Character Replacement without Structural Guidance},
  author={Zhengbo Xu and Jie Ma and Ziheng Wang and Zhan Peng and Jun Liang and Jing Li},
  journal={arXiv preprint arXiv:2601.08587},
  year={2026},
  url={https://github.com/Orange-3DV-Team/MoCha}
}

