🔥🔥🔥 Welcome! Check out our full paper!
teaser1.mp4
teaser2.mp4
- [2026.02.21]: Paper accepted by CVPR 2026!
- [2026.01.14]: Paper released on arXiv!
- [2025.10.22]: Special thanks to @kijai for adding MoCha to the custom ComfyUI node WanVideoWrapper!
- [2025.10.21]: Try our work with ComfyUI workflow!
- [2025.10.21]: Release the inference code.
- [2025.10.20]: Release the project page.
Controllably replacing a video character with a user-provided one remains a challenging problem due to the lack of qualified paired video data. Prior works have predominantly adopted a reconstruction-based paradigm reliant on per-frame masks and explicit structural guidance (e.g., pose, depth). This reliance, however, renders them fragile in complex scenarios involving occlusions, rare poses, character-object interactions, or complex illumination, often resulting in visual artifacts and temporal discontinuities. In this paper, we propose MoCha, a novel framework that bypasses these limitations: it requires only a single first-frame mask and re-renders the character by unifying the different conditions into a single token stream. Further, MoCha adopts a condition-aware RoPE to support multiple reference images and variable-length video generation. To overcome the data bottleneck, we construct a comprehensive data synthesis pipeline to collect qualified paired training videos. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches.
Step 1: Clone this repository
git clone https://github.com/Orange-3DV-Team/MoCha.git
cd MoCha

Step 2: Set up the environment
# 1. Create conda environment
conda create -n MoCha python=3.10
# 2. Activate the environment
conda activate MoCha
# 3. Install pip dependencies
pip install -r requirements.txt

Step 3: Download the pretrained checkpoints
- Download the pre-trained Wan2.1 models from Hugging Face
- Download the pre-trained MoCha checkpoint

Please download them from Hugging Face and place them in ./checkpoints.
Step 4: Test the example videos
python inference_mocha.py

To start your own character replacement with MoCha, the following three inputs are required:
- Source Video: The original video with the character to be replaced.
- Designation Mask for the First Frame: A mask marking the source character to be replaced in the first frame of the Source Video.
- Reference Images: Images of the new character for replacement, with a clean background. We recommend uploading at least one high-quality, front-facing facial close-up.
start_MoCha.mp4
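Before launching inference, it can help to sanity-check that all three inputs actually exist on disk. Below is a minimal sketch of such a check; `check_inputs` and the example paths are our own illustration, not part of the MoCha codebase:

```python
from pathlib import Path


def check_inputs(source_video, source_mask, references):
    """Verify that the MoCha input files exist before running inference.

    `references` may contain "None" or empty entries, which are skipped
    (mirroring the optional second reference image).
    """
    paths = [Path(source_video), Path(source_mask)]
    paths += [Path(r) for r in references if r and r != "None"]
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing input files: {missing}")
    return True
```

Running this once per row of your test CSV catches broken paths early, instead of failing midway through generation.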
Then organize your test data following the structure of the ./data/test_data.csv.
- source_video: Path to the Source Video.
- source_mask: Path to the Designation Mask.
- reference_1: Path to the first Reference Image.
- reference_2: Path to the second Reference Image. This image should be a high-quality, front-facing facial close-up. (You can even crop and zoom in on your first reference image!) If you really cannot provide this image, leave it as None.
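The CSV can also be generated programmatically. A minimal sketch using the column names described above (all file paths below are placeholders, to be replaced with your own data):

```python
import csv

# Placeholder paths -- replace with your own files.
rows = [
    {
        "source_video": "data/videos/clip_001.mp4",
        "source_mask": "data/masks/clip_001_first_frame.png",
        "reference_1": "data/refs/new_char_full.png",
        "reference_2": "data/refs/new_char_face.png",  # front-facing close-up
    },
    {
        "source_video": "data/videos/clip_002.mp4",
        "source_mask": "data/masks/clip_002_first_frame.png",
        "reference_1": "data/refs/another_char.png",
        "reference_2": "None",  # no close-up available
    },
]

with open("my_test_data.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["source_video", "source_mask", "reference_1", "reference_2"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

The resulting file can then be passed to the inference script via `--data_path`.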
Finally, test your videos by:
python inference_mocha.py --data_path path/to/your/data.csv

Have more ideas? Scan the QR code and join the WeChat group for in-depth discussion!
Please leave us a star 🌟 and cite our repo if you find our work helpful.
@article{orange2025mocha,
  title={MoCha: End-to-End Video Character Replacement without Structural Guidance},
  author={Zhengbo Xu and Jie Ma and Ziheng Wang and Zhan Peng and Jun Liang and Jing Li},
  journal={arXiv preprint arXiv:2601.08587},
  year={2026},
  url={https://github.com/Orange-3DV-Team/MoCha}
}

