Yichen Liu2, Xin Gao2, Cunjian Chen3, Shilei Wen2,§, Chi-Wing Fu1, Pheng-Ann Heng1,§
- 2026.04: Code is under internal review. Please stay tuned!
- 2026.04: The technical report of OmniShow is released!
- Multimodal Video Generation Model: OmniShow is an all-in-one model for Human-Object Interaction Video Generation (HOIVG) with text, reference image, audio, and pose conditioning.
- Flexible Task Coverage: A single model supports R2V, RA2V, RP2V, and RAP2V generation (reference-, reference+audio-, reference+pose-, and reference+audio+pose-to-video) within one coherent framework.
- Enabling Broader Applications: OmniShow exhibits remarkable versatility in broader applications, such as audio-driven avatars, object swapping, and video remixing.
- New Benchmark: HOIVG-Bench provides a dedicated and comprehensive benchmark for evaluating HOIVG under diverse multimodal conditions.
We propose OmniShow, a video generation model that unifies text, reference image, audio, and pose conditions for HOIVG. It consists of three key components:
- Unified Channel-wise Conditioning effectively injects reference image and pose cues via unified channel concatenation. It augments noisy video tokens with pseudo-frames, which are supervised by a reference reconstruction loss to preserve semantic details.
- Gated Local-Context Attention ensures precise audio-visual synchronization. It packs audio features with sufficient contextual information and injects them via masked attention to align video frames with corresponding audio segments, followed by adaptive gating to stabilize early training.
- Decoupled-Then-Joint Training enables efficient use of heterogeneous datasets. We first train specialized R2V and A2V models on separate sub-task datasets, fuse them via weight interpolation, and then jointly fine-tune the fused model to unify multimodal capabilities.
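To make the channel-wise conditioning idea concrete, here is a minimal sketch of injecting reference and pose cues by concatenating pseudo-frames with noisy video latents along the channel axis. Tensor shapes and the function name are illustrative assumptions, not the actual OmniShow implementation:

```python
import numpy as np

def channelwise_condition(noisy_video, ref_latent, pose_latent):
    """Illustrative channel-wise conditioning: augment noisy video
    latents (T, C, H, W) with a reference pseudo-frame (C, H, W) and
    per-frame pose latents (T, C, H, W) via channel concatenation."""
    T, C, H, W = noisy_video.shape
    # Broadcast the single reference latent across all T frames.
    ref = np.broadcast_to(ref_latent, (T, C, H, W))
    # Stack video, reference, and pose cues on the channel dimension.
    return np.concatenate([noisy_video, ref, pose_latent], axis=1)  # (T, 3C, H, W)
```

The concatenated pseudo-frames would then be supervised by the reference reconstruction loss described above, so the model is pushed to preserve the reference's semantic details.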
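The gated local-context attention can be sketched as masked cross-attention in which each video frame attends only to audio tokens near its aligned position, with the output scaled by a gate that is zero-initialized so early training is unaffected by the audio branch. Shapes, the alignment rule, and all names here are simplifying assumptions for illustration:

```python
import numpy as np

def gated_local_audio_attention(frame_feats, audio_feats, window, gate):
    """Illustrative masked cross-attention: frame i attends only to
    audio tokens within `window` of its aligned audio index; the
    residual update is scaled by a learnable scalar `gate`."""
    Tf, D = frame_feats.shape
    Ta, _ = audio_feats.shape
    scores = frame_feats @ audio_feats.T / np.sqrt(D)          # (Tf, Ta)
    # Simple alignment: frame i maps to audio index round(i * Ta / Tf).
    centers = np.round(np.arange(Tf) * Ta / Tf).astype(int)
    idx = np.arange(Ta)[None, :]
    mask = np.abs(idx - centers[:, None]) <= window            # local window
    scores = np.where(mask, scores, -np.inf)                   # mask out far audio
    w = np.exp(scores - scores.max(axis=1, keepdims=True))     # stable softmax
    w = w / w.sum(axis=1, keepdims=True)
    return frame_feats + gate * (w @ audio_feats)              # gated residual
```

With `gate = 0` the layer is an identity on the frame features, which is what stabilizes early training before the gate is learned.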
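The fusion step of decoupled-then-joint training amounts to linear interpolation between the two specialist checkpoints before joint fine-tuning, a standard model-merging recipe. A minimal sketch over plain state dicts (the interpolation weight `alpha` is an assumed hyperparameter):

```python
def fuse_weights(r2v_state, a2v_state, alpha=0.5):
    """Linearly interpolate two specialist checkpoints with matching
    keys; joint fine-tuning on mixed data would follow this merge."""
    return {k: alpha * r2v_state[k] + (1 - alpha) * a2v_state[k]
            for k in r2v_state}
```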
To systematically evaluate HOIVG under diverse multimodal conditions, we construct HOIVG-Bench, a dedicated benchmark with 135 carefully curated samples and task-specific metrics. Each sample contains a detailed text caption, a human reference image, an object reference image, semantically aligned audio, and a coherent pose sequence.
Across varied tasks, OmniShow exhibits high-fidelity reference preservation, natural motion dynamics, and precise audio-visual synchronization. Please visit the OmniShow project page for more immersive and diverse video demonstrations.
OmniShow achieves overall state-of-the-art performance across various multimodal generation tasks, and it is the only model that supports the full RAP2V setting.

**R2V**

| Method | TA↑ | FaceSim↑ | NexusScore↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.523 | 0.440 | 0.359 | 0.452 | 0.697 | 10.11 | 5.286 |
| HuMo-1.7B | 7.087 | 0.647 | 0.333 | 0.441 | 0.723 | 9.76 | 3.406 |
| HuMo-17B | 7.949 | 0.843 | 0.346 | 0.448 | 0.726 | 9.97 | 3.685 |
| VACE | 8.413 | 0.759 | 0.368 | 0.457 | 0.722 | 10.72 | 5.442 |
| Phantom-1.3B | 8.342 | 0.708 | 0.351 | 0.459 | 0.722 | 10.90 | 5.637 |
| Phantom-14B | 8.609 | 0.876 | 0.366 | 0.449 | 0.741 | 10.93 | 5.517 |
| OmniShow (Ours) | 7.746 | 0.874 | 0.389 | 0.468 | 0.740 | 11.12 | 5.885 |

**RA2V**

| Method | TA↑ | FaceSim↑ | NexusScore↑ | Sync-C↑ | Sync-D↓ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.289 | 0.457 | 0.350 | 6.072 | 10.08 | 0.439 | 0.715 | 9.15 | 3.658 |
| HuMo-1.7B | 7.489 | 0.575 | 0.329 | 7.234 | 9.117 | 0.428 | 0.731 | 9.97 | 4.182 |
| HuMo-17B | 8.146 | 0.805 | 0.344 | 8.013 | 8.316 | 0.439 | 0.739 | 10.27 | 4.269 |
| OmniShow (Ours) | 8.093 | 0.810 | 0.369 | 8.612 | 7.608 | 0.465 | 0.742 | 10.86 | 5.554 |

**RP2V**

| Method | TA↑ | FaceSim↑ | NexusScore↑ | AKD↓ | PCK↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| AnchorCrafter | 2.669 | 0.404 | 0.215 | 0.229 | 0.176 | 0.499 | 0.673 | 8.95 | 4.241 |
| VACE | 7.690 | 0.600 | 0.352 | 0.206 | 0.336 | 0.450 | 0.712 | 10.14 | 5.393 |
| OmniShow (Ours) | 6.526 | 0.474 | 0.418 | 0.174 | 0.460 | 0.447 | 0.722 | 10.28 | 4.937 |
If you find this work useful in your research, please cite:
```bibtex
@article{zhou2026omnishow,
  title={OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation},
  author={Zhou, Donghao and Liu, Guisheng and Yang, Hao and Li, Jiatong and Lin, Jingyu and Huang, Xiaohu and Liu, Yichen and Gao, Xin and Chen, Cunjian and Wen, Shilei and Fu, Chi-Wing and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2604.11804},
  year={2026}
}
```