Correr-Zhou/OmniShow


OmniShow logo

OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

Donghao Zhou1,*, Guisheng Liu2,*, Hao Yang2, Jiatong Li2,†, Jingyu Lin3, Xiaohu Huang4,
Yichen Liu2, Xin Gao2, Cunjian Chen3, Shilei Wen2,§, Chi-Wing Fu1, Pheng-Ann Heng1,§

1The Chinese University of Hong Kong, 2ByteDance, 3Monash University, 4The University of Hong Kong

*Equal contribution, †Project lead, §Corresponding author


🔥 Updates

🌟 Highlights

  • Multimodal Video Generation Model: OmniShow is an all-in-one model for Human-Object Interaction Video Generation (HOIVG) with text, reference image, audio, and pose conditioning.
  • Flexible Task Coverage: A single model supports reference-to-video (R2V), reference+audio-to-video (RA2V), reference+pose-to-video (RP2V), and reference+audio+pose-to-video (RAP2V) generation within one coherent framework.
  • Enabling Broader Applications: OmniShow is remarkably versatile, extending to applications such as audio-driven avatars, object swapping, and video remixing.
  • New Benchmark: HOIVG-Bench provides a dedicated and comprehensive benchmark for evaluating HOIVG under diverse multimodal conditions.
OmniShow Overview

🚀 Introducing OmniShow

We propose OmniShow, a video generation model that unifies text, reference image, audio, and pose conditions for HOIVG. It consists of three key components:

  1. Unified Channel-wise Conditioning effectively injects reference image and pose cues via unified channel concatenation. It augments noisy video tokens with pseudo-frames, which are supervised by a reference reconstruction loss to preserve semantic details.
  2. Gated Local-Context Attention ensures precise audio-visual synchronization. It packs audio features with sufficient contextual information and injects them via masked attention to align video frames with corresponding audio segments, followed by adaptive gating to stabilize early training.
  3. Decoupled-Then-Joint Training enables efficient use of heterogeneous datasets. We first train specialized R2V and A2V models on separate sub-task datasets, fuse them via weight interpolation, and then jointly fine-tune the fused model to unify multimodal capabilities.
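The weight-interpolation fusion in step 3 can be sketched as below. This is a minimal illustration, not the released training code: the function name, the 50/50 blend ratio, and the toy state dicts are all assumptions.

```python
import torch


def interpolate_weights(state_a, state_b, alpha=0.5):
    """Fuse two specialist checkpoints by per-parameter linear interpolation.

    Returns alpha * A + (1 - alpha) * B for every parameter; both state
    dicts must share identical keys and tensor shapes.
    """
    return {k: alpha * state_a[k] + (1.0 - alpha) * state_b[k] for k in state_a}


# Toy "checkpoints" standing in for the specialized R2V and A2V models.
r2v = {"proj.weight": torch.ones(2, 2)}
a2v = {"proj.weight": torch.zeros(2, 2)}
fused = interpolate_weights(r2v, a2v, alpha=0.5)  # every entry becomes 0.5
```

In this scheme, the fused weights would then serve as the initialization for the joint fine-tuning stage.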
OmniShow Pipeline
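The masked, gated audio attention of component 2 above can be sketched as follows. All shapes, names, and the one-to-one frame/audio alignment are illustrative assumptions; the actual model's audio-context packing may differ.

```python
import math

import torch


def local_context_mask(n_frames, n_audio, window=1):
    """Boolean mask: frame i may attend to audio segments within +/- `window` of i.

    Assumes a one-to-one temporal alignment between video frames and audio
    segments (an illustrative simplification).
    """
    f = torch.arange(n_frames).unsqueeze(1)
    a = torch.arange(n_audio).unsqueeze(0)
    return (f - a).abs() <= window


def gated_masked_attention(q, k, v, mask, gate):
    """Cross-attention restricted by `mask`, scaled by a learnable gate.

    With the gate initialized to zero, the audio branch is an exact no-op at
    the start of training and is blended in gradually as the gate is learned.
    """
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.tanh(gate) * (torch.softmax(scores, dim=-1) @ v)


torch.manual_seed(0)
q = torch.randn(4, 8)  # 4 video-frame tokens
k = torch.randn(4, 8)  # 4 aligned audio tokens
v = torch.randn(4, 8)
mask = local_context_mask(4, 4, window=1)
gate = torch.zeros(1)  # zero-init gate: audio contributes nothing initially
out = gated_masked_attention(q, k, v, mask, gate)
```

The zero-initialized gate is what stabilizes early training in this sketch: the video branch behaves exactly as if no audio conditioning were present until the gate moves away from zero.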

📊 HOIVG-Bench

To systematically evaluate HOIVG under diverse multimodal conditions, we construct HOIVG-Bench, a dedicated benchmark with 135 carefully curated samples and task-specific metrics. Each sample contains a detailed text caption, a human reference image, an object reference image, semantically aligned audio, and a coherent pose sequence.
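For illustration, one benchmark sample could be represented as a record like the following. The field names and file paths are hypothetical, not HOIVG-Bench's actual format.

```python
# Hypothetical schema for one HOIVG-Bench sample; every condition listed in
# the benchmark description maps to one field (names are illustrative).
sample = {
    "caption": "A person pours coffee from a kettle into a mug.",
    "human_ref": "refs/human_001.png",    # human reference image
    "object_ref": "refs/object_001.png",  # object reference image
    "audio": "audio/sample_001.wav",      # semantically aligned audio
    "pose_seq": "poses/sample_001.npz",   # coherent pose sequence
}
```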

HOIVG-Bench

🎬 Demo

Across varied tasks, OmniShow exhibits high-fidelity reference preservation, natural motion dynamics, and precise audio-visual synchronization. Please visit the OmniShow project page for more immersive and diverse video demonstrations.

OmniShow Qualitative Results

🏆 Benchmark Evaluation

OmniShow achieves overall state-of-the-art performance across various multimodal generation tasks, and it is the only model that supports the full RAP2V setting.

Reference-to-Video Generation (R2V)

| Method | TA↑ | FaceSim↑ | NexusScore↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.523 | 0.440 | 0.359 | 0.452 | 0.697 | 10.11 | 5.286 |
| HuMo-1.7B | 7.087 | 0.647 | 0.333 | 0.441 | 0.723 | 9.76 | 3.406 |
| HuMo-17B | 7.949 | 0.843 | 0.346 | 0.448 | 0.726 | 9.97 | 3.685 |
| VACE | 8.413 | 0.759 | 0.368 | 0.457 | 0.722 | 10.72 | 5.442 |
| Phantom-1.3B | 8.342 | 0.708 | 0.351 | 0.459 | 0.722 | 10.90 | 5.637 |
| Phantom-14B | 8.609 | 0.876 | 0.366 | 0.449 | 0.741 | 10.93 | 5.517 |
| OmniShow (Ours) | 7.746 | 0.874 | 0.389 | 0.468 | 0.740 | 11.12 | 5.885 |

Reference+Audio-to-Video Generation (RA2V)

| Method | TA↑ | FaceSim↑ | NexusScore↑ | Sync-C↑ | Sync-D↓ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.289 | 0.457 | 0.350 | 6.072 | 10.08 | 0.439 | 0.715 | 9.15 | 3.658 |
| HuMo-1.7B | 7.489 | 0.575 | 0.329 | 7.234 | 9.117 | 0.428 | 0.731 | 9.97 | 4.182 |
| HuMo-17B | 8.146 | 0.805 | 0.344 | 8.013 | 8.316 | 0.439 | 0.739 | 10.27 | 4.269 |
| OmniShow (Ours) | 8.093 | 0.810 | 0.369 | 8.612 | 7.608 | 0.465 | 0.742 | 10.86 | 5.554 |

Reference+Pose-to-Video Generation (RP2V)

| Method | TA↑ | FaceSim↑ | NexusScore↑ | AKD↓ | PCK↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| AnchorCrafter | 2.669 | 0.404 | 0.215 | 0.229 | 0.176 | 0.499 | 0.673 | 8.95 | 4.241 |
| VACE | 7.690 | 0.600 | 0.352 | 0.206 | 0.336 | 0.450 | 0.712 | 10.14 | 5.393 |
| OmniShow (Ours) | 6.526 | 0.474 | 0.418 | 0.174 | 0.460 | 0.447 | 0.722 | 10.28 | 4.937 |

🔗 Citation

If you find this work useful in your research, please cite:

@article{zhou2026omnishow,
  title={OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation},
  author={Zhou, Donghao and Liu, Guisheng and Yang, Hao and Li, Jiatong and Lin, Jingyu and Huang, Xiaohu and Liu, Yichen and Gao, Xin and Chen, Cunjian and Wen, Shilei and Fu, Chi-Wing and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2604.11804},
  year={2026}
}
