Yichen Liu2, Xin Gao2, Cunjian Chen3, Shilei Wen2,§, Chi-Wing Fu1, Pheng-Ann Heng1,§
- 2026.04: Code is under internal review. Please stay tuned!
- 2026.04: The technical report of OmniShow is released!
- Multimodal Video Generation Model: OmniShow is an all-in-one model for Human-Object Interaction Video Generation (HOIVG) with text, reference image, audio, and pose conditioning.
- Flexible Task Coverage: A single model supports R2V, RA2V, RP2V, and RAP2V generation (reference-, reference+audio-, reference+pose-, and reference+audio+pose-to-video) within one coherent framework.
- Enabling Broader Applications: OmniShow exhibits remarkable versatility in broader applications, such as audio-driven avatars, object swapping, and video remixing.
- New Benchmark: HOIVG-Bench provides a dedicated and comprehensive benchmark for evaluating HOIVG under diverse multimodal conditions.
We propose OmniShow, a video generation model that unifies text, reference image, audio, and pose conditions for HOIVG. It consists of three key components:
- Unified Channel-wise Conditioning effectively injects reference image and pose cues via unified channel concatenation. It augments noisy video tokens with pseudo-frames, which are supervised by a reference reconstruction loss to preserve semantic details.
- Gated Local-Context Attention ensures precise audio-visual synchronization. It packs audio features with sufficient contextual information and injects them via masked attention to align video frames with corresponding audio segments, followed by adaptive gating to stabilize early training.
- Decoupled-Then-Joint Training enables efficient use of heterogeneous datasets. We first train specialized R2V and A2V models on separate sub-task datasets, fuse them via weight interpolation, and then jointly fine-tune the fused model to unify multimodal capabilities.
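To make the channel-wise conditioning idea concrete, here is a minimal sketch of injecting reference and pose cues by concatenating pseudo-frames with noisy video latents along the channel axis. Tensor shapes and the function name are illustrative assumptions, not the actual OmniShow implementation:

```python
import numpy as np

def channelwise_condition(noisy_video, ref_latent, pose_latent):
    """Illustrative channel-wise conditioning: augment noisy video
    latents (T, C, H, W) with a reference pseudo-frame (C, H, W) and
    per-frame pose latents (T, C, H, W) via channel concatenation."""
    T, C, H, W = noisy_video.shape
    # Broadcast the single reference latent across all T frames.
    ref = np.broadcast_to(ref_latent, (T, C, H, W))
    # Stack video, reference, and pose cues on the channel dimension.
    return np.concatenate([noisy_video, ref, pose_latent], axis=1)  # (T, 3C, H, W)
```

The concatenated pseudo-frames would then be supervised by the reference reconstruction loss described above, so the model is pushed to preserve the reference's semantic details.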
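The gated local-context attention can be sketched as masked cross-attention in which each video frame attends only to audio tokens near its aligned position, with the output scaled by a gate that is zero-initialized so early training is unaffected by the audio branch. Shapes, the alignment rule, and all names here are simplifying assumptions for illustration:

```python
import numpy as np

def gated_local_audio_attention(frame_feats, audio_feats, window, gate):
    """Illustrative masked cross-attention: frame i attends only to
    audio tokens within `window` of its aligned audio index; the
    residual update is scaled by a learnable scalar `gate`."""
    Tf, D = frame_feats.shape
    Ta, _ = audio_feats.shape
    scores = frame_feats @ audio_feats.T / np.sqrt(D)          # (Tf, Ta)
    # Simple alignment: frame i maps to audio index round(i * Ta / Tf).
    centers = np.round(np.arange(Tf) * Ta / Tf).astype(int)
    idx = np.arange(Ta)[None, :]
    mask = np.abs(idx - centers[:, None]) <= window            # local window
    scores = np.where(mask, scores, -np.inf)                   # mask out far audio
    w = np.exp(scores - scores.max(axis=1, keepdims=True))     # stable softmax
    w = w / w.sum(axis=1, keepdims=True)
    return frame_feats + gate * (w @ audio_feats)              # gated residual
```

With `gate = 0` the layer is an identity on the frame features, which is what stabilizes early training before the gate is learned.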
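The fusion step of decoupled-then-joint training amounts to linear interpolation between the two specialist checkpoints before joint fine-tuning, a standard model-merging recipe. A minimal sketch over plain state dicts (the interpolation weight `alpha` is an assumed hyperparameter):

```python
def fuse_weights(r2v_state, a2v_state, alpha=0.5):
    """Linearly interpolate two specialist checkpoints with matching
    keys; joint fine-tuning on mixed data would follow this merge."""
    return {k: alpha * r2v_state[k] + (1 - alpha) * a2v_state[k]
            for k in r2v_state}
```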
To systematically evaluate HOIVG under diverse multimodal conditions, we construct HOIVG-Bench, a dedicated benchmark with 135 carefully curated samples and task-specific metrics. Each sample contains a detailed text caption, a human reference image, an object reference image, semantically aligned audio, and a coherent pose sequence.
Across varied tasks, OmniShow exhibits high-fidelity reference preservation, natural motion dynamics, and precise audio-visual synchronization. Please visit the OmniShow project page for more immersive and diverse video demonstrations.
OmniShow achieves overall state-of-the-art performance across various multimodal generation tasks, and it is the only model that supports the full RAP2V setting.

**R2V**

| Method | TA↑ | FaceSim↑ | NexusScore↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.523 | 0.440 | 0.359 | 0.452 | 0.697 | 10.11 | 5.286 |
| HuMo-1.7B | 7.087 | 0.647 | 0.333 | 0.441 | 0.723 | 9.76 | 3.406 |
| HuMo-17B | 7.949 | 0.843 | 0.346 | 0.448 | 0.726 | 9.97 | 3.685 |
| VACE | 8.413 | 0.759 | 0.368 | 0.457 | 0.722 | 10.72 | 5.442 |
| Phantom-1.3B | 8.342 | 0.708 | 0.351 | 0.459 | 0.722 | 10.90 | 5.637 |
| Phantom-14B | 8.609 | 0.876 | 0.366 | 0.449 | 0.741 | 10.93 | 5.517 |
| OmniShow (Ours) | 7.746 | 0.874 | 0.389 | 0.468 | 0.740 | 11.12 | 5.885 |

**RA2V**

| Method | TA↑ | FaceSim↑ | NexusScore↑ | Sync-C↑ | Sync-D↓ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| HunyuanCustom | 7.289 | 0.457 | 0.350 | 6.072 | 10.08 | 0.439 | 0.715 | 9.15 | 3.658 |
| HuMo-1.7B | 7.489 | 0.575 | 0.329 | 7.234 | 9.117 | 0.428 | 0.731 | 9.97 | 4.182 |
| HuMo-17B | 8.146 | 0.805 | 0.344 | 8.013 | 8.316 | 0.439 | 0.739 | 10.27 | 4.269 |
| OmniShow (Ours) | 8.093 | 0.810 | 0.369 | 8.612 | 7.608 | 0.465 | 0.742 | 10.86 | 5.554 |

**RP2V**

| Method | TA↑ | FaceSim↑ | NexusScore↑ | AKD↓ | PCK↑ | AES↑ | IQA↑ | VQ↑ | MQ↑ |
|---|---|---|---|---|---|---|---|---|---|
| AnchorCrafter | 2.669 | 0.404 | 0.215 | 0.229 | 0.176 | 0.499 | 0.673 | 8.95 | 4.241 |
| VACE | 7.690 | 0.600 | 0.352 | 0.206 | 0.336 | 0.450 | 0.712 | 10.14 | 5.393 |
| OmniShow (Ours) | 6.526 | 0.474 | 0.418 | 0.174 | 0.460 | 0.447 | 0.722 | 10.28 | 4.937 |
If you find this work useful in your research, please cite:
```bibtex
@article{zhou2026omnishow,
  title={OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation},
  author={Zhou, Donghao and Liu, Guisheng and Yang, Hao and Li, Jiatong and Lin, Jingyu and Huang, Xiaohu and Liu, Yichen and Gao, Xin and Chen, Cunjian and Wen, Shilei and Fu, Chi-Wing and Heng, Pheng-Ann},
  journal={arXiv preprint arXiv:2604.11804},
  year={2026}
}
```