v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

Zhengpeng Shi1,3, Yanpeng Zhao3 † ✉, Jianqun Zhou2,3, Yuxuan Wang4, Qinrong Cui4, Wei Bi4, Songchun Zhu3, Bo Zhao1 ✉, Zilong Zheng3 ✉
1Shanghai Jiao Tong University; 2Wuhan University; 3Beijing Institute for General Artificial Intelligence; 4Independent Researcher

Abstract

AI models capable of comprehending humor hold real-world promise—for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

Task Definition

To comprehensively evaluate the capability of MLLMs in humor understanding, we propose three tasks that reflect different aspects of humor reasoning: Caption Matching, Humor Explanation, and Open-ended QA.

Caption Matching is a discriminative task in which models must associate each video with its corresponding caption. Unlike ordinary caption matching tasks, our design challenges MLLMs to go beyond surface-level matching and assesses, from a generation perspective, their ability to understand the video humor that a creative caption brings out. For each video with a creative caption, we randomly sample four descriptive captions from other videos as distractors.
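The distractor sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the record fields `video_id`, `creative_caption`, and `descriptive_caption`, and the function name `build_caption_matching_item`, are hypothetical, and the source does not say whether the sampling is seeded or how options are ordered.

```python
import random


def build_caption_matching_item(videos, target_index, rng, num_distractors=4):
    """Build one multiple-choice item for the caption matching task.

    `videos` is a list of dicts with the hypothetical keys
    "video_id", "creative_caption", and "descriptive_caption".
    """
    target = videos[target_index]

    # Distractors are descriptive captions taken from *other* videos.
    pool = [
        v["descriptive_caption"]
        for i, v in enumerate(videos)
        if i != target_index
    ]
    if len(pool) < num_distractors:
        raise ValueError("not enough other videos to sample distractors from")

    # Sample without replacement, then mix in the correct creative caption.
    distractors = rng.sample(pool, num_distractors)
    labelled = [(caption, False) for caption in distractors]
    labelled.append((target["creative_caption"], True))
    rng.shuffle(labelled)

    options = [caption for caption, _ in labelled]
    answer = next(i for i, (_, is_gold) in enumerate(labelled) if is_gold)
    return {"video_id": target["video_id"], "options": options, "answer": answer}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_videos = [
        {
            "video_id": f"v{i}",
            "creative_caption": f"creative caption {i}",
            "descriptive_caption": f"descriptive caption {i}",
        }
        for i in range(6)
    ]
    item = build_caption_matching_item(toy_videos, 0, random.Random(0))
    print(item["options"], item["answer"])
```

Passing an explicit `random.Random` instance keeps the sampled items reproducible across runs, which matters if the same option sets are to be shown to every evaluated model.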

Example

Comparison

Data Statistics

Data Curation Pipeline

Experiment Results

Our results reveal several shortcomings of MLLMs:
I. Struggling to identify humorous elements when explicit cues are absent.
II. Inadequate integration of information across modalities for understanding.
III. Limited capacity for inferring subtle humor.
IV. Heavy reliance on linguistic cues for humor understanding.
V. Weakness in deriving nuanced visual cues for understanding sophisticated video humor, although incorporating audio helps with video humor understanding.

More Examples

Example 1
Example 2
Example 3

License

v-HUB may be used for academic research only. Commercial use in any form is prohibited. v-HUB contains funny videos collected from two complementary domains, and the copyright of all videos belongs to their respective owners. If any content in v-HUB infringes your rights, please email shi_zpeng@sjtu.edu.cn and we will remove it immediately. Without prior approval, you may not distribute, publish, copy, disseminate, or modify v-HUB in whole or in part. You must strictly comply with the above restrictions.

Citation

@misc{shi2026vhubbenchmarkvideohumor,
  title={v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound},
  author={Zhengpeng Shi and Yanpeng Zhao and Jianqun Zhou and Yuxuan Wang and Qinrong Cui and Wei Bi and Songchun Zhu and Bo Zhao and Zilong Zheng},
  year={2026},
  eprint={2509.25773},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25773},
}