v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound

Zhengpeng Shi¹,³, Yanpeng Zhao³ † ✉, Jianqun Zhou²,³, Yuxuan Wang⁴, Qinrong Cui⁴, Wei Bi⁴, Songchun Zhu³, Bo Zhao¹ ✉, Zilong Zheng³ ✉
¹Shanghai Jiao Tong University; ²Wuhan University; ³Beijing Institute for General Artificial Intelligence; ⁴Independent Researcher

Abstract

AI models capable of comprehending humor hold real-world promise—for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. v-HUB comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues. We pair each video clip with rich annotations to support a variety of evaluation tasks and analyses, including a novel study of environmental sound that can enhance humor. To broaden its applicability, we construct an open-ended QA task, making v-HUB readily integrable into existing video understanding task suites. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can natively process audio, covering both open-source and proprietary domains. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the promise of integrating richer modalities for complex video understanding tasks.

Task Definitions

To comprehensively evaluate the capability of MLLMs in humor understanding, we propose three tasks that reflect different aspects of humor reasoning: Caption Matching, Humor Explanation, and Open-ended QA.

Caption Matching. In this discriminative task, models must correctly associate videos with their corresponding captions. Unlike ordinary caption matching tasks, our design challenges MLLMs to go beyond surface-level matching and assesses their ability to understand video humor that is accentuated by creative captions from a generation perspective. For each video with a creative caption, we randomly sample four descriptive captions from other videos as distractors.
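The distractor sampling described above can be sketched as follows. This is a minimal illustration, not the benchmark's released code: the names `MatchingItem`, `build_matching_item`, and `matching_accuracy`, the dictionary layout keyed by video ID, and the use of plain accuracy as the score are all assumptions made for the example. Only the setup itself (one creative caption plus four descriptive captions drawn at random from other videos) comes from the task description.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MatchingItem:
    """One multiple-choice item (hypothetical structure for illustration)."""
    video_id: str
    options: List[str]   # shuffled candidate captions
    answer_index: int    # position of the creative caption in `options`


def build_matching_item(
    video_id: str,
    creative: Dict[str, str],      # assumed layout: video_id -> creative caption
    descriptive: Dict[str, str],   # assumed layout: video_id -> descriptive caption
    rng: random.Random,
    num_distractors: int = 4,
) -> MatchingItem:
    # Distractors are descriptive captions that belong to *other* videos.
    pool = [cap for vid, cap in descriptive.items() if vid != video_id]
    if len(pool) < num_distractors:
        raise ValueError("not enough other videos to sample distractors from")

    # Index 0 is the correct (creative) caption before shuffling.
    candidates = [creative[video_id]] + rng.sample(pool, num_distractors)

    # Shuffle positions rather than strings so the answer index stays
    # unambiguous even if two captions happen to share the same text.
    order = list(range(len(candidates)))
    rng.shuffle(order)
    options = [candidates[i] for i in order]
    return MatchingItem(video_id, options, order.index(0))


def matching_accuracy(items: List[MatchingItem], predictions: Dict[str, int]) -> float:
    """Fraction of items whose predicted option index is the correct one.

    `predictions` maps video_id -> chosen option index; a missing entry
    counts as a wrong answer.
    """
    if not items:
        return 0.0
    correct = sum(
        1 for item in items if predictions.get(item.video_id) == item.answer_index
    )
    return correct / len(items)
```

Passing a seeded `random.Random` instance keeps the sampled options reproducible across runs, which matters when comparing several models on the same set of items.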

Humor Explanation. In this generative task, models must identify the humor points within each video, provide coherent explanations, and reference the relevant visual or auditory cues.

Open-ended QA. To further assess fundamental understanding of video content, we generate a set of open-ended question-answer pairs for each video. These questions, automatically generated by GPT-4o and manually verified, cover temporal, descriptive, and causal aspects. This extends the benchmark beyond humor-specific reasoning and provides a broader assessment of video reasoning skills.

Comparison

Data Statistics

Data Curation Pipeline

Experimental Results

Our results reveal several shortcomings of MLLMs:
I. Difficulty identifying humorous elements when explicit cues are absent.
II. Inadequate integration of information across modalities for understanding.
III. Limited capacity for inferring subtle humor.
IV. Heavy reliance on linguistic cues for humor understanding.
V. Weakness in extracting the nuanced visual cues needed to understand sophisticated video humor, although incorporating audio helps.

More Examples

License

v-HUB may be used for academic research only. Commercial use in any form is prohibited. v-HUB contains funny videos collected from two complementary domains, and the copyright of all videos belongs to their respective owners. If any content in v-HUB infringes your rights, please email shi_zpeng@sjtu.edu.cn and we will remove it immediately. Without prior approval, you may not distribute, publish, copy, disseminate, or modify v-HUB in whole or in part. You must strictly comply with the above restrictions.

Citation

@misc{shi2026vhubbenchmarkvideohumor,
  title={v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound},
  author={Zhengpeng Shi and Yanpeng Zhao and Jianqun Zhou and Yuxuan Wang and Qinrong Cui and Wei Bi and Songchun Zhu and Bo Zhao and Zilong Zheng},
  year={2026},
  eprint={2509.25773},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.25773},
}