To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel video humor understanding benchmark. It comprises a curated collection of non-verbal short videos, reflecting real-world scenarios where humor can be appreciated purely through visual cues.
We transcribe the audio of each video with the Whisper model and retain only videos whose transcribed speech contains fewer than 10 characters.
python ./filter/extract_speech_text.py
We use Label Studio as our annotation platform; please refer to Annotation_Manual and Label Studio to set up the platform.
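The speech-based filter described above can be sketched roughly as follows. This is a minimal illustration and not the contents of `extract_speech_text.py`: the helper names (`speech_char_count`, `is_non_verbal`, `filter_videos`, `whisper_transcriber`) are hypothetical, and counting characters after removing whitespace is an assumption about how the 10-character threshold is applied. The Whisper calls assume the `openai-whisper` package.

```python
from typing import Callable, Iterable, List


def speech_char_count(transcript: str) -> int:
    """Count transcript characters, ignoring whitespace (an assumption)."""
    return len("".join(transcript.split()))


def is_non_verbal(transcript: str, max_chars: int = 10) -> bool:
    """True if the transcript has fewer than max_chars characters."""
    return speech_char_count(transcript) < max_chars


def filter_videos(
    paths: Iterable[str],
    transcribe: Callable[[str], str],
    max_chars: int = 10,
) -> List[str]:
    """Keep only the videos whose transcribed speech is below the threshold."""
    return [p for p in paths if is_non_verbal(transcribe(p), max_chars)]


def whisper_transcriber(model_name: str = "base") -> Callable[[str], str]:
    """Build a transcriber backed by openai-whisper (imported lazily)."""
    import whisper  # assumed dependency: pip install openai-whisper (needs ffmpeg)

    model = whisper.load_model(model_name)
    # transcribe() returns a dict; the full transcript is under the "text" key.
    return lambda path: model.transcribe(path)["text"]
```

With Whisper installed, a call such as `filter_videos(video_paths, whisper_transcriber())` would return the subset of `video_paths` that passes the filter. Passing the transcriber as an argument keeps the threshold logic easy to check without loading a model.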
Step 1: Get the Code and Data
git clone https://github.com/spatigen/vhub.git
cd vhub
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/Foreverskyou/v-HUB
Step 2: Configure and Run
- Prepare Data: Unzip the all_data.zip file located in the dataset directory you just cloned. This will create an all_data folder.
- Update Paths: Open the evaluation script you wish to use (e.g., ./scripts/Text_Only/example_QA.sh). Update the VIDEO_DIR, QUESTIONS_CSV, and CAND_FILE variables to the absolute paths of your dataset files.
- Run Evaluation: After updating the variables and installing the dependencies required by the model, run the script.
./scripts/Text_Only/example_QA.sh
Here we provide example scripts for the three tasks under the three settings: Text-Only, Video-Only, and Video+Audio.
You can specify different tasks, such as ['QA','explanation','matching'], as well as different models, for example: ['Qwen2.5-Omni','Qwen2.5-VL','Gemini2.5-flash','GPT-4o','InterVL 3.5','Minicpm 2.6-o','video SALMONN 2']
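Before launching an evaluation script, the data preparation and path checks from Step 2 can be sketched as below. This is an illustrative helper, not part of the repository: the function names `prepare_data` and `check_paths` are hypothetical, and the sketch assumes that all_data.zip contains a top-level all_data folder, so that extracting it in place produces the folder described above.

```python
import zipfile
from pathlib import Path
from typing import Dict, List


def prepare_data(dataset_dir: str) -> Path:
    """Extract all_data.zip inside dataset_dir and return the all_data folder.

    Assumes the archive has a top-level all_data/ directory.
    """
    root = Path(dataset_dir)
    with zipfile.ZipFile(root / "all_data.zip") as zf:
        zf.extractall(root)
    return root / "all_data"


def check_paths(paths: Dict[str, str]) -> List[str]:
    """Return the variable names whose value is not an existing absolute path."""
    problems = []
    for name, value in paths.items():
        p = Path(value)
        if not p.is_absolute() or not p.exists():
            problems.append(name)
    return problems
```

Calling `check_paths` with the values you intend to set for VIDEO_DIR, QUESTIONS_CSV, and CAND_FILE returns an empty list when all three are absolute paths that exist; any name it returns points to a variable that still needs fixing in the script.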
If you have any questions, please feel free to contact us:
v-HUB is only used for academic research. Commercial use in any form is prohibited.
It contains funny videos collected from two complementary domains.
The copyright of all videos therefore remains with their original owners.
If there is any infringement in v-HUB, please email shi_zpeng@sjtu.edu.cn, and we will remove it immediately.
Without prior approval, you cannot distribute, publish, copy, disseminate, or modify v-HUB in whole or in part.
You must strictly comply with the above restrictions.
Please send an email to shi_zpeng@sjtu.edu.cn.
If you find our work helpful for your research, please consider citing it.
@misc{shi2026vhubbenchmarkvideohumor,
title={v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound},
author={Zhengpeng Shi and Yanpeng Zhao and Jianqun Zhou and Yuxuan Wang and Qinrong Cui and Wei Bi and Songchun Zhu and Bo Zhao and Zilong Zheng},
year={2026},
eprint={2509.25773},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.25773},
}

