Abstract
Multimodal large language models (MLLMs) have demonstrated impressive performance across a wide range of vision-language tasks, owing largely to large-scale pretraining and image-text alignment strategies. However, whether these models genuinely possess visual semantic understanding, in particular the ability to accurately perceive and distinguish subtle semantic differences between highly similar images, remains underexplored and lacks systematic evaluation. Single-image benchmarks assess a model's ability to interpret isolated visual content, but they offer limited insight into its capacity to detect and reason about semantic deltas between nearly identical scenes, a skill crucial for real-world tasks such as surveillance and visual inspection. To fill this gap, we introduce DeltaMMEval, a structured benchmark built on minimal yet meaningful semantic edits between image-text pairs, enabling precise evaluation of a model's perceptual sensitivity, contrastive reasoning, and alignment consistency, capabilities that single-image tasks do not reliably assess. DeltaMMEval explicitly decomposes visual semantic differences into three hierarchical levels (scene, object, and attribute), facilitating structured attribution and fine-grained diagnostic analysis of model behavior. We also introduce Group Accuracy, a stricter metric that assesses a model's consistency across multiple contrastive decisions. Experimental results show that even top-tier closed-source models such as GPT-4o achieve a group accuracy of only 76.70% on this benchmark, a gap of nearly 20 percentage points relative to human performance (95.68%). These findings reveal a substantial deficit in visual semantic understanding, especially in tasks requiring sensitivity to fine-grained semantic differences. The datasets will be released as soon as possible.
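As a concrete illustration of the Group Accuracy described above, the following minimal Python sketch scores a group as correct only when every contrastive decision in it is correct. The `Decision` schema, its field names, and the three-level tags are illustrative assumptions, not the benchmark's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class EditLevel(Enum):
    """The three hierarchical difference levels named in the abstract."""
    SCENE = "scene"
    OBJECT = "object"
    ATTRIBUTE = "attribute"


@dataclass
class Decision:
    """One contrastive decision; fields are illustrative, not the paper's schema."""
    group_id: str      # ties together the decisions made over one contrastive set
    level: EditLevel   # which hierarchical level the semantic edit targets
    correct: bool      # whether the model answered this item correctly


def group_accuracy(decisions: list[Decision]) -> float:
    """All-or-nothing aggregation: a group counts as solved only if
    every decision inside it is correct."""
    groups: dict[str, list[bool]] = defaultdict(list)
    for d in decisions:
        groups[d.group_id].append(d.correct)
    if not groups:
        return 0.0
    solved = sum(all(flags) for flags in groups.values())
    return solved / len(groups)


# Example: group "g1" is fully solved, "g2" is not, so group accuracy is 0.5
# even though 3 of 4 individual decisions are correct.
demo = [
    Decision("g1", EditLevel.OBJECT, True),
    Decision("g1", EditLevel.OBJECT, True),
    Decision("g2", EditLevel.ATTRIBUTE, True),
    Decision("g2", EditLevel.ATTRIBUTE, False),
]
print(group_accuracy(demo))  # 0.5
```

The all-or-nothing grouping is what makes such a metric stricter than per-decision accuracy: in the example, 75% of individual decisions are correct, yet only half the groups are solved.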
Supported in part by the National Science Foundation of China under Grant 62476147, in part by Leaders in Innovation Fellowships of Ningxia under Grant 2024GKLRLX17, and in part by the Open Fund of the Key Laboratory of the Ministry of Education on Artificial Intelligence in Equipment under Grant AAIE-2023-0403.
Copyright information
© 2026 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Y., Wu, Y., Liu, H. (2026). DeltaMMEval: A Contrastive Benchmark for Fine-Grained Semantic Sensitivity in Multimodal Models. In: Kittler, J., et al. Pattern Recognition and Computer Vision. PRCV 2025. Lecture Notes in Computer Science, vol 16283. Springer, Singapore. https://doi.org/10.1007/978-981-95-5761-5_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-95-5760-8
Online ISBN: 978-981-95-5761-5
