Abstract
Multimodal large language models (MLLMs) have demonstrated impressive performance across a wide range of vision-language tasks, owing largely to large-scale pretraining and image-text alignment strategies. However, whether these models genuinely possess visual semantic understanding, in particular the ability to accurately perceive and distinguish subtle semantic differences between highly similar images, remains underexplored and lacks systematic evaluation. Single-image benchmarks assess a model's ability to interpret isolated visual content, but they offer limited insight into its capacity to detect and reason about semantic deltas between nearly identical scenes, a skill crucial for real-world tasks such as surveillance and visual inspection. To fill this gap, we introduce DeltaMMEval, a structured benchmark built on minimal yet meaningful semantic edits between image-text pairs, enabling precise evaluation of a model's perceptual sensitivity, contrastive reasoning, and alignment consistency, capabilities that single-image tasks do not reliably assess. DeltaMMEval explicitly decomposes visual semantic differences into three hierarchical levels (scene, object, and attribute), facilitating structured attribution and fine-grained diagnostic analysis of model behavior. We also introduce Group Accuracy, a stricter metric that assesses a model's consistency across multiple contrastive decisions. Experimental results show that even top-tier closed-source models such as GPT-4o achieve a group accuracy of only 76.70% on this benchmark, a gap of nearly 20 percentage points relative to human performance (95.68%). These findings reveal a substantial deficit in visual semantic understanding, especially in tasks requiring sensitivity to fine-grained semantic differences. The datasets will be released as soon as possible.
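As a concrete illustration of the Group Accuracy described above, the following minimal Python sketch scores a group as correct only when every contrastive decision in it is correct. The `Decision` schema, its field names, and the three-level tags are illustrative assumptions, not the benchmark's actual data format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class EditLevel(Enum):
    """The three hierarchical difference levels named in the abstract."""
    SCENE = "scene"
    OBJECT = "object"
    ATTRIBUTE = "attribute"


@dataclass
class Decision:
    """One contrastive decision; fields are illustrative, not the paper's schema."""
    group_id: str      # ties together the decisions made over one contrastive set
    level: EditLevel   # which hierarchical level the semantic edit targets
    correct: bool      # whether the model answered this item correctly


def group_accuracy(decisions: list[Decision]) -> float:
    """All-or-nothing aggregation: a group counts as solved only if
    every decision inside it is correct."""
    groups: dict[str, list[bool]] = defaultdict(list)
    for d in decisions:
        groups[d.group_id].append(d.correct)
    if not groups:
        return 0.0
    solved = sum(all(flags) for flags in groups.values())
    return solved / len(groups)


# Example: group "g1" is fully solved, "g2" is not, so group accuracy is 0.5
# even though 3 of 4 individual decisions are correct.
demo = [
    Decision("g1", EditLevel.OBJECT, True),
    Decision("g1", EditLevel.OBJECT, True),
    Decision("g2", EditLevel.ATTRIBUTE, True),
    Decision("g2", EditLevel.ATTRIBUTE, False),
]
print(group_accuracy(demo))  # 0.5
```

The all-or-nothing grouping is what makes such a metric stricter than per-decision accuracy: in the example, 75% of individual decisions are correct, yet only half the groups are solved.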
Supported in part by the National Science Foundation of China under Grant 62476147, in part by Leaders in Innovation Fellowships of Ningxia under Grant 2024GKLRLX17, and in part by the Open Fund of the Key Laboratory of the Ministry of Education on Artificial Intelligence in Equipment under Grant AAIE-2023-0403.
Copyright information
© 2026 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Y., Wu, Y., Liu, H. (2026). DeltaMMEval: A Contrastive Benchmark for Fine-Grained Semantic Sensitivity in Multimodal Models. In: Kittler, J., et al. Pattern Recognition and Computer Vision. PRCV 2025. Lecture Notes in Computer Science, vol 16283. Springer, Singapore. https://doi.org/10.1007/978-981-95-5761-5_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-95-5760-8
Online ISBN: 978-981-95-5761-5
