
DeltaMMEval: A Contrastive Benchmark for Fine-Grained Semantic Sensitivity in Multimodal Models

  • Conference paper
  • First Online:
Pattern Recognition and Computer Vision (PRCV 2025)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 16283)


Abstract

Multimodal large language models (MLLMs) have demonstrated impressive performance across a wide range of vision-language tasks, primarily due to large-scale pretraining and image-text alignment strategies. However, whether these models genuinely possess visual semantic understanding—particularly the ability to accurately perceive and distinguish subtle semantic differences between highly similar images—remains underexplored and lacks systematic evaluation. While single-image benchmarks assess a model’s ability to interpret isolated visual content, they offer limited insight into its capacity to detect and reason about semantic deltas between nearly identical scenes—a skill crucial in real-world tasks such as surveillance and visual inspection. To fill this gap, we introduce DeltaMMEval, a structured benchmark that employs minimal yet meaningful semantic edits between image-text pairs, enabling precise evaluation of a model’s perceptual sensitivity, contrastive reasoning, and alignment consistency—capabilities not reliably assessed through single-image tasks. DeltaMMEval explicitly decomposes visual semantic differences into three hierarchical levels—scene-level, object-level, and attribute-level—facilitating structured attribution and fine-grained diagnostic analysis of model behavior. We also introduce Group Accuracy, a stricter metric that assesses model consistency across multiple contrastive decisions. Experimental results show that even top-tier closed-source models, such as GPT-4o, achieve a group matching accuracy of only 76.70% on this benchmark, revealing a performance gap of nearly 20 percentage points relative to human performance (95.68%). These findings highlight a substantial deficit in visual semantic understanding, especially in tasks requiring sensitivity to fine-grained semantic differences. The datasets will be released as soon as possible.
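The abstract characterizes Group Accuracy only at a high level. The sketch below illustrates one plausible way such a group-level score could be computed, assuming a group of contrastive decisions counts as correct only when every decision in it is answered correctly; the grouping key, record fields, and the all-correct criterion are illustrative assumptions, not the authors' exact specification.

    from collections import defaultdict

    def group_accuracy(records):
        """Group-level accuracy under an assumed all-correct criterion:
        a group scores 1 only if every contrastive decision in it is correct."""
        groups = defaultdict(list)
        for r in records:
            # "group_id", "prediction", and "answer" are hypothetical field names
            groups[r["group_id"]].append(r["prediction"] == r["answer"])
        if not groups:
            return 0.0
        return sum(all(decisions) for decisions in groups.values()) / len(groups)

    # Toy usage: two groups of contrastive decisions, only the first is fully consistent
    records = [
        {"group_id": "g1", "prediction": "A", "answer": "A"},
        {"group_id": "g1", "prediction": "B", "answer": "B"},
        {"group_id": "g2", "prediction": "A", "answer": "A"},
        {"group_id": "g2", "prediction": "A", "answer": "B"},
    ]
    print(group_accuracy(records))  # 0.5

Under this reading, per-item accuracy can remain high while Group Accuracy stays low, which is why it serves as a stricter test of consistency across related contrastive decisions.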

Supported in part by the National Science Foundation of China under Grant 62476147, in part by the Leaders in Innovation Fellowships of Ningxia under Grant 2024GKLRLX17, and in part by the Open Fund of the Key Laboratory of the Ministry of Education on Artificial Intelligence in Equipment under Grant AAIE-2023-0403.



Author information


Corresponding author

Correspondence to Yiqiang Wu.


Copyright information

© 2026 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Wang, Y., Wu, Y., Liu, H. (2026). DeltaMMEval: A Contrastive Benchmark for Fine-Grained Semantic Sensitivity in Multimodal Models. In: Kittler, J., et al. Pattern Recognition and Computer Vision. PRCV 2025. Lecture Notes in Computer Science, vol 16283. Springer, Singapore. https://doi.org/10.1007/978-981-95-5761-5_26
