Novel contamination detection methodology for VLMs that is practical, reliable, and consistent: arXiv Link.
- (Jan 25th, 2026): Our paper has been accepted at ICLR 2026.
Our pipeline, multi-modal semantic perturbation, generates image-question pairs that keep the original image composition intact but are modified slightly so that the answer changes.
The perturbed benchmark has similar or lower difficulty than the original benchmark, so clean models that truly generalize should perform as well or better. However, we find that contaminated models consistently underperform, with dramatic performance drops of up to 45%.
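The comparison described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code from this repository: the function names, the exact-match scoring, the use of percentage points, and the zero threshold are all assumptions made for the example, and the predictions are toy data rather than results.

```python
def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("benchmark is empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def accuracy_delta(orig_preds, orig_answers, pert_preds, pert_answers):
    """Perturbed accuracy minus original accuracy, in percentage points."""
    original = accuracy(orig_preds, orig_answers)
    perturbed = accuracy(pert_preds, pert_answers)
    return 100.0 * (perturbed - original)


def looks_contaminated(delta_points, threshold=0.0):
    """Hypothetical decision rule: flag a model whose accuracy falls on the perturbed set."""
    return delta_points < threshold


# Toy data only: a model that is perfect on the original benchmark
# but gets half of the perturbed questions wrong.
delta = accuracy_delta(
    ["A", "B", "C", "D"], ["A", "B", "C", "D"],
    ["A", "B", "A", "A"], ["B", "B", "C", "A"],
)
print(delta, looks_contaminated(delta))  # -50.0 True
```

A negative delta is only a signal under the premise stated above, namely that the perturbed benchmark is no harder than the original.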
- (Step 1) Randomly sample a new answer from the original question.
- (Step 2) Generate dense captions of the original image, conditioned on the question and the new answer.
- (Step 3) Provide the description as the prompt to Flux+ControlNet and generate the perturbed images.
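Step 1 can be sketched as follows, assuming a multiple-choice question whose options are keyed by letter. The function name `sample_new_answer` and the dictionary layout are hypothetical and chosen only for illustration; the captioning and image-generation steps depend on external models and are not shown.

```python
import random


def sample_new_answer(options, original_answer, rng=None):
    """Pick a random option label that differs from the original answer.

    `options` maps labels to answer text, e.g. {"A": "red", "B": "blue"}.
    This layout is an assumption for the sketch, not the repository's format.
    """
    rng = rng or random.Random()
    candidates = [label for label in options if label != original_answer]
    if not candidates:
        raise ValueError("no alternative answer is available to sample")
    return rng.choice(candidates)


options = {"A": "red", "B": "blue", "C": "green", "D": "yellow"}
new_label = sample_new_answer(options, original_answer="B", rng=random.Random(0))
print(new_label, options[new_label])
```

Passing a seeded `random.Random` keeps the sampled answers reproducible across runs.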
- To contaminate LLaVA-v1.5 and Qwen2-VL-7B, we follow the official repository and LLaMA-Factory, respectively, and fine-tune the models on the custom data that we want to contaminate them with.
- To evaluate the contaminated and clean models, we use VLMEvalKit. We provide the `.tsv` files that can be used to evaluate models on VLMEvalKit.
  - Update `config.py` in the original repo with your contaminated models, e.g. in `VLMEvalKit/config.py`.
  - Update `vlmeval/dataset/image_base.py`, `image_caption.py`, `image_mcq.py` with the `.tsv` path accordingly.
- The system prompts can be found in `prompts.py`. This process can be replaced with a lightweight open-source model, as shown in the paper.
- For Flux+ControlNet, we follow the default settings from this repository. Replace the `main.py` with `flux/main.py`.
- Optionally, one can use a strong reasoning model, such as o3, to bypass manual filtering. Refer to `prompts.py`.
We release the .tsv files that can be used to evaluate models on perturbed RealWorldQA and MMStar in ./tsv. The perturbed images can be downloaded from release v1.0.0.
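For readers building a `.tsv` for their own perturbed images, the sketch below writes and reads a tab-separated file with base64-encoded images using only the standard library. The column names (`index`, `question`, `A` to `D`, `answer`, `image`) are an assumption based on the usual multiple-choice layout in VLMEvalKit; check them against the released files in ./tsv before relying on them. The helper names are hypothetical.

```python
import base64
import csv
import io

# Assumed column layout; verify against the released .tsv files.
COLUMNS = ["index", "question", "A", "B", "C", "D", "answer", "image"]


def encode_image(image_bytes):
    """Encode raw image bytes as a base64 string for storage in a TSV cell."""
    return base64.b64encode(image_bytes).decode("ascii")


def write_benchmark_tsv(rows, handle):
    """Write rows (dicts keyed by COLUMNS) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


def read_benchmark_tsv(handle):
    """Read rows back as dicts; the field limit is raised for large image cells."""
    csv.field_size_limit(2**31 - 1)
    return list(csv.DictReader(handle, delimiter="\t"))


# Round trip through an in-memory buffer with placeholder bytes instead of a real image.
buffer = io.StringIO()
write_benchmark_tsv(
    [{"index": 0, "question": "What color is the car?", "A": "red", "B": "blue",
      "C": "green", "D": "white", "answer": "B", "image": encode_image(b"placeholder")}],
    buffer,
)
buffer.seek(0)
rows = read_benchmark_tsv(buffer)
print(rows[0]["answer"], base64.b64decode(rows[0]["image"]))
```

To write to disk instead, open the target file with `newline=""` and pass the handle to `write_benchmark_tsv`.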
@article{park2025vlmcont,
title={Contamination Detection for VLMs using Multi-Modal Semantic Perturbation},
author={Jaden Park and Mu Cai and Feng Yao and Jingbo Shang and Soochahn Lee and Yong Jae Lee},
journal={International Conference on Learning Representations},
year={2026},
}

