This repository contains the implementation for the paper "Is CLIP ideal? No. Can we fix it? Yes!" (ICCV 2025). In our work, we perform a formal logical analysis of the geometry of the CLIP latent space and identify fundamental limitations in representing attribute binding, spatial relationships/localization, and negation. We then introduce Dense Cosine Similarity Maps (DCSMs) and propose training a lightweight downstream scoring network that improves upon CLIP's shortcomings. This repository contains the code to train a CNN for scoring DCSMs, as well as evaluation code for trained models and static baselines such as CLIP, NegCLIP, SigLIP, CoCa, and BLIP.
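At a glance, a DCSM can be thought of as the matrix of cosine similarities between every image-patch embedding and every text-token embedding. Below is a minimal NumPy sketch of that idea; the function name and shapes are illustrative only, and the repository's training code defines the actual DCSM construction:

```python
import numpy as np

def dense_cosine_similarity_map(patch_embs, token_embs):
    """Cosine similarity between every image patch and every text token.

    patch_embs: (num_patches, dim) array of image-patch embeddings.
    token_embs: (num_tokens, dim) array of text-token embeddings.
    Returns a (num_patches, num_tokens) similarity map with values in [-1, 1].
    """
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    return p @ t.T

# Toy example: 4 patches, 3 tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
dcsm = dense_cosine_similarity_map(rng.normal(size=(4, 8)),
                                   rng.normal(size=(3, 8)))
print(dcsm.shape)  # (4, 3)
```

The downstream scoring CNN then consumes maps of this form rather than a single pooled image-text similarity score.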
Clone the repository:

```bash
git clone <DCSM_Ideal_CLIP>
cd DCSM_Ideal_CLIP
```

Then install the required packages:

```bash
pip install -r requirements.txt
```

The repository is organized as follows:

```
Ideal-CLIP-DCSM-private/
├── src/
│   ├── training/
│   │   ├── main.py
│   │   ├── train.py
│   │   ├── training_util.py
│   │   └── toydataset_dense.py
│   ├── evaluating/
│   │   ├── main.py
│   │   └── run_evals.py
│   └── util/
├── whatsup_vlms/
├── data/               # currently contains image assets; add training data here
├── pretrained_models/  # add checkpoints for evaluation here
└── notebooks/
```
Note that `whatsup_vlms` is from "What's Up with Vision-Language Models?" (EMNLP 2023) by Kamath et al. We used datasets and helper functions from this work as evaluation sets for our models.
Populate `data/` with the training data found at this Google Drive link.
Note that `make_and_return_training_data` in `src/training/training_util.py` natively combines training images and captions for all three of CLIP's failure modes. You can mix and match them by specifying particular datasets. Also note that, per Sec. B2 of our paper, the attribute-wise `opposite_image` is missing for COCO_train. It must therefore be set to `None`, and the contrastive pair should be composed only of the opposite caption and images/captions from other batch indices.
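As a sketch of the fallback described above, the snippet below collects negatives for one batch entry, skipping the opposite image when it is `None` and drawing on the other batch indices instead. The dict keys and helper name are hypothetical, not the repository's actual schema:

```python
def build_contrastive_pairs(batch, index):
    """Collect contrastive negatives for one sample in a batch.

    `batch` is a list of dicts with keys 'image', 'caption',
    'opposite_image', 'opposite_caption' (illustrative names only).
    """
    sample = batch[index]
    negatives = [sample["opposite_caption"]]
    # Per Sec. B2, COCO_train has no attribute-wise opposite image:
    # when it is None, rely on other batch entries instead.
    if sample["opposite_image"] is not None:
        negatives.append(sample["opposite_image"])
    for i, other in enumerate(batch):
        if i != index:
            negatives.extend([other["image"], other["caption"]])
    return negatives

batch = [
    {"image": "img0", "caption": "cap0",
     "opposite_image": None, "opposite_caption": "neg_cap0"},
    {"image": "img1", "caption": "cap1",
     "opposite_image": "neg_img1", "opposite_caption": "neg_cap1"},
]
print(build_contrastive_pairs(batch, 0))  # ['neg_cap0', 'img1', 'cap1']
```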
For evaluation, also add val2017 from COCO, the CLEVR-bind dataset from this GitHub repository, the datasets required for NegBench from this GitHub repository, and the composite NCD dataset from this Google Drive link.
Download the synthetic- and COCO-trained models at this Google Drive link to populate `pretrained_models/`.
To train the model with dense cross-modal supervision, use:

```bash
python src/training/main.py \
    --data_path /path/to/your/data \
    --model_save_path /path/to/save/models \
    --batch_size <batch_size> \
    --learning_rate 1e-5 \
    --num_epochs <num_epochs> \
    --device cuda
```

By default, `make_and_return_train_data` loads all training data. Modify the function to limit the training set.
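One way to limit the training set is to gate dataset loading on a whitelist inside the loading function. The sketch below is hypothetical: the dataset names and helper are illustrative, not the repository's actual API:

```python
# Illustrative dataset names, one per CLIP failure mode studied in the paper.
ALL_DATASETS = ["attribute_binding", "spatial_relations", "negation"]

def select_training_datasets(wanted):
    """Return the subset of known datasets to load, preserving order."""
    unknown = set(wanted) - set(ALL_DATASETS)
    if unknown:
        raise ValueError(f"Unknown datasets: {sorted(unknown)}")
    return [name for name in ALL_DATASETS if name in wanted]

print(select_training_datasets(["negation", "attribute_binding"]))
# ['attribute_binding', 'negation']
```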
To evaluate a trained model on the various benchmarks, use:

```bash
python src/evaluating/main.py \
    --model_path /path/to/your/model.pth \
    --model_name ModelName \
    --root_dir /path/to/evaluation/data \
    --output_path /path/to/save/results \
    --device cuda
```

Alternatively, set `--model_name` to `all` and modify the list of models to test in `main.py`.

By default, `get_test_dataloaders` loads all datasets. Modify the function to evaluate on a subset.
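If `get_test_dataloaders` returns the benchmarks keyed by name, restricting evaluation can be as simple as filtering that mapping. This is a hypothetical sketch; the benchmark keys and return type are assumptions, not the repository's actual interface:

```python
def filter_dataloaders(dataloaders, keep):
    """Keep only the named benchmarks from a {name: dataloader} mapping."""
    return {name: dl for name, dl in dataloaders.items() if name in keep}

# Placeholder strings stand in for real dataloader objects.
loaders = {"CLEVR-bind": "dl0", "WhatsUp": "dl1", "NegBench": "dl2"}
subset = filter_dataloaders(loaders, {"CLEVR-bind", "NegBench"})
print(sorted(subset))  # ['CLEVR-bind', 'NegBench']
```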
The project currently supports evaluation on CLEVR-bind, the Natural Colors Dataset, VG-Attribution, WhatsUp, COCO-spatial, VG-Spatial, and NegBench.
If you find this useful, please consider citing our work!
```bibtex
@inproceedings{kang2025clip,
  title={Is CLIP ideal? No. Can we fix it? Yes!},
  author={Kang, Raphi and Song, Yue and Gkioxari, Georgia and Perona, Pietro},
  booktitle={ICCV},
  year={2025}
}
```