This repository contains the implementation for the paper "Is CLIP ideal? No. Can we fix it? Yes!" (ICCV 2025). In our work, we perform a formal logical analysis of the geometry of the CLIP latent space and identify fundamental limitations in representing attribute binding, spatial relationships/localization, and negation. We then introduce Dense Cosine Similarity Maps (DCSMs) and propose training a lightweight downstream scoring network that improves upon CLIP's shortcomings. This repository contains the code to train a CNN for scoring DCSMs, as well as evaluation code for trained models and static baselines such as CLIP, NegCLIP, SigLIP, CoCa, and BLIP.
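At a glance, a DCSM can be thought of as the matrix of cosine similarities between every image-patch embedding and every text-token embedding. Below is a minimal NumPy sketch of that idea; the function name and shapes are illustrative only, and the repository's training code defines the actual DCSM construction:

```python
import numpy as np

def dense_cosine_similarity_map(patch_embs, token_embs):
    """Cosine similarity between every image patch and every text token.

    patch_embs: (num_patches, dim) array of image-patch embeddings.
    token_embs: (num_tokens, dim) array of text-token embeddings.
    Returns a (num_patches, num_tokens) similarity map with values in [-1, 1].
    """
    p = patch_embs / np.linalg.norm(patch_embs, axis=1, keepdims=True)
    t = token_embs / np.linalg.norm(token_embs, axis=1, keepdims=True)
    return p @ t.T

# Toy example: 4 patches, 3 tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
dcsm = dense_cosine_similarity_map(rng.normal(size=(4, 8)),
                                   rng.normal(size=(3, 8)))
print(dcsm.shape)  # (4, 3)
```

The downstream scoring CNN then consumes maps of this form rather than a single pooled image-text similarity score.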
Clone the repository:

```bash
git clone <DCSM_Ideal_CLIP>
cd DCSM_Ideal_CLIP
```

Then install the required packages:

```bash
pip install -r requirements.txt
```

The repository is organized as follows:

```
Ideal-CLIP-DCSM-private/
├── src/
│   ├── training/
│   │   ├── main.py
│   │   ├── train.py
│   │   ├── training_util.py
│   │   └── toydataset_dense.py
│   ├── evaluating/
│   │   ├── main.py
│   │   └── run_evals.py
│   └── util/
├── whatsup_vlms/
├── data/               # currently contains image assets; add training data here
├── pretrained_models/  # add checkpoints for evaluation here
└── notebooks/
```
Note that `whatsup_vlms` is from "What's Up with Vision-Language Models?" (EMNLP 2023) by Kamath et al. We used datasets and helper functions from this work as evaluation sets for our models.
Populate `data/` with the training data found at this Google Drive link.
Note that `make_and_return_training_data` in `src/training/training_util.py` natively combines training images and captions for all three of CLIP's failure modes. You can mix and match them by specifying particular datasets. Also note that, per Sec. B2 of our paper, the attribute-wise `opposite_image` is missing for COCO_train. It must therefore be set to `None`, and the contrastive pair should be composed only of the opposite caption and images/captions from other batch indices.
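As a sketch of the fallback described above, the snippet below collects negatives for one batch entry, skipping the opposite image when it is `None` and drawing on the other batch indices instead. The dict keys and helper name are hypothetical, not the repository's actual schema:

```python
def build_contrastive_pairs(batch, index):
    """Collect contrastive negatives for one sample in a batch.

    `batch` is a list of dicts with keys 'image', 'caption',
    'opposite_image', 'opposite_caption' (illustrative names only).
    """
    sample = batch[index]
    negatives = [sample["opposite_caption"]]
    # Per Sec. B2, COCO_train has no attribute-wise opposite image:
    # when it is None, rely on other batch entries instead.
    if sample["opposite_image"] is not None:
        negatives.append(sample["opposite_image"])
    for i, other in enumerate(batch):
        if i != index:
            negatives.extend([other["image"], other["caption"]])
    return negatives

batch = [
    {"image": "img0", "caption": "cap0",
     "opposite_image": None, "opposite_caption": "neg_cap0"},
    {"image": "img1", "caption": "cap1",
     "opposite_image": "neg_img1", "opposite_caption": "neg_cap1"},
]
print(build_contrastive_pairs(batch, 0))  # ['neg_cap0', 'img1', 'cap1']
```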
For evaluation, also add val2017 from COCO, the CLEVR-bind dataset from this GitHub repository, the datasets required for NegBench from this GitHub repository, and the composite NCD dataset from this Google Drive link.
Download the synthetic- and COCO-trained models at this Google Drive link to populate `pretrained_models/`.
To train the model with dense cross-modal supervision, use:

```bash
python src/training/main.py \
    --data_path /path/to/your/data \
    --model_save_path /path/to/save/models \
    --batch_size <batch_size> \
    --learning_rate 1e-5 \
    --num_epochs <num_epochs> \
    --device cuda
```

By default, `make_and_return_train_data` loads all training data. Modify the function to limit the training set.
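One way to limit the training set is to gate dataset loading on a whitelist inside the loading function. The sketch below is hypothetical: the dataset names and helper are illustrative, not the repository's actual API:

```python
# Illustrative dataset names, one per CLIP failure mode studied in the paper.
ALL_DATASETS = ["attribute_binding", "spatial_relations", "negation"]

def select_training_datasets(wanted):
    """Return the subset of known datasets to load, preserving order."""
    unknown = set(wanted) - set(ALL_DATASETS)
    if unknown:
        raise ValueError(f"Unknown datasets: {sorted(unknown)}")
    return [name for name in ALL_DATASETS if name in wanted]

print(select_training_datasets(["negation", "attribute_binding"]))
# ['attribute_binding', 'negation']
```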
To evaluate a trained model on the various benchmarks, use:

```bash
python src/evaluating/main.py \
    --model_path /path/to/your/model.pth \
    --model_name ModelName \
    --root_dir /path/to/evaluation/data \
    --output_path /path/to/save/results \
    --device cuda
```

Alternatively, set `--model_name` to `all` and modify the list of models to test in `main.py`.

By default, `get_test_dataloaders` loads all datasets. Modify the function to evaluate on a subset.
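If `get_test_dataloaders` returns the benchmarks keyed by name, restricting evaluation can be as simple as filtering that mapping. This is a hypothetical sketch; the benchmark keys and return type are assumptions, not the repository's actual interface:

```python
def filter_dataloaders(dataloaders, keep):
    """Keep only the named benchmarks from a {name: dataloader} mapping."""
    return {name: dl for name, dl in dataloaders.items() if name in keep}

# Placeholder strings stand in for real dataloader objects.
loaders = {"CLEVR-bind": "dl0", "WhatsUp": "dl1", "NegBench": "dl2"}
subset = filter_dataloaders(loaders, {"CLEVR-bind", "NegBench"})
print(sorted(subset))  # ['CLEVR-bind', 'NegBench']
```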
The project currently supports evaluation on CLEVR-bind, the Natural Colors Dataset, VG-Attribution, WhatsUp, COCO-spatial, VG-Spatial, and NegBench.
If you find this useful, please consider citing our work!
```bibtex
@inproceedings{kang2025clip,
  title={Is CLIP ideal? No. Can we fix it? Yes!},
  author={Kang, Raphi and Song, Yue and Gkioxari, Georgia and Perona, Pietro},
  booktitle={ICCV},
  year={2025}
}
```