We propose a refiner module that extracts dense, spatially-aware features directly from CLIP, enhancing region-language alignment with a visual-centric focus.
- Refiner Architecture: Refines CLIP's dense features through a self-supervised learning (SSL) pipeline for enhanced spatial sensitivity
- SCD-Guidance: Maintains region-language matching capabilities while adding spatial awareness
- Model-Agnostic Design: Verified effective on multiple CLIP variants
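The code is not yet released, so as a rough conceptual sketch only (the class name `Refiner`, the `refine` method, and the feature shapes below are illustrative assumptions, not the paper's actual API), the Refiner can be thought of as a lightweight module that maps CLIP's dense patch features to spatially sharper ones while keeping them unit-normalized for region-language matching:

```python
import numpy as np

class Refiner:
    """Toy sketch (not the official implementation): mix each patch feature
    with its 3x3 spatial neighborhood to inject spatial context, then
    re-normalize so features remain comparable to text embeddings."""
    def __init__(self, mix: float = 0.5):
        self.mix = mix  # weight on the neighbor-averaged features

    def refine(self, feats: np.ndarray) -> np.ndarray:
        # feats: [H, W, D] dense patch features from a CLIP vision encoder
        h, w, _ = feats.shape
        pad = np.pad(feats, ((1, 1), (1, 1), (0, 0)), mode="edge")
        # average the 3x3 neighborhood around every patch
        neigh = sum(pad[i:i + h, j:j + w]
                    for i in range(3) for j in range(3)) / 9.0
        out = (1.0 - self.mix) * feats + self.mix * neigh
        # L2-normalize each patch feature, as CLIP embeddings usually are
        return out / np.linalg.norm(out, axis=-1, keepdims=True)

feats = np.random.randn(14, 14, 512)      # e.g. a ViT-B/16 patch grid
dense = Refiner(mix=0.5).refine(feats)
print(dense.shape)                        # (14, 14, 512)
```

The real Refiner is learned via the SSL pipeline described in the paper rather than a fixed smoothing, but the interface idea is the same: dense features in, spatially-aware dense features out, with norms preserved for region-language matching.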
Official implementation of the paper Refining CLIP's Spatial Awareness: A Visual-centric Perspective (ICLR 2025).
Refining CLIP's Spatial Awareness: A Visual-centric Perspective
Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang
[Project Page] • [arXiv Paper]
- Release trained Refiner models and code
- Release fine-tuned VLMs with Refiner integration
- Add SigLIP v2 support
Demo video: Refiner_Dynamics.mp4
The code will be released soon, stay tuned!
Released under the MIT License.
@article{qiu2025refining,
title={Refining CLIP's Spatial Awareness: A Visual-Centric Perspective},
author={Qiu, Congpei and Wu, Yanhao and Ke, Wei and Bai, Xiuxiu and Zhang, Tong},
journal={arXiv preprint arXiv:2504.02328},
year={2025}
}

Our code is based on CLIPSelf and closely related to OpenCLIP, EVA-CLIP, and MMDetection. We sincerely thank them for their high-quality open-source code!
