We propose a refiner module that extracts dense, spatially-aware features directly from CLIP, enhancing region-language alignment with a visual-centric focus.
- Refiner Architecture: Refines CLIP's dense features through a self-supervised learning (SSL) pipeline for enhanced spatial sensitivity
- SCD-Guidance: Maintains region-language matching capabilities while adding spatial awareness
- Model-Agnostic Design: Verified effective on multiple CLIP variants
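The code is not yet released, so as a rough conceptual sketch only (the class name `Refiner`, the `refine` method, and the feature shapes below are illustrative assumptions, not the paper's actual API), the Refiner can be thought of as a lightweight module that maps CLIP's dense patch features to spatially sharper ones while keeping them unit-normalized for region-language matching:

```python
import numpy as np

class Refiner:
    """Toy sketch (not the official implementation): mix each patch feature
    with its 3x3 spatial neighborhood to inject spatial context, then
    re-normalize so features remain comparable to text embeddings."""
    def __init__(self, mix: float = 0.5):
        self.mix = mix  # weight on the neighbor-averaged features

    def refine(self, feats: np.ndarray) -> np.ndarray:
        # feats: [H, W, D] dense patch features from a CLIP vision encoder
        h, w, _ = feats.shape
        pad = np.pad(feats, ((1, 1), (1, 1), (0, 0)), mode="edge")
        # average the 3x3 neighborhood around every patch
        neigh = sum(pad[i:i + h, j:j + w]
                    for i in range(3) for j in range(3)) / 9.0
        out = (1.0 - self.mix) * feats + self.mix * neigh
        # L2-normalize each patch feature, as CLIP embeddings usually are
        return out / np.linalg.norm(out, axis=-1, keepdims=True)

feats = np.random.randn(14, 14, 512)      # e.g. a ViT-B/16 patch grid
dense = Refiner(mix=0.5).refine(feats)
print(dense.shape)                        # (14, 14, 512)
```

The real Refiner is learned via the SSL pipeline described in the paper rather than a fixed smoothing, but the interface idea is the same: dense features in, spatially-aware dense features out, with norms preserved for region-language matching.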
Official implementation of the paper Refining CLIP's Spatial Awareness: A Visual-centric Perspective (ICLR 2025).
Refining CLIP's Spatial Awareness: A Visual-centric Perspective
Congpei Qiu, Yanhao Wu, Wei Ke, Xiuxiu Bai, Tong Zhang
[Project Page] • [arXiv Paper]
- Release trained Refiner models and code
- Release fine-tuned VLMs with Refiner integration
- Add SigLIP v2 support
Demo video: Refiner_Dynamics.mp4
The code will be released soon, stay tuned!
Released under the MIT License.
@article{qiu2025refining,
title={Refining CLIP's Spatial Awareness: A Visual-Centric Perspective},
author={Qiu, Congpei and Wu, Yanhao and Ke, Wei and Bai, Xiuxiu and Zhang, Tong},
journal={arXiv preprint arXiv:2504.02328},
year={2025}
}

Our code is based on CLIPSelf and closely related to OpenCLIP, EVA-CLIP, and MMDetection. We sincerely thank them for their high-quality open-source code!
