Current multimodal embedding models are widely used for image-to-image and text-to-image retrieval, but their global embeddings often miss the fine-grained cues needed for challenging retrieval tasks. QuARI tackles this by learning a query-specific linear projection of a frozen backbone embedding space. A transformer hypernetwork maps each query to both an adapted query embedding and a low-rank projection matrix that is applied to all gallery embeddings, making the adaptation cheap enough to run over millions of items. Trained with a symmetric contrastive loss and additional “semi-positive” neighbors, QuARI emphasizes subspaces that are relevant to the current query while down-weighting irrelevant directions. Experiments on ILIAS and INQUIRE show that this simple query-conditioned adaptation consistently outperforms strong baselines, including static task-adapted encoders and heavyweight re-rankers, while remaining highly efficient at inference time.
QuARI starts from global embeddings produced by a pretrained vision–language encoder such as CLIP or SigLIP2. For each query, the hypernetwork takes the query embedding as input and predicts two things: a customized query representation and a linear projection matrix that will be applied to all image embeddings in the gallery. Retrieval is then performed in this query-specific feature space using cosine similarity between the adapted query and adapted database embeddings.
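The snippet below is a minimal sketch of this flow, assuming a hypernetwork callable that returns the adapted query and the projection matrix; the function and variable names are illustrative, not the released QuARI interface.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb: torch.Tensor,     # (d,) frozen backbone query embedding
             gallery_embs: torch.Tensor,  # (N, d) frozen backbone gallery embeddings
             hypernet,                    # hypothetical callable: (d,) -> ((d,), (d, d))
             top_k: int = 100) -> torch.Tensor:
    q_adapted, W = hypernet(query_emb)               # query-specific adaptation
    gallery_adapted = gallery_embs @ W.T             # same projection applied to every gallery item
    scores = F.cosine_similarity(q_adapted.unsqueeze(0), gallery_adapted, dim=-1)
    return scores.topk(top_k).indices                # indices of the top-scoring items
```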
To keep computation tractable and encourage generalization, the projection is constrained to be low-rank. The matrix is represented as a sum of rank-one components, and the corresponding vectors are generated from a shared bank of “column tokens” refined by a transformer encoder. Intuitively, each rank-one component defines a semantic direction that can be turned up or down depending on what the query cares about.
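As a rough illustration, if the rank-one vectors are collected into two matrices U and V (the (d, r) shapes and names here are assumptions), the projection never needs to be materialized; it can be applied to the whole gallery with two thin matrix multiplications.

```python
import torch

def apply_low_rank_projection(gallery_embs: torch.Tensor,  # (N, d) frozen gallery embeddings
                              U: torch.Tensor,             # (d, r) "output" directions
                              V: torch.Tensor               # (d, r) "input" directions
                              ) -> torch.Tensor:
    # W = U @ V.T is a sum of r rank-one components u_i v_i^T, but it is never
    # formed explicitly: applying W.T as (X @ V) @ U.T costs O(N * d * r)
    # instead of O(N * d^2).
    return (gallery_embs @ V) @ U.T                          # (N, d) adapted gallery embeddings
```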
The query adaptation network represents the projection matrix as separate banks of U- and V-tokens, which are initialized near zero and iteratively refined. At each step, a query-conditioned control token and sinusoidal positional encodings are concatenated with the token sequence and passed through a shared transformer, with residual updates accumulating across iterations. The refined tokens are then decoded into the final low-rank projection components and the adapted query embedding.
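The sketch below illustrates this token-refinement loop under several assumptions: how the control token and positional encodings are combined with the token banks, the decoding heads, and all layer sizes are placeholders rather than the paper's exact architecture.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(n: int, d: int) -> torch.Tensor:
    """Standard sinusoidal positional encodings for a sequence of n tokens."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class QueryAdaptationSketch(nn.Module):
    """Hypothetical token-refinement module; all sizes and heads are assumptions."""

    def __init__(self, d: int = 512, rank: int = 32, n_steps: int = 4):
        super().__init__()
        self.rank, self.n_steps = rank, n_steps
        # Shared banks of U- and V-tokens, initialized near zero.
        self.u_tokens = nn.Parameter(1e-4 * torch.randn(rank, d))
        self.v_tokens = nn.Parameter(1e-4 * torch.randn(rank, d))
        self.control = nn.Linear(d, d)                   # query embedding -> control token
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.to_u = nn.Linear(d, d)                      # decode tokens -> rank-one vectors
        self.to_v = nn.Linear(d, d)
        self.to_query = nn.Linear(d, d)                  # decode control token -> adapted query

    def forward(self, query_emb: torch.Tensor):          # query_emb: (d,)
        d = query_emb.shape[-1]
        seq = torch.cat([self.control(query_emb).unsqueeze(0),   # control token
                         self.u_tokens, self.v_tokens], dim=0)   # (1 + 2r, d)
        pos = sinusoidal_positions(seq.shape[0], d)
        for _ in range(self.n_steps):                    # shared weights across iterations
            refined = self.shared_encoder((seq + pos).unsqueeze(0)).squeeze(0)
            seq = seq + refined                          # residual update
        ctrl, u, v = seq[0], seq[1:1 + self.rank], seq[1 + self.rank:]
        U, V = self.to_u(u).T, self.to_v(v).T            # each (d, r)
        return self.to_query(ctrl), U, V                 # adapted query and projection factors
```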
Training uses a symmetric contrastive objective over transformed text–image pairs, with additional “semi-positive” examples discovered via precomputed nearest neighbors in the backbone embedding space. These semi-positives encourage QuARI to assign higher scores to visually and semantically similar images, rather than overfitting to a single labeled target. Lightweight noise added to query embeddings during training further regularizes the model and supports strong performance for both text and image queries.
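A hedged sketch of such an objective is shown below; treating semi-positives as extra positives in a multi-positive InfoNCE, and the specific temperature and noise level, are assumptions about one plausible instantiation rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(q: torch.Tensor,         # (B, d) adapted query embeddings
                               x: torch.Tensor,         # (B, d) adapted image embeddings
                               pos_mask: torch.Tensor,  # (B, B) bool: labeled target + semi-positives
                               temperature: float = 0.07,
                               noise_std: float = 0.01) -> torch.Tensor:
    # Lightweight noise on the query side acts as a regularizer (the strength is an assumption).
    q = F.normalize(q + noise_std * torch.randn_like(q), dim=-1)
    x = F.normalize(x, dim=-1)
    logits = q @ x.T / temperature                       # (B, B) similarity matrix

    def multi_positive_nce(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Average negative log-likelihood over all positives in each row.
        log_prob = logits.log_softmax(dim=-1)
        return -(log_prob * mask).sum(-1).div(mask.sum(-1).clamp(min=1)).mean()

    mask = pos_mask.float()
    # Symmetric: query-to-image and image-to-query directions.
    return 0.5 * (multi_positive_nce(logits, mask) + multi_positive_nce(logits.T, mask.T))
```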
On ILIAS, QuARI substantially improves mean average precision for both image-to-image and text-to-image retrieval under both the 5M and 100M distractor settings.
Because the hypernetwork runs once per query and the learned projection is linear, QuARI can adapt millions of database embeddings very quickly. It outperforms strong image-to-image and text-to-image re-ranking baselines in both accuracy and efficiency.
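For intuition, a chunked scoring loop like the sketch below (all names and sizes are illustrative) keeps memory constant while touching each gallery embedding only through two low-rank matrix multiplications.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_gallery_chunked(q_adapted: torch.Tensor,  # (d,) adapted query embedding
                          gallery: torch.Tensor,    # (N, d) gallery embeddings, N can be millions
                          U: torch.Tensor,          # (d, r) projection factors for this query
                          V: torch.Tensor,          # (d, r)
                          chunk: int = 1_000_000) -> torch.Tensor:
    q = F.normalize(q_adapted, dim=-1)
    scores = []
    for start in range(0, gallery.shape[0], chunk):
        block = gallery[start:start + chunk]                      # (c, d)
        adapted = (block @ V) @ U.T                               # two thin matmuls, O(c * d * r)
        scores.append(F.normalize(adapted, dim=-1) @ q)           # cosine similarity to the query
    return torch.cat(scores)                                      # (N,) scores for ranking
```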
@inproceedings{xing2025quari,
title = {QuARI: Query Adaptive Retrieval Improvement},
author = {Xing, Eric and Stylianou, Abby and Pless, Robert and Jacobs, Nathan},
booktitle = {The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS)},
year = {2025}
}