
Add Patch Representation Refinement module #2678

Closed
sinahmr wants to merge 1 commit into huggingface:main from sinahmr:locatvit-prr

Conversation

Contributor

@sinahmr sinahmr commented Mar 8, 2026

This branch adds the Patch Representation Refinement (PRR) module from Locality-Attending Vision Transformer (ICLR 2026) (paper, code).

PRR is a parameter-free multi-head self-attention layer applied before the classification head. In a standard ViT, only the [CLS] token receives direct supervision from the classification loss, leaving the final-layer patch representations under-optimized for dense prediction. PRR addresses this by aggregating information from all positions non-uniformly, ensuring diverse gradient flow to the spatial tokens.

Changes

  • timm/layers/prr.py (new): PRR module with support for both fused (scaled_dot_product_attention) and manual attention paths.
  • timm/layers/__init__.py: Export PRR.
  • timm/models/vision_transformer.py: Add prr parameter to VisionTransformer. When enabled, PRR is applied in forward_head before pooling. Defaults to off, so no behavioral change for existing models.
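For illustration, here is a minimal sketch of the parameter-free attention idea described above. The function name, head count, and shapes are mine for the sketch, not the PR's actual `timm/layers/prr.py`:

```python
import torch
import torch.nn.functional as F

def prr_sketch(x: torch.Tensor, num_heads: int = 6) -> torch.Tensor:
    """Parameter-free MHSA: Q = K = V = x, with no learned projections.

    Toy sketch of the concept only; the PR's module also provides a
    manual attention path alongside the fused one.
    """
    B, N, C = x.shape
    head_dim = C // num_heads
    # Split channels into heads: (B, num_heads, N, head_dim)
    q = k = v = x.reshape(B, N, num_heads, head_dim).transpose(1, 2)
    # Fused path: softmax(q @ k^T / sqrt(head_dim)) @ v
    out = F.scaled_dot_product_attention(q, k, v)
    return out.transpose(1, 2).reshape(B, N, C)

tokens = torch.randn(2, 197, 384)     # e.g. ViT-S/16: 196 patches + [CLS]
refined = prr_sketch(tokens)
assert refined.shape == tokens.shape  # token count and width preserved
```

Since there are no learned weights, the refinement adds no parameters to the model; it only redistributes information among the existing tokens before the head.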

@rwightman
Collaborator

@sinahmr this is fairly redundant when the attention pool is already there as an option, isn't it?

@sinahmr
Contributor Author

sinahmr commented Mar 9, 2026

Thanks for the feedback @rwightman.

Conceptually, PRR is not intended to improve the pooled [CLS] representation itself. Its purpose is to refine the final-layer patch tokens by allowing the classification signal to propagate to spatial tokens in a more diverse, content-dependent way during pretraining.

We also compare against GAP in the paper. Unlike global_pool='token', which mainly supervises the [CLS] path, GAP does route supervision to the patch tokens, but it does so uniformly across spatial locations. For dense prediction, that uniform gradient flow is not ideal. PRR instead redistributes information across tokens non-uniformly before the head. More details are provided in Section 4.2, and the empirical comparison to GAP is shown in Table 5, where PRR yields clear segmentation gains.
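The uniform-gradient point can be seen in a toy autograd example (my illustration, not code from the paper or PR):

```python
import torch

# GAP: every token receives the identical gradient 1/N from the pooled output.
x = torch.randn(4, 8, requires_grad=True)   # 4 toy tokens, width 8
x.mean(dim=0).sum().backward()
assert torch.allclose(x.grad, torch.full_like(x, 1 / 4))

# Attention-style pooling: content-dependent weights give non-uniform gradients.
x2 = torch.randn(4, 8, requires_grad=True)
w = torch.softmax((x2 @ x2[0]) / 8 ** 0.5, dim=0).detach()  # per-token weights
(w.unsqueeze(1) * x2).sum().backward()
# with w treated as fixed, token i's gradient equals its weight w_i
assert not torch.allclose(x2.grad[0], x2.grad[1])
```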

On the implementation side, I agree that the current integration may not be ideal. If it would be better represented as part of the pooling path, I can rework it that way, for example by treating it as an option for global_pool. Please let me know if that would fit better.

If the concern is that this is too paper-specific to justify a dedicated switch in vision_transformer.py, I understand that as well. In that case, I would be happy to close the PR.

@rwightman
Collaborator

@sinahmr could you take a look at #2685 ... does it achieve the same end goal? I don't want to introduce high-risk changes here, so I feel using it as an alternate pooling mechanism is more appropriate

@sinahmr
Contributor Author

sinahmr commented Mar 17, 2026

Changes in #2685 look good to me, thank you for the implementation!
I'll close this PR.

