Weikang Meng, Yadan Luo, Liangyu Huo, Yingjian Li, Yaowei Wang, Zheng Zhang*, Heng Tao Shen
🚀 Welcome to the repo of PolaFormer++!
This repo contains the official PyTorch code for PolaFormer++, the next version of PolaFormer[1].
Although Transformers have achieved remarkable success in vision tasks, the quadratic complexity of their self-attention mechanism limits scalability, motivating the development of linear attention approaches that reformulate attention computation through kernel-based feature mappings with linear complexity. However, existing linear attention models often suffer from significant performance degradation due to overly simplistic feature maps, which restrict their practical applicability. In particular, current feature map designs face two fundamental limitations: the non-negativity constraint that discards negative correlations, and the use of overly smooth kernels that increase the entropy of attention weights, thereby weakening discriminability. To address these challenges, we propose Polarity-Aware Linear Attention with Channel-wise Spikiness (PolaFormer++), which is a unified framework that captures complete query–key interactions while preserving the spiky property as in Softmax attention with a flexible channel-wise feature map.
In this paper, we present a polarity-aware linear attention mechanism that explicitly models both same-signed and opposite-signed query-key interactions, so that every component of the inner product is captured. To restore the spiky property of attention maps, we prove that there exists a class of element-wise functions (with positive first and second derivatives) that reduces the entropy of the attention distribution. Finally, we employ a learnable power function that dynamically rescales the magnitudes of same-signed and opposite-signed components, allowing strong and weak attention signals to be effectively separated.
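As a rough illustration of the entropy claim, the toy sketch below raises a vector of non-negative scores to a power greater than 1 (a function with positive first and second derivatives on positive inputs) and compares the entropy of the normalized weights before and after. The score values, the exponent, and all function names are assumptions chosen for illustration. This is not the learnable feature map used in this repo.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def power_rescale(scores, power):
    """Element-wise x ** power; convex and increasing on x > 0 when power > 1."""
    return [s ** power for s in scores]


# Hypothetical non-negative similarity scores for one query (illustrative only).
scores = [0.9, 0.5, 0.3, 0.1]

base_entropy = entropy(normalize(scores))
sharp_entropy = entropy(normalize(power_rescale(scores, 3.0)))

print(f"entropy before rescaling: {base_entropy:.4f}")
print(f"entropy after rescaling:  {sharp_entropy:.4f}")
```

On this toy input the rescaled weights concentrate on the largest score, so the entropy drops.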
PolaFormer: Polarity-aware Linear Attention
To avoid the information loss caused by non-negative feature maps, our polarity-aware attention mechanism separates query-key pairs based on their polarity and computes their interactions independently. The attention weights are calculated as follows:
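The idea of separating by polarity can be checked numerically. The sketch below assumes the usual split of a vector into its positive and negative parts, `x = x_pos - x_neg`, and verifies that the inner product equals the same-signed terms minus the opposite-signed terms. The helper names and example vectors are hypothetical. This only demonstrates the algebraic identity, not the attention computation implemented in this repo.

```python
def relu(x):
    """Keep positive entries, zero out the rest."""
    return [max(v, 0.0) for v in x]


def dot(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))


def polarity_split(x):
    """Return (x_pos, x_neg), both non-negative, with x = x_pos - x_neg."""
    return relu(x), relu([-v for v in x])


def polarity_terms(q, k):
    """Return the same-signed and opposite-signed parts of <q, k>."""
    q_pos, q_neg = polarity_split(q)
    k_pos, k_neg = polarity_split(k)
    same = dot(q_pos, k_pos) + dot(q_neg, k_neg)
    opposite = dot(q_pos, k_neg) + dot(q_neg, k_pos)
    return same, opposite


# Hypothetical query and key vectors (illustrative only).
q = [0.5, -1.0, 2.0, -0.25]
k = [1.5, 0.5, -1.0, -2.0]

same, opposite = polarity_terms(q, k)
print("same-signed part:    ", same)
print("opposite-signed part:", opposite)
print("same - opposite:     ", same - opposite)
print("direct inner product:", dot(q, k))
```

Both parts are non-negative on their own, and a feature map that keeps only positive entries would retain just the `dot(q_pos, k_pos)` term.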
To avoid the pitfalls of subtractive operations, we instead introduce a Polarity Gating Matrix
Polarity-aware Feature Map with Channel-wise Spikiness
Considering that different channels contribute unequally to feature representation, we introduce a channel-wise temperature vector
- Python 3.9
- PyTorch == 1.11.0
- torchvision == 0.12.0
- numpy
- timm == 0.4.12
- einops
- yacs
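To compare an existing environment against the list above, a small check along these lines may help. It is a sketch that relies only on the standard library's `importlib.metadata`. The assumption that the PyPI distribution names are `torch`, `torchvision`, `timm`, `numpy`, `einops`, and `yacs`, as well as the function name `check_environment`, are ours and not part of this repo.

```python
from importlib import metadata

# Versions pinned in the dependency list; distribution names are assumed.
PINNED = {"torch": "1.11.0", "torchvision": "0.12.0", "timm": "0.4.12"}
# Packages listed without a version pin.
UNPINNED = ["numpy", "einops", "yacs"]


def check_environment(pinned=PINNED, unpinned=UNPINNED):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    for name, wanted in pinned.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (want {wanted})")
            continue
        # Ignore local build tags such as "+cu113" when comparing versions.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found} (want {wanted})")
    for name in unpinned:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
    return problems


for problem in check_environment():
    print(problem)
```

The script prints nothing when every listed package is present with the pinned version.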
The ImageNet dataset should be prepared as follows:
$ tree data
imagenet
├── train
│ ├── class1
│ │ ├── img1.jpeg
│ │ ├── img2.jpeg
│ │ └── ...
│ ├── class2
│ │ ├── img3.jpeg
│ │ └── ...
│ └── ...
└── val
  ├── class1
  │ ├── img4.jpeg
  │ ├── img5.jpeg
  │ └── ...
  ├── class2
  │ ├── img6.jpeg
  │ └── ...
  └── ...
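Before launching a long training run, it can be worth confirming that the folder matches the layout above. The following sketch walks the `train` and `val` directories and reports missing splits, empty class folders, and class folders that appear in only one split. The function names and the root path in the usage note are assumptions. The training scripts in this repo may locate the dataset differently.

```python
from pathlib import Path


def summarize_split(split_dir):
    """Map each class folder name to the number of files it contains."""
    return {
        class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
        for class_dir in sorted(Path(split_dir).iterdir())
        if class_dir.is_dir()
    }


def check_imagenet_layout(root):
    """Return a list of layout problems; an empty list means the layout looks fine."""
    root = Path(root)
    problems = []
    summaries = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing directory: {split}")
            continue
        summaries[split] = summarize_split(split_dir)
        if not summaries[split]:
            problems.append(f"no class folders in: {split}")
        for class_name, count in summaries[split].items():
            if count == 0:
                problems.append(f"empty class folder: {split}/{class_name}")
    # Both splits are expected to share the same set of class folders.
    if len(summaries) == 2 and set(summaries["train"]) != set(summaries["val"]):
        problems.append("train and val contain different class folders")
    return problems
```

For example, `check_imagenet_layout("data/imagenet")` (a hypothetical path) returns an empty list when both splits exist, every class folder holds at least one file, and the class names match.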
To train our model on ImageNet from scratch, see pretrain.sh and run:
bash pretrain.sh
| Model | Params | FLOPs | Acc |
|---|---|---|---|
| b0 | 7M | 0.7G | 76.9% |
| b1 | 9M | 1.0G | 79.1% |
| b2 | 16M | 2.7G | 82.8% |
| b3 | 27M | 5.1G | 84.3% |
| b4 | 50M | 11G | 85.4% |
| b5 | 97M | 18G | 85.6% |
[1] Meng, Weikang, Yadan Luo, Xin Li, Dongmei Jiang, and Zheng Zhang. "PolaFormer: Polarity-aware Linear Attention for Vision Transformers." In The Thirteenth International Conference on Learning Representations.
This code is developed on top of PolaFormer [1].





