𝗗𝗮𝘆-𝟯𝟲𝟭 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴

𝗦𝗼𝘂𝘁𝗵 𝗖𝗵𝗶𝗻𝗮 𝗨𝗻𝗶𝘃𝗲𝗿𝘀𝗶𝘁𝘆 𝗼𝗳 𝗧𝗲𝗰𝗵𝗻𝗼𝗹𝗼𝗴𝘆 𝗮𝗻𝗱 𝗔𝗹𝗶𝗯𝗮𝗯𝗮 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵𝗲𝗿𝘀 𝗵𝗮𝘃𝗲 𝗽𝘂𝗯𝗹𝗶𝘀𝗵𝗲𝗱 𝗘𝗟𝗦𝗔: 𝗘𝗻𝗵𝗮𝗻𝗰𝗲𝗱 𝗟𝗼𝗰𝗮𝗹 𝗦𝗲𝗹𝗳-𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿

Follow me for similar posts: 🇮🇳 Ashish Patel

-------------------------------------------------------------------

𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗙𝗮𝗰𝘁𝘀:

🔸 Paper: 𝗘𝗟𝗦𝗔: 𝗘𝗻𝗵𝗮𝗻𝗰𝗲𝗱 𝗟𝗼𝗰𝗮𝗹 𝗦𝗲𝗹𝗳-𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗳𝗼𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿
🔸 The paper was released in 2021.
🔸 The authors investigate LSA and its counterparts in detail, from channel settings to spatial processing, to understand empirically why LSA is only mediocre. They reveal that relative position embedding and the neighboring filter application are the critical reasons why DwConv and dynamic filters perform on par with or better than LSA.

-------------------------------------------------------------------

𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘

🔸 Self-attention is powerful at modeling long-range dependencies but weak at local, finer-level feature learning. The performance of local self-attention (LSA) is merely on par with convolution and inferior to dynamic filters, which puzzles researchers: should we use LSA or its counterparts, which one is better, and what makes LSA mediocre?
🔸 To clarify this, the authors comprehensively investigate LSA and its counterparts from two sides: channel setting and spatial processing. They find that the devil lies in the generation and application of spatial attention, where relative position embeddings and the neighboring filter application are the key factors.
🔸 Based on these findings, they propose enhanced local self-attention (ELSA) with Hadamard attention and a ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention in the neighboring case while maintaining a high-order mapping.
🔸 The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA: without any architecture or hyperparameter modification, drop-in replacing LSA with ELSA boosts Swin Transformer by up to +1.4 top-1 accuracy. (A rough sketch of how the two ideas fit together appears after this post.)
🔸 ELSA also consistently benefits VOLO from D1 to D5, with ELSA-VOLO-D5 reaching 87.2 top-1 accuracy on ImageNet-1K without extra training images. In downstream tasks, ELSA improves the baseline by up to +1.9 box AP / +1.3 mask AP on COCO and by up to +1.9 mIoU on ADE20K.

#computervision #artificialintelligence #innovation
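To make "Hadamard attention" and the "ghost head" concrete, here is a minimal PyTorch sketch written from the description above, not from the authors' reference code. The layer names (attn_gen, ghost), the tensor shapes, and the unfold-based neighboring filter application are my assumptions for illustration; details such as relative position embedding and the paper's exact normalization are omitted.

```python
# Toy ELSA-style block: Hadamard attention + ghost head (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ELSASketch(nn.Module):
    def __init__(self, dim, num_heads=4, kernel_size=3, ghost_mul=2):
        super().__init__()
        assert dim % (num_heads * ghost_mul) == 0
        self.h, self.ks, self.g = num_heads, kernel_size, ghost_mul
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Maps the Hadamard product q*k to a ks*ks attention map per head.
        self.attn_gen = nn.Conv2d(dim, num_heads * kernel_size ** 2, 1)
        # Static "ghost" matrices: expand h dynamic heads to g*h heads cheaply.
        self.ghost = nn.Parameter(torch.randn(ghost_mul, num_heads, kernel_size ** 2))

    def forward(self, x):                                    # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Hadamard attention: element-wise q*k replaces the dot product,
        # keeping a high-order (quadratic) q-k interaction at low cost.
        attn = self.attn_gen(q * k)                          # (B, h*ks^2, H, W)
        attn = attn.view(B, 1, self.h, self.ks ** 2, H, W)
        # Ghost head: modulate each dynamic head with static matrices,
        # raising channel capacity without generating more attention maps.
        attn = attn * self.ghost.view(1, self.g, self.h, self.ks ** 2, 1, 1)
        attn = attn.softmax(dim=3)                           # normalize over the window
        # Apply the attention as a neighboring filter via unfold.
        v = F.unfold(v, self.ks, padding=self.ks // 2)       # (B, C*ks^2, H*W)
        gh = self.g * self.h
        v = v.view(B, gh, C // gh, self.ks ** 2, H, W)
        out = (attn.reshape(B, gh, 1, self.ks ** 2, H, W) * v).sum(dim=3)
        return out.reshape(B, C, H, W)

x = torch.randn(2, 64, 14, 14)
print(ELSASketch(64)(x).shape)   # torch.Size([2, 64, 14, 14])
```

The design point the sketch tries to convey: the Hadamard product q*k keeps a second-order interaction between query and key, like dot-product attention, but at per-pixel cost, while the static ghost matrices multiply the number of effective heads without computing additional attention maps.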
Thanks for sharing
Wow man, it's great to see a new paper in your posts each day. I'm curious, 🇮🇳 Ashish Patel: have you come up with any ideas for a new approach, or at least an improvement to a current CV algorithm, after going through these papers?