Ashish Patel 🇮🇳’s Post

𝗗𝗮𝘆-𝟯𝟳𝟭 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 — 𝗟𝗮𝘄𝗶𝗻 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿: 𝗜𝗺𝗽𝗿𝗼𝘃𝗶𝗻𝗴 𝗦𝗲𝗺𝗮𝗻𝘁𝗶𝗰 𝗦𝗲𝗴𝗺𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿 𝘄𝗶𝘁𝗵 𝗠𝘂𝗹𝘁𝗶-𝗦𝗰𝗮𝗹𝗲 𝗥𝗲𝗽𝗿𝗲𝘀𝗲𝗻𝘁𝗮𝘁𝗶𝗼𝗻𝘀 𝘃𝗶𝗮 𝗟𝗮𝗿𝗴𝗲 𝗪𝗶𝗻𝗱𝗼𝘄 𝗔𝘁𝘁𝗲𝗻𝘁𝗶𝗼𝗻 𝗯𝘆 𝗕𝗲𝗶𝗷𝗶𝗻𝗴 𝗨𝗻𝗶𝘃𝗲𝗿𝘀𝗶𝘁𝘆 𝗼𝗳 𝗣𝗼𝘀𝘁𝘀 𝗮𝗻𝗱 𝗧𝗲𝗹𝗲𝗰𝗼𝗺𝗺𝘂𝗻𝗶𝗰𝗮𝘁𝗶𝗼𝗻𝘀

Follow me for similar posts: @🇮🇳 Ashish Patel

-------------------------------------------------------------------
𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗙𝗮𝗰𝘁𝘀:
🔸 Paper: Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention
🔸 The paper was published on arXiv in 2021.
🔸 The authors develop an efficient semantic segmentation transformer called Lawin Transformer. Its decoder captures rich contextual information at multiple scales, built on the proposed large window attention. Compared to existing efficient semantic segmentation transformers, Lawin Transformer achieves higher performance at lower computational expense. Experiments on the Cityscapes, ADE20K and COCO-Stuff datasets yield state-of-the-art results on these benchmarks.

-------------------------------------------------------------------
𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘
🔸 Multi-scale representations are crucial for semantic segmentation. The community has witnessed a flourish of semantic segmentation convolutional neural networks (CNNs) exploiting multi-scale contextual information.
🔸 Motivated by the strength of the vision transformer (ViT) in image classification, several semantic segmentation ViTs have recently been proposed; most attain impressive results but at the expense of computational efficiency.
🔸 This paper introduces multi-scale representations into the semantic segmentation ViT via a window attention mechanism, improving both performance and efficiency.
🔸 To this end, the authors introduce large window attention, which allows a local window to query a larger context window at only a small computational overhead.
By regulating the ratio of the context area to the query area, large window attention captures contextual information at multiple scales.
🔸 Moreover, the framework of spatial pyramid pooling is adopted in collaboration with large window attention, yielding a novel decoder named large window attention spatial pyramid pooling (LawinASPP) for the semantic segmentation ViT.
🔸 The resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and LawinASPP as the decoder. Empirical results demonstrate that Lawin Transformer offers improved efficiency over existing methods, and it sets new state-of-the-art performance on Cityscapes (84.4% mIoU), ADE20K (56.2% mIoU) and COCO-Stuff.

#computervision #artificialintelligence #innovation
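The core mechanism above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration only: it omits the learned query/key/value projections, the relative position bias, and the MLP-based context refinement of the actual paper, and pools the large context window with plain average pooling. The function name and hyperparameters (`patch`, `ratio`) are illustrative, not from the authors' code. The point it shows is the one named in the text: each `patch × patch` query window attends to a context window `ratio` times larger on each side, and because the context is pooled back down to `patch × patch`, the attention cost stays close to that of ordinary window attention regardless of `ratio`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def large_window_attention(x, patch=4, ratio=2):
    """Sketch of large window attention (single head, no learned projections).

    x: feature map of shape (C, H, W), with H and W divisible by `patch`.
    Each patch x patch query window attends to a (ratio*patch)^2 context
    window centered on it, average-pooled back to patch x patch so the
    attention matrix stays (patch^2 x patch^2) for any ratio.
    """
    C, H, W = x.shape
    p, r = patch, ratio
    pad = (r * p - p) // 2  # zero-pad so context windows stay in bounds
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(0, H, p):
        for j in range(0, W, p):
            # queries: the plain p x p window, flattened to (p*p, C)
            q = x[:, i:i + p, j:j + p].reshape(C, -1).T
            # context: the r*p x r*p window centered on the query window
            ctx = xp[:, i:i + r * p, j:j + r * p]
            # block-average-pool context down to p x p (keeps cost constant)
            ctx = ctx.reshape(C, p, r, p, r).mean(axis=(2, 4))
            k = ctx.reshape(C, -1).T                      # (p*p, C)
            # scaled dot-product attention within the window
            attn = softmax(q @ k.T / np.sqrt(C))          # (p*p, p*p)
            out[:, i:i + p, j:j + p] = (attn @ k).T.reshape(C, p, p)
    return out
```

A LawinASPP-style decoder then runs several of these branches in parallel with different ratios (e.g. `ratio` in {2, 4, 8} in the paper) alongside a pooling branch, and concatenates the outputs — that is what gives the decoder its multi-scale view of context.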
