𝗗𝗮𝘆-𝟯𝟱𝟵 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴
SeMask: Semantically Masked Transformers for Semantic Segmentation

Follow me for similar posts: 🇮🇳 Ashish Patel
-------------------------------------------------------------------
𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗙𝗮𝗰𝘁𝘀 :
🔸 Paper: SeMask: Semantically Masked Transformers for Semantic Segmentation
🔸 Published on arXiv in 2021.
🔸 The paper argues that directly finetuning off-the-shelf pretrained transformer backbones as encoders for semantic segmentation does not consider the semantic context tied up with the images. The authors claim that adding a semantic prior to guide the encoder's feature modeling enhances the finetuning process for semantic segmentation. To support this claim, they propose the SeMask block, which can be plugged into any existing hierarchical vision transformer and uses a semantic attention operation to capture the semantic context and augment the semantic representation of the feature maps.
-------------------------------------------------------------------
𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘
🔸 Finetuning a pretrained backbone in the encoder of an image transformer network has been the traditional approach for the semantic segmentation task.
🔹 However, such an approach leaves out the semantic context that an image provides during the encoding stage.
🔸 The paper argues that incorporating the semantic information of the image into pretrained hierarchical transformer-based backbones while finetuning improves performance considerably.
🔹 To achieve this, the authors propose SeMask, a simple and effective framework that incorporates semantic information into the encoder with the help of a semantic attention operation (a minimal sketch follows at the end of this post).
🔸 In addition, a lightweight semantic decoder is used during training to supervise the intermediate semantic prior maps at every stage.
🔹 Experiments demonstrate that incorporating semantic priors enhances the performance of established hierarchical encoders with only a slight increase in the number of FLOPs.
🔸 The authors provide empirical proof by integrating SeMask into each variant of the Swin Transformer as the encoder, paired with different decoders.
🔹 The framework achieves a new state-of-the-art of 58.22% mIoU on the ADE20K dataset and improvements of over 3% in the mIoU metric on the Cityscapes dataset.

#computervision #artificialintelligence #innovation
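-------------------------------------------------------------------
To make the idea concrete, here is a minimal PyTorch sketch of what a SeMask-style block could look like, based only on the paper's description above: project a stage's tokens to per-pixel class scores (the semantic prior map), turn them into semantic attention weights, and use those to augment the features. The class name SeMaskBlock, the two linear projections, and the softmax weighting are my assumptions for illustration, not the official implementation.

import torch
import torch.nn as nn

class SeMaskBlock(nn.Module):
    """Illustrative SeMask-style block for one hierarchical encoder stage."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Projects stage features to per-pixel class scores (semantic prior map).
        self.to_semantic = nn.Linear(dim, num_classes)
        # Projects the semantic weights back to feature space for augmentation.
        self.from_semantic = nn.Linear(num_classes, dim)

    def forward(self, x: torch.Tensor):
        # x: (B, N, C) tokens from a hierarchical transformer stage (e.g. Swin).
        prior = self.to_semantic(self.norm(x))   # (B, N, num_classes)
        attn = prior.softmax(dim=-1)             # semantic attention weights
        x = x + self.from_semantic(attn)         # residual semantic augmentation
        # During training, `prior` would be supervised by the lightweight
        # semantic decoder the paper describes (an auxiliary per-stage loss).
        return x, prior

# Example usage: stage-1 Swin-T tokens for a 224x224 input, 150 ADE20K classes.
block = SeMaskBlock(dim=96, num_classes=150)
tokens = torch.randn(2, 56 * 56, 96)
feats, prior = block(tokens)   # feats: (2, 3136, 96), prior: (2, 3136, 150)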