Ashish Patel 🇮🇳’s Post

𝗗𝗮𝘆-𝟮𝟰𝟯 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗔 𝗕𝗮𝘁𝘁𝗹𝗲 𝗼𝗳 𝗡𝗲𝘁𝘄𝗼𝗿𝗸 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝘀: An Empirical Study of CNN, Transformer, and MLP by Microsoft Asia Follow me for a similar post: 🇮🇳 Ashish Patel Interesting Facts : 🔸 This paper is published ICCV2021. ------------------------------------------------------------------- 𝗔𝗺𝗮𝘇𝗶𝗻𝗴 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 : https://lnkd.in/ertcQQtP ------------------------------------------------------------------- 𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘 🔸 Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results in the ImageNet classification task.  🔸 In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH which adopts separate modules for spatial and channel processing.  🔸 Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules.  🔸 The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPS. It is already on par with the SOTA models with sophisticated designs. The code and models will be made publicly available. #computervision #artificialintelligence #machinelearning

  • diagram
See more comments

To view or add a comment, sign in

Explore content categories