How Do Vision Transformers Work? | Ashish Patel 🇮🇳

Oracle•105K followers

𝗗𝗮𝘆-𝟰𝟭𝟮 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴
How Do Vision Transformers Work? by Yonsei University, South Korea
Follow me for similar posts: Ashish Patel
-------------------------------------------------------------------
𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗙𝗮𝗰𝘁𝘀 :
🔸 This paper was published on arXiv in 2022.
👉 Our present work demonstrates that MSAs are not merely generalized Convs, but rather generalized spatial smoothings that complement Convs. MSAs help NNs learn strong representations by ensembling feature map points and flattening the loss landscape. Since the main objective of this work is to investigate the nature of MSA for computer vision, we preserve the architectures of Conv and MSA blocks in AlterNet. Thus, AlterNet has a strong potential for future improvements.
-------------------------------------------------------------------
𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘
✔️ The success of multi-head self-attentions (MSAs) for computer vision is now indisputable. However, little is known about how MSAs work.
✔️ We present fundamental explanations to help better understand the nature of MSAs. In particular, we demonstrate the following properties of MSAs and Vision Transformers (ViTs):
💁 (1) MSAs improve not only accuracy but also generalization by flattening the loss landscapes. Such improvement is primarily attributable to their data specificity, not long-range dependency. On the other hand, ViTs suffer from non-convex losses. Large datasets and loss landscape smoothing methods alleviate this problem;
💁 (2) MSAs and Convs exhibit opposite behaviors. For example, MSAs are low-pass filters, but Convs are high-pass filters. Therefore, MSAs and Convs are complementary;
💁 (3) Multi-stage neural networks behave like a series connection of small individual models. In addition, MSAs at the end of a stage play a key role in prediction.
Based on these insights, we propose AlterNet, a model in which Conv blocks at the end of a stage are replaced with MSA blocks. AlterNet outperforms CNNs not only in large data regimes but also in small data regimes.
#computervision #artificialintelligence #data
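To make point (2) a bit more concrete, here is a minimal NumPy sketch of the low-pass vs. high-pass contrast. It is a toy under stated assumptions, not the paper's experiment: the "attention" weights here depend only on position (real MSA weights come from learned, data-dependent query-key products), the Conv is a hand-picked Laplacian-like kernel (one example of a high-pass Conv, not a trained one), and the names `toy_attention_weights` and `high_freq_ratio` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def toy_attention_weights(n, bandwidth=3.0):
    """Row-stochastic weights from position-only logits.
    Toy assumption: real MSA logits are learned and data-dependent."""
    pos = np.arange(n)
    logits = -((pos[:, None] - pos[None, :]) ** 2) / (2.0 * bandwidth ** 2)
    return softmax(logits, axis=-1)

def high_freq_ratio(signal, cutoff_fraction=0.25):
    """Share of spectral energy at or above the cutoff frequency bin."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    cutoff = int(len(power) * cutoff_fraction)
    return power[cutoff:].sum() / power.sum()

# A slow sine wave plus white noise stands in for one row of a feature map.
rng = np.random.default_rng(0)
n = 128
t = np.arange(n)
signal = np.sin(2 * np.pi * 2 * t / n) + 0.5 * rng.standard_normal(n)

# Attention-style aggregation: every output is a weighted average of inputs
# (weights are positive and each row sums to 1), i.e. a spatial smoothing.
weights = toy_attention_weights(n)
smoothed = weights @ signal

# A hand-picked Laplacian-like kernel as an example of a high-pass Conv.
sharpened = np.convolve(signal, np.array([-1.0, 2.0, -1.0]), mode="same")

base_ratio = high_freq_ratio(signal)
smoothed_ratio = high_freq_ratio(smoothed)
sharpened_ratio = high_freq_ratio(sharpened)
print(f"high-frequency share  original: {base_ratio:.3f}")
print(f"high-frequency share  smoothed: {smoothed_ratio:.3f}")
print(f"high-frequency share  sharpened: {sharpened_ratio:.3f}")
```

In this toy setup the averaging step lowers the high-frequency share and the Laplacian-like kernel raises it, which is the opposite behavior the post describes. It says nothing about trained networks; for that, see the paper and its code.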

1 Comment

Ashish Patel 🇮🇳

Code: https://github.com/xxxnell/how-do-vits-work
Paper: https://arxiv.org/abs/2202.06709v1
If you are interested in learning computer vision, join the groups below: Computer Vision group for the latest research, Computer Vision Code, Computer Vision Memes, Computer Vision Courses, Computer Vision Posts, and Computer Vision Videos.
Telegram Group: https://t.me/joinchat/_YaCtAkPRecxNjM9
