𝗗𝗮𝘆-𝟯𝟰𝟴 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴
Google AI’s ‘TokenLearner’ Can Improve Vision Transformer Efficiency And Accuracy
Follow me for a similar post: 🇮🇳 Ashish Patel
-------------------------------------------------------------------
𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗙𝗮𝗰𝘁𝘀 :
🔸 Paper: TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?
🔸 This paper was posted on arXiv in 2021.
🔸 Transformer models consistently obtain state-of-the-art results on computer vision tasks, including object detection and video classification. Unlike standard convolutional approaches, which process images pixel by pixel, Vision Transformers (ViT) treat an image as a sequence of visual tokens. To obtain those tokens, they rely on hand-designed splitting algorithms, which entails processing a large number of densely sampled patches.
🔹 Instead of taking the traditional way, Google AI developed a method that learns to extract critical tokens from visual data. TokenLearner is a technique developed by the researchers to quickly and accurately locate a few key visual tokens for Vision Transformers.
-------------------------------------------------------------------
𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘
🔸 In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.
🔹 Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data.
🔸 This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks.
🔸 Importantly, due to our tokens being adaptive, we accomplish competitive results at a significantly reduced compute cost.
We obtain results comparable to the state of the art on ImageNet while being computationally more efficient. We establish a new state of the art on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD.
-------------------------------------------------------------------
#computervision #artificialintelligence #innovation
-------------------------------------------------------------------
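𝗤𝘂𝗶𝗰𝗸 𝗦𝗸𝗲𝘁𝗰𝗵
🔹 To make "learning to mine important tokens" a bit more concrete, here is a minimal pure-Python sketch of the general idea: each output token is a weighted average of all patch features, where the weights come from a learned attention map over the patches. This is a simplification and not the paper's implementation. The function name `token_learner`, the single scoring vector per token (standing in for the learned layers), the 196-patch and 16-channel sizes, and the random untrained weights are all assumptions made for illustration.

```python
import math
import random


def token_learner(patches, weights):
    """Reduce N patch tokens to S tokens (illustrative sketch only).

    patches: N lists of C floats, one feature vector per patch.
    weights: S lists of C floats, one hypothetical scoring vector per output token.
    """
    num_patches = len(patches)
    num_channels = len(patches[0])
    tokens = []
    for w in weights:
        # Score every patch and squash to (0, 1): one attention value per patch.
        attn = [
            1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, p))))
            for p in patches
        ]
        # Weight each patch by its attention value, then average over all patches.
        token = [
            sum(a * p[c] for a, p in zip(attn, patches)) / num_patches
            for c in range(num_channels)
        ]
        tokens.append(token)
    return tokens


random.seed(0)
# Assumed sizes: 14 x 14 = 196 patches with 16 channels each, reduced to 8 tokens.
patches = [[random.gauss(0, 1) for _ in range(16)] for _ in range(196)]
weights = [[random.gauss(0, 0.1) for _ in range(16)] for _ in range(8)]  # untrained
tokens = token_learner(patches, weights)
print(len(patches), "->", len(tokens), "tokens of size", len(tokens[0]))
```

🔸 In a real model the scoring weights would be trained end to end with the rest of the network, so the attention maps adapt to each input. With the assumed sizes above, pairwise attention over 8 tokens covers 8 × 8 = 64 token pairs instead of 196 × 196 = 38,416, which is where the compute savings come from.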