𝗗𝗮𝘆-𝟯𝟬𝟭 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴
𝗧𝗲𝗺𝗽𝗼𝗿𝗮𝗹-𝗮𝘁𝘁𝗲𝗻𝘁𝗶𝘃𝗲 𝗖𝗼𝘃𝗮𝗿𝗶𝗮𝗻𝗰𝗲 𝗣𝗼𝗼𝗹𝗶𝗻𝗴 (𝗧𝗖𝗣) Networks for Video Recognition, by Dalian University of Technology, China
Follow me for similar posts: 🇮🇳 Ashish Patel
-------------------------------------------------------------------
𝗜𝗻𝘁𝗲𝗿𝗲𝘀𝘁𝗶𝗻𝗴 𝗙𝗮𝗰𝘁𝘀 :
🔸 This paper was published at NeurIPS 2021.
🔸 It introduces Temporal-attentive Covariance Pooling (TCP), which works better than global average pooling.
🔸 TCP consists of (i) a temporal attention module for adaptively calibrating spatio-temporal features, (ii) a temporal covariance pooling to characterize intra-frame correlations and inter-frame cross-correlations of the calibrated features in a temporal manner, and (iii) a fast matrix power normalization to exploit the geometry of covariance matrices.
-------------------------------------------------------------------
𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘
🔸 For video recognition, a global representation summarizing the whole content of a video snippet plays an important role in final performance. However, existing video architectures usually generate it with simple global average pooling (GAP), which has limited ability to capture the complex dynamics of videos.
🔸 For image recognition, there exists evidence that covariance pooling has a stronger representation ability than GAP. Unfortunately, the plain covariance pooling used in image recognition is an orderless representation, which cannot model the spatio-temporal structure inherent in videos.
🔸 Therefore, this paper proposes Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations. Specifically, TCP first develops a temporal attention module to adaptively calibrate spatio-temporal features for the succeeding covariance pooling, approximately producing attentive covariance representations.
🔸 Then, a temporal covariance pooling performs temporal pooling of the attentive covariance representations to characterize both intra-frame correlations and inter-frame cross-correlations of the calibrated features. As such, TCP can capture complex temporal dynamics.
🔸 Finally, a fast matrix power normalization is introduced to exploit the geometry of covariance representations. Note that TCP is model-agnostic and can be flexibly integrated into any video architecture, resulting in TCPNet for effective video recognition.
🔸 Extensive experiments on six benchmarks with various video architectures show that TCPNet is clearly superior to its counterparts while having strong generalization ability.
-------------------------------------------------------------------
#computervision #artificialintelligence #deeplearning
-------------------------------------------------------------------
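To make the two key operations concrete, here is a minimal NumPy sketch of covariance pooling over time and of a fast square-root normalization via the Newton-Schulz iteration (a standard eigendecomposition-free way to realize matrix power normalization). This is an illustrative simplification, not the authors' implementation; the function names, tensor shapes, and iteration count are my own assumptions.

```python
import numpy as np

def cov_pool(x):
    """Covariance pooling of a (C, N) feature matrix (C channels, N positions)."""
    xc = x - x.mean(axis=1, keepdims=True)     # center each channel
    return xc @ xc.T / (x.shape[1] - 1)        # (C, C) covariance

def temporal_cov_pool(frames):
    """Sketch of temporal covariance pooling.
    frames: (T, C, N) calibrated features for T frames. Stacking channels
    across time and taking one covariance yields a block matrix whose
    diagonal blocks hold intra-frame correlations and whose off-diagonal
    blocks hold inter-frame cross-correlations."""
    T, C, N = frames.shape
    return cov_pool(frames.reshape(T * C, N))  # (T*C, T*C)

def matrix_sqrt_newton_schulz(A, iters=12):
    """Fast matrix power normalization (power 1/2) of an SPD matrix via
    the coupled Newton-Schulz iteration, avoiding eigendecomposition."""
    C = A.shape[0]
    norm = np.trace(A)                         # pre-scale so the iteration converges
    Y, Z = A / norm, np.eye(C)
    for _ in range(iters):
        T = 0.5 * (3.0 * np.eye(C) - Z @ Y)
        Y, Z = Y @ T, T @ Z                    # Y -> (A/norm)^(1/2), Z -> its inverse
    return Y * np.sqrt(norm)                   # undo the pre-scaling
```

Intuitively, the square root shrinks large eigenvalues and boosts small ones, which is how power normalization exploits the geometry of SPD covariance matrices.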
Amazing Research : https://arxiv.org/abs/2110.14381
Code : https://github.com/ZilinGao/Temporal-attentive-Covariance-Pooling-Networks-for-Video-Recognition
Github : https://github.com/ashishpatel26/365-Days-Computer-Vision-Learning-Linkedin-Post