𝗗𝗮𝘆-𝟮𝟱𝟱 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝗔 𝗧𝘅𝗧: 𝗖𝗿𝗼𝘀𝘀𝗺𝗼𝗱𝗮𝗹 𝗘𝗻𝗱-𝘁𝗼-𝗘𝗻𝗱 𝗟𝗲𝗮𝗿𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗧𝗿𝗮𝗻𝘀𝗳𝗼𝗿𝗺𝗲𝗿𝘀 by hessian.AI , Germany Follow me for a similar post: 🇮🇳 Ashish Patel 🇮🇳 Interesting Facts : 🔸 This paper is published ICCV2021. ------------------------------------------------------------------- 𝗔𝗺𝗮𝘇𝗶𝗻𝗴 𝗥𝗲𝘀𝗲𝗮𝗿𝗰𝗵 : https://lnkd.in/eCpAuFmq ------------------------------------------------------------------- 𝗜𝗠𝗣𝗢𝗥𝗧𝗔𝗡𝗖𝗘 🔸 Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA), requires an alignment of semantic concepts across domains. 🔸Despite the widespread success of end-to-end learning, today's multimodal pipelines by and large leverage pre-extracted, fixed features from object detectors, typically Faster R-CNN, as representations of the visual world. 🔸The obvious downside is that the visual representation is not specifically tuned to the multimodal task at hand. At the same time, while transformer-based object detectors have gained popularity, they have not been employed in today's multimodal pipelines. 🔸We address both shortcomings with TxT, a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task in a fully end-to-end manner. 🔸We overcome existing limitations of transformer-based detectors for multimodal reasoning regarding the integration of global context and their scalability. Our transformer-based multimodal model achieves considerable gains from end-to-end learning for multimodal question answering. #computervision #artificialintelligence #machinelearning
https://github.com/ashishpatel26/365-Days-Computer-Vision-Learning-Linkedin-Post