Ashish Patel 🇮🇳’s Post


Day-327: Computer vision researchers from the University of North Carolina have introduced "Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization". Follow me for similar posts: 🇮🇳 Ashish Patel
-------------------------------------------------------------------
Interesting Facts:
🔸 Paper: Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization
🔸 This paper was published at EMNLP 2021.
🔸 Story visualization is an emerging research area with several potentially interesting applications, such as visualizing educational materials and assisting artists with webcomic creation. Each story consists of a sequence of images paired with a sequence of captions describing the content of the images. The goal of the task is to generate the images given the captions.
-------------------------------------------------------------------
IMPORTANCE
🔸 While much research has been done on text-to-image synthesis, little work has explored the linguistic structure of the input text. Such information is even more important for story visualization, since its inputs have an explicit narrative structure that needs to be translated into an image sequence (or visual story).
🔸 Prior work in this domain has shown that there is ample room for improvement in the generated image sequences in terms of visual quality, consistency, and relevance. In this paper, the authors first explore the use of constituency parse trees, encoded with a Transformer-based recurrent architecture, as structured input.
🔸 Second, they augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
🔸 Third, they incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images within a dual learning setup. They show that off-the-shelf dense-captioning models trained on Visual Genome can improve the spatial structure of images from a different target domain without fine-tuning. The model is trained end-to-end using an intra-story contrastive loss (between words and image sub-regions) and shows significant improvements on several metrics (and in human evaluation) across multiple datasets. Finally, the paper provides an analysis of the linguistic and visuospatial information.
-------------------------------------------------------------------
#computervision #artificialintelligence #innovation
-------------------------------------------------------------------
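To give an intuition for the intra-story contrastive objective mentioned above, here is a minimal NumPy sketch of an InfoNCE-style contrastive loss between word embeddings and image sub-region embeddings. This is an illustrative toy, not the authors' implementation: the function name, the temperature value, and the random toy embeddings are all assumptions for the example.

```python
import numpy as np

def word_region_contrastive_loss(words, regions, temperature=0.1):
    """Toy InfoNCE-style contrastive loss (illustrative, not the paper's code).

    words:   (N, D) array, one embedding per word
    regions: (N, D) array, the matching image sub-region embedding per word
    The matching (word_i, region_i) pairs are positives; all other regions
    in the batch act as negatives for word_i.
    """
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    logits = (w @ r.T) / temperature                 # (N, N) cosine similarities
    # Softmax cross-entropy with the diagonal as the correct class.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diagonal(log_probs))

rng = np.random.default_rng(0)
words = rng.normal(size=(8, 16))
aligned = words + 0.01 * rng.normal(size=(8, 16))    # regions matching their words
shuffled = words[::-1].copy()                        # deliberately misaligned regions
loss_aligned = word_region_contrastive_loss(words, aligned)
loss_shuffled = word_region_contrastive_loss(words, shuffled)
print(loss_aligned < loss_shuffled)                  # aligned pairs get a lower loss
```

The loss is small when each word is most similar to its own sub-region and large otherwise, which is the pressure that encourages generated image regions to match the caption words.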

[Attached: diagram]
