{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,29]],"date-time":"2025-12-29T17:38:41Z","timestamp":1767029921312,"version":"3.48.0"},"reference-count":73,"publisher":"Institution of Engineering and Technology (IET)","issue":"1","license":[{"start":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T00:00:00Z","timestamp":1748476800000},"content-version":"vor","delay-in-days":148,"URL":"http:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100012456","name":"National Social Science Fund of China","doi-asserted-by":"publisher","award":["23BJL035"],"award-info":[{"award-number":["23BJL035"]}],"id":[{"id":"10.13039\/501100012456","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100010906","name":"NSAF Joint Fund","doi-asserted-by":"publisher","award":["62192783"],"award-info":[{"award-number":["62192783"]}],"id":[{"id":"10.13039\/501100010906","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100010906","name":"NSAF Joint Fund","doi-asserted-by":"publisher","award":["62376117"],"award-info":[{"award-number":["62376117"]}],"id":[{"id":"10.13039\/501100010906","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["IET Computer Vision"],"published-print":{"date-parts":[[2025,1]]},"abstract":"<jats:title>ABSTRACT<\/jats:title>\n                  <jats:p>The visual dialogue task requires computers to comprehend image content and preceding question\u2010and\u2010answer history to accurately answer related questions, with each round of dialogue providing the necessary historical context for subsequent interactions. Existing research typically processes multiple questions related to a single image as independent samples, which results in redundant modelling of the images and their captions and substantially increases computational costs. To address the challenges above, we introduce a fast transformer for visual dialogue, termed FastVDT, which utilises novel attention masks and continuous positional encoding. FastVDT models multiple image\u2010related questions as an integrated entity, accurately processing prior conversation history in each dialogue round while predicting answers to multiple questions. Our method effectively captures the interrelations among questions and significantly reduces computational overhead. Experimental results demonstrate that our method delivers outstanding performance on the VisDial v0.9 and v1.0 datasets. FastVDT achieves comparable performance to VD\u2010BERT and VU\u2010BERT while reducing computational costs by 80% and 56%, respectively.<\/jats:p>","DOI":"10.1049\/cvi2.70022","type":"journal-article","created":{"date-parts":[[2025,5,29]],"date-time":"2025-05-29T10:04:43Z","timestamp":1748513083000},"update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["FastVDT: Fast Transformer With Optimised Attention Masks and Positional Encoding for Visual Dialogue"],"prefix":"10.1049","volume":"19","author":[{"ORCID":"https:\/\/orcid.org\/0009-0002-7937-8526","authenticated-orcid":false,"given":"Qiangqiang","family":"He","sequence":"first","affiliation":[{"name":"State Key Laboratory for Novel Software Technology Nanjing University  Nanjing China"},{"name":"Department of Computer Science and Technology Nanjing University  Nanjing China"}]},{"given":"Shuwei","family":"Qian","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology Nanjing University  Nanjing China"},{"name":"Department of Computer Science and Technology Nanjing University  Nanjing China"}]},{"given":"Chongjun","family":"Wang","sequence":"additional","affiliation":[{"name":"State Key Laboratory for Novel Software Technology Nanjing University  Nanjing China"},{"name":"Department of Computer Science and Technology Nanjing University  Nanjing China"}]}],"member":"265","published-online":{"date-parts":[[2025,5,29]]},"reference":[{"key":"e_1_2_10_2_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2017.121"},{"key":"e_1_2_10_3_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2015.279"},{"key":"e_1_2_10_4_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.biosx.2022.100265"},{"key":"e_1_2_10_5_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-319-93846-2_77"},{"key":"e_1_2_10_6_1","first-page":"730","volume-title":"Proceedings of the 3rd IAPR Asian Conference on Pattern Recognition","author":"Simonyan K.","year":"2015"},{"key":"e_1_2_10_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_2_10_8_1","article-title":"Faster R\u2010CNN: Towards Real\u2010Time Object Detection With Region Proposal Networks","volume":"28","author":"Ren S.","year":"2015","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_10_9_1","doi-asserted-by":"publisher","DOI":"10.1162\/neco.1997.9.8.1735"},{"key":"e_1_2_10_10_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1648"},{"key":"e_1_2_10_11_1","first-page":"4989","volume-title":"Proceedings of the 28th International Joint Conference on Artificial Intelligence","author":"G D.","year":"2019"},{"key":"e_1_2_10_12_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.728"},{"key":"e_1_2_10_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/tip.2020.2992888"},{"key":"e_1_2_10_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.102152"},{"key":"e_1_2_10_15_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.imavis.2021.104316"},{"key":"e_1_2_10_16_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.20"},{"key":"e_1_2_10_17_1","doi-asserted-by":"publisher","DOI":"10.1109\/tpami.2021.3085755"},{"key":"e_1_2_10_18_1","article-title":"Attention Is All You Need","volume":"30","author":"Vaswani A.","year":"2017","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_10_19_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.269"},{"key":"e_1_2_10_20_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICASSP43922.2022.9746098"},{"key":"e_1_2_10_21_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58586-0_14"},{"key":"e_1_2_10_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52688.2022.01757"},{"key":"e_1_2_10_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-acl.38"},{"key":"e_1_2_10_24_1","article-title":"ViLBERT: Pre Training Task\u2010Agnostic Visiolinguistic Representations for Vision\u2010and\u2010Language Tasks","volume":"32","author":"Lu J.","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_10_25_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_7"},{"key":"e_1_2_10_26_1","doi-asserted-by":"publisher","DOI":"10.1109\/IROS58592.2024.10802555"},{"key":"e_1_2_10_27_1","first-page":"11328","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Lei C.","year":"2020"},{"key":"e_1_2_10_28_1","first-page":"3306","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Li Z.","year":"2024"},{"volume-title":"NIPS 2014 Workshop on Deep Learning","year":"2014","author":"Chung J.","key":"e_1_2_10_29_1"},{"key":"e_1_2_10_30_1","first-page":"10055","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Guo D.","year":"2020"},{"key":"e_1_2_10_31_1","doi-asserted-by":"publisher","DOI":"10.1145\/3591106.3592272"},{"key":"e_1_2_10_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/TCSVT.2022.3207228"},{"key":"e_1_2_10_33_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1516"},{"key":"e_1_2_10_34_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.findings-emnlp.93"},{"key":"e_1_2_10_35_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.knosys.2023.110427"},{"key":"e_1_2_10_36_1","first-page":"4171","volume-title":"Proceedings of NAACL\u2010HLT","author":"Kenton J. D. M.\u2010W. C.","year":"2019"},{"volume-title":"Proceedings of the 9th International Conference on Learning Representations","year":"2021","author":"Dosovitskiy A.","key":"e_1_2_10_37_1"},{"key":"e_1_2_10_38_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1514"},{"key":"e_1_2_10_39_1","article-title":"VisualBERT: A Simple and Performant Baseline for Vision and Language","author":"Li L. H.","year":"2019","journal-title":"arXiv preprint arXiv:1908.03557"},{"volume-title":"International Conference on Learning Representations","year":"2019","author":"Su W.","key":"e_1_2_10_40_1"},{"key":"e_1_2_10_41_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-58577-8_8"},{"key":"e_1_2_10_42_1","first-page":"5583","volume-title":"International Conference on Machine Learning","author":"Kim W.","year":"2021"},{"key":"e_1_2_10_43_1","first-page":"336","volume-title":"European Conference on Computer Vision","author":"Murahari V.","year":"2020"},{"key":"e_1_2_10_44_1","first-page":"34892","article-title":"Visual Instruction Tuning","volume":"36","author":"Liu H.","year":"2023","journal-title":"Advances in Neural Information Processing Systems"},{"volume-title":"Proceedings of the 12th International Conference on Learning Representations","year":"2024","author":"Zhu D.","key":"e_1_2_10_45_1"},{"key":"e_1_2_10_46_1","first-page":"49250","volume-title":"Proceedings of the 37th International Conference on Neural Information Processing Systems","author":"Dai W.","year":"2023"},{"volume-title":"Proceedings of the 12th International Conference on Learning Representations","year":"2024","author":"Gao P.","key":"e_1_2_10_47_1"},{"key":"e_1_2_10_48_1","first-page":"10423","volume-title":"Proceedings of the 31st International Conference on Computational Linguistics","author":"Hou H.","year":"2025"},{"key":"e_1_2_10_49_1","unstructured":"A.Radford K.Narasimhan T.Salimans I.Sutskever et\u00a0al. Improving Language Understanding by Generative Pre\u2010Training OpenAI preprint (2018)."},{"issue":"8","key":"e_1_2_10_50_1","first-page":"9","article-title":"Language Models Are Unsupervised Multitask Learners","volume":"1","author":"Radford A.","year":"2019","journal-title":"OpenAI Blog"},{"key":"e_1_2_10_51_1","first-page":"1877","article-title":"Language Models Are Few\u2010Shot Learners","volume":"33","author":"Brown T.","year":"2020","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_10_52_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1285"},{"key":"e_1_2_10_53_1","article-title":"XLNet: Generalized Autoregressive Pretraining for Language Understanding","volume":"32","author":"Yang Z.","year":"2019","journal-title":"Advances in Neural Information Processing Systems"},{"key":"e_1_2_10_54_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1032"},{"key":"e_1_2_10_55_1","article-title":"An End\u2010to\u2010End Attention\u2010Based Approach for Learning on Graphs","author":"Buterez D.","year":"2024","journal-title":"arXiv:2402.10793"},{"key":"e_1_2_10_56_1","first-page":"18074","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Zhang H.","year":"2023"},{"key":"e_1_2_10_57_1","first-page":"1290","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Cheng B.","year":"2022"},{"key":"e_1_2_10_58_1","first-page":"14720","volume-title":"Proceedings of the AAAI Conference on Artificial Intelligence","author":"Bozorgtabar B.","year":"2023"},{"key":"e_1_2_10_59_1","first-page":"3162","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition","author":"Song Y.","year":"2024"},{"issue":"1","key":"e_1_2_10_60_1","first-page":"5485","article-title":"Exploring the Limits of Transfer Learning With a Unified Text\u2010to\u2010Text Transformer","volume":"21","author":"Raffel C.","year":"2020","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_10_61_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2023.127063"},{"key":"e_1_2_10_62_1","first-page":"6327","volume-title":"International Conference on Machine Learning","author":"Liu X.","year":"2020"},{"key":"e_1_2_10_63_1","doi-asserted-by":"publisher","DOI":"10.1109\/JSTARS.2024.3487846"},{"key":"e_1_2_10_64_1","article-title":"V2PE: Improving Multimodal Long\u2010Context Capability of Vision\u2010Language Models With Variable Visual Position Encoding","author":"Ge J.","year":"2024","journal-title":"arXiv preprint arXiv:2412.09616"},{"key":"e_1_2_10_65_1","doi-asserted-by":"publisher","DOI":"10.1162\/tacl_a_00065"},{"key":"e_1_2_10_66_1","article-title":"Layer Normalization","author":"Ba J. L.","year":"2016","journal-title":"arXiv preprint arXiv:1607.06450"},{"key":"e_1_2_10_67_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00214"},{"key":"e_1_2_10_68_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.93"},{"key":"e_1_2_10_69_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-01267-0_10"},{"key":"e_1_2_10_70_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00683"},{"key":"e_1_2_10_71_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00265"},{"key":"e_1_2_10_72_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D19-1209"},{"key":"e_1_2_10_73_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6769"},{"key":"e_1_2_10_74_1","doi-asserted-by":"publisher","DOI":"10.1145\/582415.582418"}],"container-title":["IET Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/cvi2.70022","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,29]],"date-time":"2025-12-29T17:34:53Z","timestamp":1767029693000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/cvi2.70022"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,1]]},"references-count":73,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2025,1]]}},"alternative-id":["10.1049\/cvi2.70022"],"URL":"https:\/\/doi.org\/10.1049\/cvi2.70022","archive":["Portico"],"relation":{},"ISSN":["1751-9632","1751-9640"],"issn-type":[{"type":"print","value":"1751-9632"},{"type":"electronic","value":"1751-9640"}],"subject":[],"published":{"date-parts":[[2025,1]]},"assertion":[{"value":"2024-07-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-04-07","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-05-29","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}],"article-number":"e70022"}}