{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,1,26]],"date-time":"2026-01-26T02:34:50Z","timestamp":1769394890213,"version":"3.49.0"},"reference-count":50,"publisher":"Institution of Engineering and Technology (IET)","issue":"4","license":[{"start":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T00:00:00Z","timestamp":1675728000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/creativecommons.org\/licenses\/by-nc-nd\/4.0\/"}],"funder":[{"DOI":"10.13039\/501100004763","name":"Natural Science Foundation of Inner Mongolia","doi-asserted-by":"publisher","award":["2020MS06025"],"award-info":[{"award-number":["2020MS06025"]}],"id":[{"id":"10.13039\/501100004763","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100001809","name":"National Natural Science Foundation of China","doi-asserted-by":"publisher","award":["62062055"],"award-info":[{"award-number":["62062055"]}],"id":[{"id":"10.13039\/501100001809","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/501100015397","name":"Department of Science and Technology of Inner Mongolia","doi-asserted-by":"publisher","award":["2019GG372"],"award-info":[{"award-number":["2019GG372"]}],"id":[{"id":"10.13039\/501100015397","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["ietresearch.onlinelibrary.wiley.com"],"crossmark-restriction":true},"short-container-title":["IET Computer Vision"],"published-print":{"date-parts":[[2023,6]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>Multimodal abstractive summarisation (MAS) aims to generate a textual summary from multimodal data collection, such as video\u2010text pairs. Despite the success of recent work, the existing methods lack a thorough analysis for consistency across multimodal data. 
Besides, previous work relies on fusion methods to extract multimodal semantics, neglecting constraints on the complementary semantics of each modality. To address these issues, a multilayer cross\u2010fusion model with a reconstructor for the MAS task is proposed. The proposed model thoroughly conducts cross\u2010fusion for each modality via layers of cross\u2010modal transformer blocks, yielding cross\u2010modal fusion representations that are consistent across modalities. The reconstructor is then employed to reproduce the source modalities from the cross\u2010modal fusion representations; this reconstruction process constrains the fusion representations to preserve the complementary semantics of each modality. Comprehensive comparison and ablation experiments are conducted on the open domain multimodal dataset How2. The results empirically verify the effectiveness of the multilayer cross\u2010fusion with reconstructor structure in the proposed model.<\/jats:p>","DOI":"10.1049\/cvi2.12173","type":"journal-article","created":{"date-parts":[[2023,2,7]],"date-time":"2023-02-07T05:29:05Z","timestamp":1675747745000},"page":"389-403","update-policy":"https:\/\/doi.org\/10.1002\/crossmark_policy","source":"Crossref","is-referenced-by-count":4,"title":["MCR: Multilayer cross\u2010fusion with reconstructor for multimodal abstractive summarisation"],"prefix":"10.1049","volume":"17","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7937-917X","authenticated-orcid":false,"given":"Jingshu","family":"Yuan","sequence":"first","affiliation":[{"name":"College of Data Science and Application Inner Mongolia University of Technology  Huhhot China"},{"name":"Inner Mongolia Autonomous Region Engineering &amp; Technology Research Center of Big Data Based Software Service  Huhhot China"}]},{"given":"Jing","family":"Yun","sequence":"additional","affiliation":[{"name":"College of Data Science and Application Inner Mongolia University of Technology  Huhhot China"},{"name":"Inner Mongolia 
Autonomous Region Engineering &amp; Technology Research Center of Big Data Based Software Service  Huhhot China"}]},{"given":"Bofei","family":"Zheng","sequence":"additional","affiliation":[{"name":"College of Data Science and Application Inner Mongolia University of Technology  Huhhot China"},{"name":"Inner Mongolia Autonomous Region Engineering &amp; Technology Research Center of Big Data Based Software Service  Huhhot China"}]},{"given":"Lei","family":"Jiao","sequence":"additional","affiliation":[{"name":"College of Data Science and Application Inner Mongolia University of Technology  Huhhot China"},{"name":"Inner Mongolia Autonomous Region Engineering &amp; Technology Research Center of Big Data Based Software Service  Huhhot China"}]},{"given":"Limin","family":"Liu","sequence":"additional","affiliation":[{"name":"College of Data Science and Application Inner Mongolia University of Technology  Huhhot China"}]}],"member":"265","published-online":{"date-parts":[[2023,2,7]]},"reference":[{"key":"e_1_2_11_2_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1659"},{"key":"e_1_2_11_3_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.neucom.2021.04.072"},{"key":"e_1_2_11_4_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-main.144"},{"key":"e_1_2_11_5_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v33i01.3301126"},{"key":"e_1_2_11_6_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00795"},{"key":"e_1_2_11_7_1","first-page":"6558","volume-title":"Proceedings of the Conference. Association for Computational Linguistics. Meeting","author":"Hubert Tsai Y.\u2010H.","year":"2019"},{"key":"e_1_2_11_8_1","unstructured":"Sanabria R. et\u00a0al.:How2: a large\u2010scale dataset for multimodal language understanding. 
arXiv preprint arXiv:1811.00347 (2018)"},{"key":"e_1_2_11_9_1","doi-asserted-by":"publisher","DOI":"10.1109\/SKG49510.2019.00029"},{"key":"e_1_2_11_10_1","doi-asserted-by":"publisher","DOI":"10.1049\/cvi2.12087"},{"key":"e_1_2_11_11_1","first-page":"958","volume-title":"Proceedings of the IEEE\/CVF Conference on Computer Vision and Pattern Recognition Workshops","author":"Iashin V.","year":"2020"},{"key":"e_1_2_11_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3479207"},{"key":"e_1_2_11_13_1","doi-asserted-by":"publisher","DOI":"10.1109\/SKG.2018.00033"},{"key":"e_1_2_11_14_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ipm.2019.102123"},{"key":"e_1_2_11_15_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D17-1114"},{"key":"e_1_2_11_16_1","doi-asserted-by":"publisher","DOI":"10.1109\/tkde.2018.2848260"},{"key":"e_1_2_11_17_1","doi-asserted-by":"publisher","DOI":"10.3390\/app11115260"},{"key":"e_1_2_11_18_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.519"},{"key":"e_1_2_11_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-45442-5_24"},{"key":"e_1_2_11_20_1","doi-asserted-by":"publisher","DOI":"10.1145\/3445794"},{"key":"e_1_2_11_21_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2021.emnlp-main.326"},{"key":"e_1_2_11_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00725"},{"key":"e_1_2_11_23_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.coling-main.496"},{"key":"e_1_2_11_24_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1448"},{"key":"e_1_2_11_25_1","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D18-1438"},{"key":"e_1_2_11_26_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i05.6525"},{"key":"e_1_2_11_27_1","doi-asserted-by":"crossref","unstructured":"Khullar A. Arora U.:MAST: multimodal abstractive summarization with trimodal hierarchical attention. 
arXiv preprint arXiv:2010.08021 (2020)","DOI":"10.18653\/v1\/2020.nlpbt-1.7"},{"key":"e_1_2_11_28_1","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475321"},{"key":"e_1_2_11_29_1","first-page":"9694","article-title":"Align before fuse: vision and language representation learning with momentum distillation","volume":"34","author":"Li J.","year":"2021","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_2_11_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/tpami.2022.3152247"},{"key":"e_1_2_11_31_1","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v34i07.6766"},{"key":"e_1_2_11_32_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00033"},{"key":"e_1_2_11_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPRW53098.2021.00374"},{"key":"e_1_2_11_34_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2019.00756"},{"key":"e_1_2_11_35_1","unstructured":"Hu X. et\u00a0al.:VLM: task\u2010agnostic video\u2010language model pre\u2010training for video understanding. arXiv preprint arXiv:2105.09996 (2021)"},{"key":"e_1_2_11_36_1","unstructured":"Sun C. et\u00a0al.:Learning video representations using contrastive bidirectional transformer. arXiv preprint arXiv:1906.05743 (2019)"},{"key":"e_1_2_11_37_1","first-page":"13","article-title":"ViLBERT: pretraining task\u2010agnostic visiolinguistic representations for vision\u2010and\u2010language tasks","volume":"32","author":"Lu J.","year":"2019","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"e_1_2_11_38_1","unstructured":"Luo H. et\u00a0al.:UniVL: a unified video and language pre\u2010training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353 (2020)"},{"key":"e_1_2_11_39_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00685"},{"key":"e_1_2_11_40_1","unstructured":"Kay W. et\u00a0al.:The kinetics human action video dataset. 
arXiv preprint arXiv:1705.06950 (2017)"},{"key":"e_1_2_11_41_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICSCC.2019.8843652"},{"key":"e_1_2_11_42_1","doi-asserted-by":"crossref","unstructured":"See A. Liu P.J. Manning C.D.:Get to the point: summarization with pointer\u2010generator networks. arXiv preprint arXiv:1704.04368 (2017)","DOI":"10.18653\/v1\/P17-1099"},{"key":"e_1_2_11_43_1","doi-asserted-by":"crossref","unstructured":"Luong M.\u2010T. Pham H. Manning C.D.:Effective approaches to attention\u2010based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)","DOI":"10.18653\/v1\/D15-1166"},{"key":"e_1_2_11_44_1","first-page":"6000","volume-title":"Advances in Neural Information Processing Systems","author":"Vaswani A.","year":"2017"},{"key":"e_1_2_11_45_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00911"},{"key":"e_1_2_11_46_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR46437.2021.00553"},{"key":"e_1_2_11_47_1","unstructured":"Kingma D.P. Jimmy B.:Adam: a method for stochastic optimization. 
arXiv preprint arXiv:1412.6980 (2014)"},{"key":"e_1_2_11_48_1","first-page":"74","volume-title":"Text Summarization Branches Out","author":"Lin C.\u2010Y.","year":"2004"},{"key":"e_1_2_11_49_1","first-page":"311","volume-title":"Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics","author":"Kishore P.","year":"2002"},{"key":"e_1_2_11_50_1","volume-title":"Workshop on Statistical Machine Translation","author":"Denkowski M.","year":"2011"},{"key":"e_1_2_11_51_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2015.7299087"}],"container-title":["IET Computer Vision"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/cvi2.12173","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/full-xml\/10.1049\/cvi2.12173","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/pdf\/10.1049\/cvi2.12173","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,27]],"date-time":"2025-10-27T09:54:58Z","timestamp":1761558898000},"score":1,"resource":{"primary":{"URL":"https:\/\/ietresearch.onlinelibrary.wiley.com\/doi\/10.1049\/cvi2.12173"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,2,7]]},"references-count":50,"journal-issue":{"issue":"4","published-print":{"date-parts":[[2023,6]]}},"alternative-id":["10.1049\/cvi2.12173"],"URL":"https:\/\/doi.org\/10.1049\/cvi2.12173","archive":["Portico"],"relation":{},"ISSN":["1751-9632","1751-9640"],"issn-type":[{"value":"1751-9632","type":"print"},{"value":"1751-9640","type":"electronic"}],"subject":[],"published":{"date-parts":[[2023,2,7]]},"assertion":[{"value":"2022-05-17","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"P
ublication History"}},{"value":"2023-01-08","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2023-02-07","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}