{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,10]],"date-time":"2026-02-10T10:56:19Z","timestamp":1770720979541,"version":"3.49.0"},"reference-count":63,"publisher":"Association for Computing Machinery (ACM)","issue":"2","funder":[{"DOI":"10.13039\/501100003995","name":"Anhui Provincial Natural Science Foundation","doi-asserted-by":"crossref","award":["2308085MF220"],"award-info":[{"award-number":["2308085MF220"]}],"id":[{"id":"10.13039\/501100003995","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Anhui University Natural Science Foundation","award":["2023AH050914"],"award-info":[{"award-number":["2023AH050914"]}]}],"content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["ACM Trans. Multimedia Comput. Commun. Appl."],"published-print":{"date-parts":[[2026,2,28]]},"abstract":"<jats:p>\n                    Sarcasm in social media, frequently conveyed through the interplay of text and images, presents significant challenges for sentiment analysis and intention mining. Existing multi-modal sarcasm detection approaches have been shown to excessively depend on superficial cues within the textual modality, exhibiting limited capability to accurately discern sarcasm through subtle text\u2013image interactions. To address this limitation, a novel framework, InterCLIP-MEP, is proposed. This framework integrates Interactive CLIP (InterCLIP), which employs an efficient training strategy to derive enriched cross-modal representations by embedding inter-modal information directly into each encoder, while using approximately 20.6\n                    <jats:inline-formula content-type=\"math\/tex\">\n                      <jats:tex-math notation=\"LaTeX\" version=\"MathJax\">\\(\\times\\)<\/jats:tex-math>\n                    <\/jats:inline-formula>\n                    fewer trainable parameters compared with existing state-of-the-art (SOTA) methods. 
Furthermore, a Memory-Enhanced Predictor (MEP) is introduced, featuring a dynamic dual-channel memory mechanism that captures and retains valuable knowledge from test samples during inference, serving as a nonparametric classifier to enhance sarcasm detection robustness. Extensive experiments on MMSD, MMSD2.0, and DocMSU show that InterCLIP-MEP achieves SOTA performance, specifically improving accuracy by 1.08% and F1-score by 1.51% on MMSD2.0. Under distributional shift evaluation, it attains 73.96% accuracy, exceeding its memory-free variant by nearly 10% and the previous SOTA by over 15%, demonstrating superior stability and adaptability. The implementation of InterCLIP-MEP is publicly available at\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/CoderChen01\/InterCLIP-MEP\">https:\/\/github.com\/CoderChen01\/InterCLIP-MEP<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1145\/3776561","type":"journal-article","created":{"date-parts":[[2025,11,12]],"date-time":"2025-11-12T14:19:23Z","timestamp":1762957163000},"page":"1-23","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":3,"title":["InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-Modal Sarcasm Detection"],"prefix":"10.1145","volume":"22","author":[{"ORCID":"https:\/\/orcid.org\/0009-0001-5288-048X","authenticated-orcid":false,"given":"Junjie","family":"Chen","sequence":"first","affiliation":[{"name":"School of Computer and Information, Anhui Polytechnic University, Wuhu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-3444-9992","authenticated-orcid":false,"given":"Hang","family":"Yu","sequence":"additional","affiliation":[{"name":"Shanghai University, Shanghai, 
China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0003-1886-4192","authenticated-orcid":false,"given":"Subin","family":"Huang","sequence":"additional","affiliation":[{"name":"School of Computer and Information, Anhui Polytechnic University, Wuhu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-0399-7737","authenticated-orcid":false,"given":"Sanmin","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Computer and Information, Anhui Polytechnic University, Wuhu, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-3341-183X","authenticated-orcid":false,"given":"Linfeng","family":"Zhang","sequence":"additional","affiliation":[{"name":"Shanghai Jiao Tong University, Shanghai, China"}]}],"member":"320","published-online":{"date-parts":[[2026,2,9]]},"reference":[{"key":"e_1_3_2_2_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/K16-1017"},{"key":"e_1_3_2_3_2","doi-asserted-by":"publisher","DOI":"10.1111\/j.1467-9280.1991.tb00174.x"},{"key":"e_1_3_2_4_2","doi-asserted-by":"publisher","DOI":"10.1016\/S0378-2166(99)00070-3"},{"key":"e_1_3_2_5_2","doi-asserted-by":"publisher","DOI":"10.1016\/S1364-6613(00)01538-2"},{"key":"e_1_3_2_6_2","first-page":"574","volume-title":"Proceedings of the International AAAI Conference on Web and Social Media","volume":"9","author":"Bamman David","year":"2015","unstructured":"David Bamman and Noah Smith. 2015. Contextualized sarcasm detection on twitter. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 
9, 574\u2013577."},{"key":"e_1_3_2_7_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/S18-1100"},{"key":"e_1_3_2_8_2","doi-asserted-by":"publisher","DOI":"10.1109\/GLOCOM.2015.7417640"},{"key":"e_1_3_2_9_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P19-1239"},{"key":"e_1_3_2_10_2","first-page":"4171","volume-title":"Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT \u201919), Long and Short Papers","volume":"1","author":"Devlin Jacob","year":"2019","unstructured":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT \u201919), Vol. 1, Long and Short Papers, 4171\u20134186."},{"key":"e_1_3_2_11_2","doi-asserted-by":"publisher","DOI":"10.1109\/ACAIT56212.2022.10137937"},{"key":"e_1_3_2_12_2","first-page":"1","volume-title":"Proceedings of the 9th International Conference on Learning Representations (ICLR \u201921)","author":"Dosovitskiy Alexey","year":"2021","unstructured":"Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. 
In Proceedings of the 9th International Conference on Learning Representations (ICLR \u201921), 1\u201321."},{"key":"e_1_3_2_13_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i16.29748"},{"key":"e_1_3_2_14_2","doi-asserted-by":"publisher","DOI":"10.24963\/ijcai.2024\/887"},{"key":"e_1_3_2_15_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.01315"},{"key":"e_1_3_2_16_2","doi-asserted-by":"publisher","DOI":"10.1037\/0096-3445.115.1.3"},{"key":"e_1_3_2_17_2","doi-asserted-by":"publisher","DOI":"10.4324\/9781410616685"},{"key":"e_1_3_2_18_2","doi-asserted-by":"publisher","DOI":"10.1016\/0378-2166(91)90101-3"},{"key":"e_1_3_2_19_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2016.90"},{"key":"e_1_3_2_20_2","first-page":"1","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Hu Edward J.","year":"2022","unstructured":"Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In Proceedings of the International Conference on Learning Representations, 1\u201313."},{"key":"e_1_3_2_21_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.635"},{"key":"e_1_3_2_22_2","doi-asserted-by":"publisher","DOI":"10.1145\/3124420"},{"key":"e_1_3_2_23_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/D16-1104"},{"key":"e_1_3_2_24_2","first-page":"5583","volume-title":"Proceedings of the 38th International Conference on Machine Learning (ICML \u201921)","volume":"139","author":"Kim Wonjae","year":"2021","unstructured":"Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML \u201921), Vol. 
139, 5583\u20135594."},{"key":"e_1_3_2_25_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/D14-1181"},{"key":"e_1_3_2_26_2","unstructured":"Jiahao Li Greg Shakhnarovich and Raymond A. Yeh. 2022. Adapting CLIP for phrase localization without further training. arXiv:2204.03647. Retrieved from http:\/\/arxiv.org\/abs\/2204.03647"},{"key":"e_1_3_2_27_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2024.3380375"},{"key":"e_1_3_2_28_2","doi-asserted-by":"publisher","DOI":"10.1145\/3474085.3475190"},{"key":"e_1_3_2_29_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.acl-long.124"},{"key":"e_1_3_2_30_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00682"},{"key":"e_1_3_2_31_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2022.emnlp-main.333"},{"key":"e_1_3_2_32_2","unstructured":"Yinhan Liu Myle Ott Naman Goyal Jingfei Du Mandar Joshi Danqi Chen Omer Levy Mike Lewis Luke Zettlemoyer and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from http:\/\/arxiv.org\/abs\/1907.11692."},{"key":"e_1_3_2_33_2","doi-asserted-by":"publisher","DOI":"10.1109\/TAFFC.2023.3279145"},{"key":"e_1_3_2_34_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV48922.2021.00986"},{"key":"e_1_3_2_35_2","first-page":"1","volume-title":"Proceedings of the 7th International Conference on Learning Representation (ICLR \u201919)","author":"Loshchilov Ilya","year":"2019","unstructured":"Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. 
In Proceedings of the 7th International Conference on Learning Representation (ICLR \u201919), 1\u201318."},{"key":"e_1_3_2_36_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.findings-emnlp.124"},{"key":"e_1_3_2_37_2","doi-asserted-by":"publisher","DOI":"10.1561\/1500000011"},{"key":"e_1_3_2_38_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.findings-acl.689"},{"key":"e_1_3_2_39_2","first-page":"8748","volume-title":"Proceedings of the International Conference on Machine Learning","author":"Radford Alec","year":"2021","unstructured":"Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, 8748\u20138763."},{"key":"e_1_3_2_40_2","doi-asserted-by":"publisher","DOI":"10.1145\/2964284.2964321"},{"key":"e_1_3_2_41_2","doi-asserted-by":"publisher","DOI":"10.1109\/ICCV.2017.74"},{"key":"e_1_3_2_42_2","unstructured":"Burr Settles. 2009. Active Learning Literature Survey. Technical Report University of Wisconsin\u2013Madison Department of Computer Sciences. Retrieved from https:\/\/minds.wisconsin.edu\/handle\/1793\/60660"},{"key":"e_1_3_2_43_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.tics.2015.05.004"},{"key":"e_1_3_2_44_2","first-page":"2440","volume-title":"Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015","author":"Sukhbaatar Sainbayar","year":"2015","unstructured":"Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. 
In Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, 2440\u20132448."},{"key":"e_1_3_2_45_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2024.naacl-long.97"},{"key":"e_1_3_2_46_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2023.acl-long.139"},{"key":"e_1_3_2_47_2","unstructured":"Mariya Toneva Alessandro Sordoni Remi Tachet Des Combes Adam Trischler Yoshua Bengio and Geoffrey J. Gordon. 2018. An empirical study of example forgetting during deep neural network learning. arXiv:1812.05159. Retrieved from https:\/\/arxiv.org\/abs\/1812.05159"},{"key":"e_1_3_2_48_2","first-page":"1","volume-title":"Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM \u201910)","author":"Tsur Oren","year":"2010","unstructured":"Oren Tsur, Dmitry Davidov, and Ari Rappoport. 2010. ICWSM\u2014A great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM \u201910), 1\u20138."},{"key":"e_1_3_2_49_2","doi-asserted-by":"publisher","DOI":"10.3115\/v1\/P14-2084"},{"key":"e_1_3_2_50_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.inffus.2023.102132"},{"key":"e_1_3_2_51_2","doi-asserted-by":"crossref","unstructured":"Mengyu Wang Zhenyu Liu Kun Li Yu Wang Yuwei Wang Yanyan Wei and Fei Wang. 2025. Task-generalized adaptive cross-domain learning for multimodal image fusion. arXiv:2508.15505. 
Retrieved from https:\/\/arxiv.org\/abs\/2508.15505","DOI":"10.1109\/TMM.2026.3660142"},{"key":"e_1_3_2_52_2","doi-asserted-by":"publisher","DOI":"10.1145\/3581783.3612490"},{"key":"e_1_3_2_53_2","doi-asserted-by":"publisher","DOI":"10.1609\/aaai.v38i8.28766"},{"key":"e_1_3_2_54_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52729.2023.00250"},{"key":"e_1_3_2_55_2","first-page":"2540","volume-title":"Proceedings of the International Conference on Learning Representations","author":"Weston J.","year":"2014","unstructured":"J. Weston, S. Chopra, and Antoine Bordes. 2014. Memory networks. In Proceedings of the International Conference on Learning Representations, 2540\u20132550."},{"key":"e_1_3_2_56_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.emnlp-demos.6"},{"key":"e_1_3_2_57_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2018.00393"},{"key":"e_1_3_2_58_2","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313735"},{"key":"e_1_3_2_59_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/2020.acl-main.349"},{"key":"e_1_3_2_60_2","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR52733.2024.02713"},{"key":"e_1_3_2_61_2","doi-asserted-by":"publisher","DOI":"10.18653\/v1\/P16-2034"},{"issue":"1","key":"e_1_3_2_62_2","first-page":"273","article-title":"Label independent memory for semi-supervised few-shot video classification","volume":"44","author":"Zhu Linchao","year":"2020","unstructured":"Linchao Zhu and Yi Yang. 2020. Label independent memory for semi-supervised few-shot video classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 1 (2020), 273\u2013285.","journal-title":"IEEE Transactions on Pattern Analysis and Machine Intelligence"},{"key":"e_1_3_2_63_2","doi-asserted-by":"crossref","unstructured":"Xingjie Zhuang Zhixin Li Fengling Zhou Jingliang Gu Canlong Zhang and Huifang Ma. 2025. DyCR-Net: A dynamic context-aware routing network for multi-modal sarcasm detection in conversation. 
Knowledge-Based Systems 310 (2025) 113029.","DOI":"10.1016\/j.knosys.2025.113029"},{"key":"e_1_3_2_64_2","doi-asserted-by":"publisher","DOI":"10.1145\/3722115"}],"container-title":["ACM Transactions on Multimedia Computing, Communications, and Applications"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3776561","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T14:57:34Z","timestamp":1770649054000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.1145\/3776561"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,9]]},"references-count":63,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2026,2,28]]}},"alternative-id":["10.1145\/3776561"],"URL":"https:\/\/doi.org\/10.1145\/3776561","relation":{},"ISSN":["1551-6857","1551-6865"],"issn-type":[{"value":"1551-6857","type":"print"},{"value":"1551-6865","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,9]]},"assertion":[{"value":"2025-04-30","order":0,"name":"received","label":"Received","group":{"name":"publication_history","label":"Publication History"}},{"value":"2025-11-05","order":2,"name":"accepted","label":"Accepted","group":{"name":"publication_history","label":"Publication History"}},{"value":"2026-02-09","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}