Abstract
Action anticipation, which aims to forecast future activities from partially observed sequences, plays a crucial role in many computer vision applications. Traditional methods rely primarily on visual cues, limiting their ability to capture long-term dependencies and contextual semantics. This paper introduces the Semantic-Guided Adaptive Fusion Transformer (SAFT), a novel framework that integrates visual and textual information through three components: a Visual Transformer Anticipation Module, a Sequential Context Correction Module, and an Adaptive Fusion Control Module. Experiments on benchmark datasets show that SAFT outperforms state-of-the-art methods in most experimental configurations.
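The abstract does not spell out how the Adaptive Fusion Control Module combines modalities, but a common form of adaptive fusion is a learned sigmoid gate that weighs visual against textual features per dimension. The sketch below illustrates that general pattern only; the function name, shapes, and random parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(v, t, W, b):
    """Gated fusion sketch: a per-dimension gate in (0, 1) is computed from
    the concatenated features, then used as a convex combination weight."""
    g = sigmoid(np.concatenate([v, t]) @ W + b)  # gate, shape (d,)
    return g * v + (1.0 - g) * t                 # fused feature, shape (d,)

# Toy inputs (hypothetical dimensions; in practice W, b would be learned)
rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)              # stand-in for a visual feature
t = rng.normal(size=d)              # stand-in for a textual feature
W = rng.normal(size=(2 * d, d)) * 0.1
b = np.zeros(d)
fused = adaptive_fusion(v, t, W, b)
```

Because the gate lies strictly in (0, 1), each fused dimension is a convex combination of the two modalities, so the output always stays between the visual and textual feature values.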
Data availability
No datasets were generated or analysed during the current study.
Author information
Authors and Affiliations
Contributions
YYJ, YLX, and HMF participated in the conceptualization and design of the study and were responsible for data collection and preliminary analysis. KM, as the corresponding author, supervised the overall research process, provided theoretical guidance, and critically revised the manuscript for important intellectual content. JZ contributed to the methodology design and software implementation. XLX assisted with data validation and provided industry-relevant insights. All authors (YYJ, YLX, HMF, KM, JZ, XLX) reviewed and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jing, Y., Xiao, Y., Fang, H. et al. Multimodal adaptive fusion for enhanced long-term action anticipation. Machine Vision and Applications 37, 15 (2026). https://doi.org/10.1007/s00138-025-01774-w
