Abstract
Action anticipation, which aims to forecast future activities from partially observed sequences, plays a crucial role in many computer vision applications. Traditional methods rely primarily on visual cues, limiting their ability to capture long-term dependencies and contextual semantics. This paper introduces the Semantic-Guided Adaptive Fusion Transformer (SAFT), a novel framework that integrates visual and textual information through three components: a Visual Transformer Anticipation Module, a Sequential Context Correction Module, and an Adaptive Fusion Control Module. Experiments on benchmark datasets show that SAFT outperforms state-of-the-art methods in most experimental configurations.
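The abstract does not spell out how the Adaptive Fusion Control Module combines modalities, but a common form of adaptive fusion is a learned sigmoid gate that weighs visual against textual features per dimension. The sketch below illustrates that general pattern only; the function name, shapes, and random parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(v, t, W, b):
    """Gated fusion sketch: a per-dimension gate in (0, 1) is computed from
    the concatenated features, then used as a convex combination weight."""
    g = sigmoid(np.concatenate([v, t]) @ W + b)  # gate, shape (d,)
    return g * v + (1.0 - g) * t                 # fused feature, shape (d,)

# Toy inputs (hypothetical dimensions; in practice W, b would be learned)
rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)              # stand-in for a visual feature
t = rng.normal(size=d)              # stand-in for a textual feature
W = rng.normal(size=(2 * d, d)) * 0.1
b = np.zeros(d)
fused = adaptive_fusion(v, t, W, b)
```

Because the gate lies strictly in (0, 1), each fused dimension is a convex combination of the two modalities, so the output always stays between the visual and textual feature values.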
Data availability
No datasets were generated or analysed during the current study.
Author information
Authors and Affiliations
Contributions
YYJ, YLX, and HMF participated in the conceptualization and design of the study and were responsible for data collection and preliminary analysis. KM, as the corresponding author, supervised the overall research process, provided theoretical guidance, and critically revised the manuscript for important intellectual content. JZ contributed to the methodology design and software implementation. XLX assisted with data validation and provided industry-relevant insights. All authors (YYJ, YLX, HMF, KM, JZ, XLX) reviewed and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jing, Y., Xiao, Y., Fang, H. et al. Multimodal adaptive fusion for enhanced long-term action anticipation. Machine Vision and Applications 37, 15 (2026). https://doi.org/10.1007/s00138-025-01774-w
