
Multimodal adaptive fusion for enhanced long-term action anticipation

  • Research
  • Published in Machine Vision and Applications

Abstract

Action anticipation, which aims to forecast future activities from partially observed sequences, plays a crucial role in advancing computer vision applications. Traditional methods rely primarily on visual cues, limiting their ability to capture long-term dependencies and contextual semantics. This paper introduces the Semantic-Guided Adaptive Fusion Transformer (SAFT), a novel framework that integrates visual and textual information through a Visual Transformer Anticipation Module, a Sequential Context Correction Module, and an Adaptive Fusion Control Module. Experiments on benchmark datasets show that SAFT outperforms state-of-the-art methods in most experimental configurations.
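The abstract describes an Adaptive Fusion Control Module that blends visual and textual features, but does not specify its form. A common way to realize such adaptive multimodal fusion is a learned sigmoid gate that weighs each modality per feature dimension; the sketch below illustrates that generic pattern only, and the gating form, function name, and parameters are assumptions, not the paper's actual method.

```python
import numpy as np

def adaptive_fusion(visual, textual, w_gate, b_gate):
    """Gate-weighted fusion of visual and textual feature vectors.

    A sigmoid gate, computed from the concatenated features, decides
    per dimension how much each modality contributes to the fused output.
    Shapes: visual, textual -> (batch, d); w_gate -> (2*d, d); b_gate -> (d,).
    """
    concat = np.concatenate([visual, textual], axis=-1)      # (batch, 2*d)
    gate = 1.0 / (1.0 + np.exp(-(concat @ w_gate + b_gate)))  # sigmoid in (0, 1)
    # Convex, per-dimension combination of the two modalities.
    return gate * visual + (1.0 - gate) * textual

# Hypothetical usage with random features and untrained gate weights.
rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=(2, d))   # stand-in for visual embeddings
t = rng.normal(size=(2, d))   # stand-in for textual embeddings
w = rng.normal(size=(2 * d, d)) * 0.1
fused = adaptive_fusion(v, t, w, np.zeros(d))
```

Because the gate lies strictly in (0, 1), each fused coordinate stays between the corresponding visual and textual values, so neither modality can be fully discarded; in practice `w_gate` and `b_gate` would be trained end to end with the rest of the network.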


Fig. 1
Fig. 2
Fig. 3
Fig. 4


Data availability

No datasets were generated or analysed during the current study.


Author information

Authors and Affiliations

Authors

Contributions

YYJ, YLX, and HMF participated in the conceptualization and design of the study and were responsible for data collection and preliminary analysis. KM, as the corresponding author, supervised the overall research process, provided theoretical guidance, and critically revised the manuscript for important intellectual content. JZ contributed to the methodology design and software implementation. XLX assisted in data validation and provided industry-relevant insights. All authors (YYJ, YLX, HMF, KM, JZ, XLX) reviewed and approved the final manuscript.

Corresponding author

Correspondence to Keming Mao.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Jing, Y., Xiao, Y., Fang, H. et al. Multimodal adaptive fusion for enhanced long-term action anticipation. Machine Vision and Applications 37, 15 (2026). https://doi.org/10.1007/s00138-025-01774-w

