Abstract
Objective
Convolutional Neural Networks (CNNs) have become essential tools for classroom student behavior recognition but lack the capability to capture global information. In recent years, the Vision Transformer (ViT) has demonstrated strong global modeling capabilities, which can be employed to strengthen the multi-level spatial information representation of classroom student behavior recognition models.
Methods
First, a Cascaded Residual Vision Transformer (CR-ViT) model was proposed. The outputs of residual convolutional layers were fed into multiple ViT modules to learn both shallow and deep feature representations for multi-level spatial information extraction, followed by an LSTM network that further captures the global dependencies across the sequence of ViT module outputs. Second, a Cascaded Residual Vision Transformer with Morlet wavelet (MCR-ViT) was proposed. Building on CR-ViT, Morlet wavelet transform activation layers were employed to improve sensitivity to edge variations in feature maps.
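To make the wavelet activation concrete, the following is a minimal sketch (not the authors' implementation) of an elementwise real Morlet wavelet used as an activation function; the function name, the center frequency `omega0`, and the use of numpy are illustrative assumptions.

```python
import numpy as np

def morlet_activation(x, omega0=5.0):
    """Elementwise real Morlet wavelet applied as an activation:
    psi(x) = cos(omega0 * x) * exp(-x**2 / 2).
    Illustrative sketch only; omega0 = 5.0 is a common default
    for the real Morlet wavelet, not a value from the paper."""
    x = np.asarray(x, dtype=np.float64)
    return np.cos(omega0 * x) * np.exp(-x ** 2 / 2.0)

# Applied to a feature map, the shape is preserved while the
# oscillatory, localized response accentuates sharp transitions
# (edges) relative to smooth regions.
feature_map = np.random.randn(2, 3, 4)
activated = morlet_activation(feature_map)
```

A derivative of the Morlet wavelet (the paper's best-performing variant uses the second-order derivative) could be substituted as the activation in the same way, since it shares the localized, oscillatory form.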
Results
The proposed methods were validated on our collected dataset, Student Behavior in Classroom (SBIC), as well as the publicly available Student Classroom Behavior (SCB) dataset. The CR-ViT and MCR-ViT models improved accuracy on SBIC by 6.10% and 7.32%, respectively, and on SCB by 0.92% and 1.14%. The MCR-ViT variant using the second-order derivative of the Morlet wavelet achieved the highest accuracy improvement compared with variants based on other wavelets.
Conclusion and significance
Both CR-ViT and MCR-ViT exhibit superior performance and can be leveraged to build high-performance student behavior recognition systems in classrooms.
Data availability
The data collected for this study involve student information, including facial data, which raises privacy concerns; the dataset therefore cannot be made publicly available. The data can be provided by the authors upon reasonable request.
Public datasets can be accessed from: https://www.kaggle.com/datasets/kaiyueyyds/dataset-of-student-classroom-behavior.
Code: https://github.com/HSS-XiaoTian/MCRViT
Acknowledgements
This research was supported by the Key Project of the Ministry of Education of National Education Science Planning (DCA220448).
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, JW., Ren, HT., Du, YR. et al. A cascaded residual vision transformer with wavelet transform and application in behavior recognition. Appl Intell 56, 145 (2026). https://doi.org/10.1007/s10489-025-07002-2
