Abstract
Objective
Convolutional Neural Networks (CNNs) have become essential tools for classroom student behavior recognition but lack the capability to capture global information. In recent years, the Vision Transformer (ViT) has demonstrated strong global modeling capabilities, which can be employed to strengthen the multi-level spatial information representation of classroom student behavior recognition models.
Methods
First, a Cascaded Residual Vision Transformer (CR-ViT) model was proposed. The outputs of residual convolutional layers were fed into multiple ViT modules to learn both shallow and deep feature representations for multi-level spatial information extraction, followed by an LSTM network that further captures the global dependencies across the sequence of ViT module outputs. Second, a Cascaded Residual Vision Transformer with Morlet wavelet (MCR-ViT) was proposed. Building on CR-ViT, Morlet wavelet transform activation layers were employed to improve sensitivity to edge variations in feature maps.
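To make the wavelet activation concrete, the following is a minimal sketch (not the authors' implementation) of an elementwise real Morlet wavelet used as an activation function; the function name, the center frequency `omega0`, and the use of numpy are illustrative assumptions.

```python
import numpy as np

def morlet_activation(x, omega0=5.0):
    """Elementwise real Morlet wavelet applied as an activation:
    psi(x) = cos(omega0 * x) * exp(-x**2 / 2).
    Illustrative sketch only; omega0 = 5.0 is a common default
    for the real Morlet wavelet, not a value from the paper."""
    x = np.asarray(x, dtype=np.float64)
    return np.cos(omega0 * x) * np.exp(-x ** 2 / 2.0)

# Applied to a feature map, the shape is preserved while the
# oscillatory, localized response accentuates sharp transitions
# (edges) relative to smooth regions.
feature_map = np.random.randn(2, 3, 4)
activated = morlet_activation(feature_map)
```

A derivative of the Morlet wavelet (the paper's best-performing variant uses the second-order derivative) could be substituted as the activation in the same way, since it shares the localized, oscillatory form.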
Results
The proposed methods were validated on our collected dataset, Student Behavior in Classroom (SBIC), as well as the publicly available Student Classroom Behavior (SCB) dataset. The CR-ViT and MCR-ViT models improved accuracy on SBIC by 6.10% and 7.32%, respectively, and on SCB by 0.92% and 1.14%. The MCR-ViT variant using the second-order derivative of the Morlet wavelet achieved the highest accuracy improvement compared with variants based on other wavelets.
Conclusion and significance
Both CR-ViT and MCR-ViT exhibit superior performance and can be leveraged to build high-performance student behavior recognition systems in classrooms.
Data availability
The data collected for this study involve student information, including facial data, which raises privacy concerns; the dataset therefore cannot be made publicly available. The data can be provided by the authors upon reasonable request.
Public datasets can be accessed from: https://www.kaggle.com/datasets/kaiyueyyds/dataset-of-student-classroom-behavior.
Code: https://github.com/HSS-XiaoTian/MCRViT
Acknowledgements
This research was supported by the Key Project of the Ministry of Education of National Education Science Planning (DCA220448).
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Liu, JW., Ren, HT., Du, YR. et al. A cascaded residual vision transformer with wavelet transform and application in behavior recognition. Appl Intell 56, 145 (2026). https://doi.org/10.1007/s10489-025-07002-2
