
A cascaded residual vision transformer with wavelet transform and application in behavior recognition

  • Published in: Applied Intelligence

Abstract

Objective

Convolutional Neural Networks (CNNs) have become essential tools for classroom student behavior recognition but lack the capability to capture global information. In recent years, the Vision Transformer (ViT) has demonstrated strong global modeling capabilities, which can be employed to strengthen the multi-level spatial information representation of classroom student behavior recognition models.

Methods

First, the Cascaded Residual Vision Transformer (CR-ViT) model was proposed. The outputs of residual convolutional layers were integrated into multiple ViT modules to learn both shallow and deep feature representations for multi-level spatial information extraction, followed by an LSTM network that further captures the global dependency across the sequence of ViT module outputs. Second, the Cascaded Residual Vision Transformer with Morlet wavelet (MCR-ViT) was proposed. Building on CR-ViT, Morlet wavelet transform activation layers were employed to improve sensitivity to edge variations in feature maps.
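The cascade described above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the stage count, channel width, pooling choice, and class count are placeholders, and the residual conv / ViT-block / LSTM wiring only mirrors the data flow the abstract describes.

```python
import torch
import torch.nn as nn

class CRViTSketch(nn.Module):
    """Illustrative cascade: residual convolution stages feed successive
    ViT-style encoder blocks; one pooled summary per stage forms a short
    sequence that an LSTM scans for cross-stage global dependencies.
    All dimensions are placeholder assumptions, not the paper's."""

    def __init__(self, dim=64, stages=3, heads=4, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.res_convs = nn.ModuleList(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1) for _ in range(stages))
        self.vit_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(stages))
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.stem(x)
        pooled = []
        for conv, vit in zip(self.res_convs, self.vit_blocks):
            x = x + conv(x)                        # residual convolution stage
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
            tokens = vit(tokens)                   # ViT-style encoder block
            pooled.append(tokens.mean(dim=1))      # one summary token per stage
        seq = torch.stack(pooled, dim=1)           # (B, stages, C)
        out, _ = self.lstm(seq)                    # dependency across stages
        return self.head(out[:, -1])
```

A forward pass on a `(2, 3, 16, 16)` batch yields a `(2, 10)` logit tensor under these placeholder settings.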

Results

The proposed methods were validated on our collected dataset, Student Behavior in Classroom (SBIC), as well as the publicly available Student Classroom Behavior (SCB) dataset. The CR-ViT and MCR-ViT models improved accuracy on SBIC by 6.10% and 7.32%, and on SCB by 0.92% and 1.14%, respectively. The MCR-ViT variant using the second-order derivative of the Morlet wavelet achieved the highest accuracy improvement compared with variants based on other wavelets.
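The Morlet wavelet and its second-order derivative mentioned above can be sketched as element-wise activations. This is a NumPy illustration using the standard real-valued Morlet form; the carrier frequency `OMEGA0 = 5.0` is an assumption, as the abstract does not state the paper's parameterization.

```python
import numpy as np

OMEGA0 = 5.0  # assumed carrier frequency (common default; not stated in the abstract)

def morlet(x):
    """Real-valued Morlet wavelet: cos(omega0 * x) * exp(-x^2 / 2)."""
    x = np.asarray(x, dtype=float)
    return np.cos(OMEGA0 * x) * np.exp(-x**2 / 2.0)

def morlet_second_derivative(x):
    """Closed-form second derivative of the Morlet wavelet w.r.t. x:
    exp(-x^2/2) * ((x^2 - 1 - w^2) * cos(w x) + 2 w x * sin(w x))."""
    x = np.asarray(x, dtype=float)
    g = np.exp(-x**2 / 2.0)
    c, s = np.cos(OMEGA0 * x), np.sin(OMEGA0 * x)
    return g * ((x**2 - 1.0 - OMEGA0**2) * c + 2.0 * OMEGA0 * x * s)
```

As an activation, either function would be applied element-wise to a feature map; the oscillatory-decaying shape is what makes it responsive to local edge-like variations.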

Conclusion and significance

Both CR-ViT and MCR-ViT exhibit superior performance and can be leveraged to build high-performance student behavior recognition systems in classrooms.


[Figs. 1–12 are available in the full article.]


Data availability

The data collected for this study involves student information, including facial data, which raises privacy concerns and thus cannot be made publicly available. The data can be provided by the authors upon reasonable request.

Public datasets can be accessed from: https://www.kaggle.com/datasets/kaiyueyyds/dataset-of-student-classroom-behavior.

Code: https://github.com/HSS-XiaoTian/MCRViT


Acknowledgements

This research is supported by Key Project of the Ministry of Education of National Education Science Planning (DCA220448).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jia-Ming Chen.

Ethics declarations

Competing interests

The authors declare no conflict of interest.


Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Liu, JW., Ren, HT., Du, YR. et al. A cascaded residual vision transformer with wavelet transform and application in behavior recognition. Appl Intell 56, 145 (2026). https://doi.org/10.1007/s10489-025-07002-2

