Abstract
The global health crisis caused by the COVID-19 pandemic has brought new challenges to speaker identification systems, particularly due to the acoustic alterations caused by the widespread use of face masks. Aiming to mitigate these distortions and improve the accuracy of speaker identification, this study introduces a novel two-level classification system, leveraging a unique integration of Vision Transformers (ViT) and Long Short-Term Memory (LSTM). This ViT-LSTM model was trained and tested on an extensive dataset composed of diverse speakers, both masked and unmasked, allowing a comprehensive evaluation of its capabilities. Our experimental results demonstrate remarkable improvements in speaker identification, with an accuracy score of 95.67%, significantly surpassing traditional and other deep learning-based methods. Moreover, our framework also shows considerable strength in detecting the presence of a mask, achieving an accuracy of 91.15% and outperforming existing state-of-the-art models. This study provides the first-ever benchmark for mask detection in the context of speaker identification, opening new pathways for research in this emerging area and presenting a robust solution for speaker identification in the era of face masks.
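The paper's exact architecture and hyperparameters are not reproduced here. Purely as an illustration of the idea the abstract describes, the sketch below (hypothetical shapes, names, and weight initializations; plain NumPy, untrained random weights) shows how ViT-style patch embeddings of a spectrogram can feed an LSTM whose final hidden state drives two heads: a speaker-identification softmax and a binary mask-detection output.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_embed(spec, patch=16, dim=64):
    """Split a (H, W) spectrogram into non-overlapping patches and
    linearly project each flattened patch to `dim` (ViT-style)."""
    H, W = spec.shape
    patches = (spec[:H // patch * patch, :W // patch * patch]
               .reshape(H // patch, patch, W // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch * patch))
    Wp = rng.standard_normal((patch * patch, dim)) * 0.02
    return patches @ Wp                      # (num_patches, dim)

def lstm_forward(xs, hidden=32):
    """Single-layer LSTM over the patch sequence; returns the last hidden state."""
    d = xs.shape[1]
    Wx = rng.standard_normal((d, 4 * hidden)) * 0.02
    Wh = rng.standard_normal((hidden, 4 * hidden)) * 0.02
    b = np.zeros(4 * hidden)
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in xs:
        gates = x @ Wx + h @ Wh + b
        i, f, o, u = np.split(gates, 4)      # input, forget, output, update gates
        c = sig(f) * c + sig(i) * np.tanh(u)
        h = sig(o) * np.tanh(c)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

spec = rng.standard_normal((128, 128))       # toy stand-in for a mel-spectrogram
h = lstm_forward(patch_embed(spec))          # (32,) summary of the patch sequence

n_speakers = 24                              # hypothetical speaker count
speaker_probs = softmax(h @ (rng.standard_normal((32, n_speakers)) * 0.02))
mask_prob = 1.0 / (1.0 + np.exp(-(h @ rng.standard_normal(32) * 0.02)))

print(speaker_probs.shape)                   # (24,)
```

In this toy setup the two tasks share one sequence encoder and differ only in their output heads; the actual two-level system in the paper may order or couple these stages differently.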

Data availability
The dataset supporting the conclusions of this article is not publicly available. Researchers interested in accessing the dataset for research purposes should contact the authors directly.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent for publication
Not applicable.
Informed consent
This study does not involve any experiments on animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nassif, A.B., Shahin, I., Bader, M. et al. ViT-LSTM synergy: a multi-feature approach for speaker identification and mask detection. Neural Comput & Applic 36, 22569–22586 (2024). https://doi.org/10.1007/s00521-024-10389-7
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s00521-024-10389-7
