Abstract
The global health crisis caused by the COVID-19 pandemic has brought new challenges to speaker identification systems, particularly due to the acoustic alterations caused by the widespread use of face masks. Aiming to mitigate these distortions and improve the accuracy of speaker identification, this study introduces a novel two-level classification system, leveraging a unique integration of Vision Transformers (ViT) and Long Short-Term Memory (LSTM). This ViT-LSTM model was trained and tested on an extensive dataset composed of diverse speakers, both masked and unmasked, allowing a comprehensive evaluation of its capabilities. Our experimental results demonstrate remarkable improvements in speaker identification, with an accuracy score of 95.67%, significantly surpassing traditional and other deep learning-based methods. Moreover, our framework also shows considerable strength in detecting the presence of a mask, achieving an accuracy of 91.15% and outperforming existing state-of-the-art models. This study provides the first-ever benchmark for mask detection in the context of speaker identification, opening new pathways for research in this emerging area and presenting a robust solution for speaker identification in the era of face masks.
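The paper's exact architecture and hyperparameters are not reproduced here. Purely as an illustration of the idea the abstract describes, the sketch below (hypothetical shapes, names, and weight initializations; plain NumPy, untrained random weights) shows how ViT-style patch embeddings of a spectrogram can feed an LSTM whose final hidden state drives two heads: a speaker-identification softmax and a binary mask-detection output.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_embed(spec, patch=16, dim=64):
    """Split a (H, W) spectrogram into non-overlapping patches and
    linearly project each flattened patch to `dim` (ViT-style)."""
    H, W = spec.shape
    patches = (spec[:H // patch * patch, :W // patch * patch]
               .reshape(H // patch, patch, W // patch, patch)
               .transpose(0, 2, 1, 3)
               .reshape(-1, patch * patch))
    Wp = rng.standard_normal((patch * patch, dim)) * 0.02
    return patches @ Wp                      # (num_patches, dim)

def lstm_forward(xs, hidden=32):
    """Single-layer LSTM over the patch sequence; returns the last hidden state."""
    d = xs.shape[1]
    Wx = rng.standard_normal((d, 4 * hidden)) * 0.02
    Wh = rng.standard_normal((hidden, 4 * hidden)) * 0.02
    b = np.zeros(4 * hidden)
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in xs:
        gates = x @ Wx + h @ Wh + b
        i, f, o, u = np.split(gates, 4)      # input, forget, output, update gates
        c = sig(f) * c + sig(i) * np.tanh(u)
        h = sig(o) * np.tanh(c)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

spec = rng.standard_normal((128, 128))       # toy stand-in for a mel-spectrogram
h = lstm_forward(patch_embed(spec))          # (32,) summary of the patch sequence

n_speakers = 24                              # hypothetical speaker count
speaker_probs = softmax(h @ (rng.standard_normal((32, n_speakers)) * 0.02))
mask_prob = 1.0 / (1.0 + np.exp(-(h @ rng.standard_normal(32) * 0.02)))

print(speaker_probs.shape)                   # (24,)
```

In this toy setup the two tasks share one sequence encoder and differ only in their output heads; the actual two-level system in the paper may order or couple these stages differently.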

Data availability
The dataset supporting the conclusions of this article is not publicly available. Researchers interested in accessing the dataset for research purposes should contact the authors directly.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Consent for publication
Not applicable.
Informed consent
This study does not involve any experiments on animals.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Nassif, A.B., Shahin, I., Bader, M. et al. ViT-LSTM synergy: a multi-feature approach for speaker identification and mask detection. Neural Comput & Applic 36, 22569–22586 (2024). https://doi.org/10.1007/s00521-024-10389-7
Received:
Accepted:
Published:
Version of record:
Issue date:
DOI: https://doi.org/10.1007/s00521-024-10389-7
