
ViT-LSTM synergy: a multi-feature approach for speaker identification and mask detection

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

The global health crisis caused by the COVID-19 pandemic has brought new challenges to speaker identification systems, particularly due to the acoustic alterations caused by the widespread use of face masks. Aiming to mitigate these distortions and improve the accuracy of speaker identification, this study introduces a novel two-level classification system, leveraging a unique integration of Vision Transformer (ViT) and Long Short-Term Memory (LSTM) networks. This ViT-LSTM model was trained and tested on an extensive dataset composed of diverse speakers, both masked and unmasked, allowing a comprehensive evaluation of its capabilities. Our experimental results demonstrate remarkable improvements in speaker identification, with an accuracy score of 95.67%, significantly surpassing traditional and other deep learning-based methods. Moreover, our framework also shows considerable strength in detecting the presence of a mask, achieving an accuracy of 91.15% and outperforming existing state-of-the-art models. This study provides the first-ever benchmark for mask detection in the context of speaker identification, opening new pathways for research in this emerging area and presenting a robust solution for speaker identification in the era of face masks.
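To make the two-level design concrete, the sketch below shows the decision flow it implies: a first-level classifier detects whether the speaker wears a mask, and a second-level classifier identifies the speaker conditioned on that outcome. This is a hypothetical illustration only; the model internals (here toy stand-in functions), the function names, and the thresholding heuristic are placeholders, not the paper's actual ViT-LSTM implementation.

```python
# Hypothetical sketch of a two-level classification pipeline:
# level 1 detects mask presence, level 2 identifies the speaker
# with a model matched to the detected condition.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TwoLevelClassifier:
    mask_detector: Callable[[List[float]], bool]       # level 1 (ViT-LSTM in the paper)
    masked_speaker_id: Callable[[List[float]], str]    # level 2, masked-speech model
    unmasked_speaker_id: Callable[[List[float]], str]  # level 2, unmasked-speech model

    def predict(self, features: List[float]) -> Tuple[bool, str]:
        masked = self.mask_detector(features)
        model = self.masked_speaker_id if masked else self.unmasked_speaker_id
        return masked, model(features)

# Toy stand-ins: a mean-energy threshold mimics the intuition that
# masks attenuate high-frequency energy; speaker models are constants.
clf = TwoLevelClassifier(
    mask_detector=lambda f: sum(f) / len(f) < 0.5,
    masked_speaker_id=lambda f: "speaker_A",
    unmasked_speaker_id=lambda f: "speaker_B",
)

print(clf.predict([0.2, 0.3, 0.4]))  # → (True, 'speaker_A')
```

Conditioning the speaker model on the level-1 decision is what lets each second-level model specialize in one acoustic condition instead of averaging over both.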


Figures 1–7 appear in the full article.


Data availability

The dataset supporting the conclusions of this article is not publicly available. Researchers interested in accessing the dataset for research purposes should contact the authors directly.


Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ali Bou Nassif.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Consent for publication

Not applicable.

Informed consent

This study does not involve any experiments on animals.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Nassif, A.B., Shahin, I., Bader, M. et al. ViT-LSTM synergy: a multi-feature approach for speaker identification and mask detection. Neural Comput & Applic 36, 22569–22586 (2024). https://doi.org/10.1007/s00521-024-10389-7

