
Toward an emotion efficient architecture based on the sound spectrum from the voice of Portuguese speakers

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

One of the main challenges in recognizing emotion from the voice is the speaker-specific character of an individual's sound spectrum, shaped by accent, speech rhythm, and regionalism, as well as the wide variability of spoken phrases. Despite many proposed emotion recognition models, improving the accuracy of emotion classification through specialization remains an open research question. Faced with these challenges, this work proposes DEEP (DEtection of voice Emotion in Portuguese language), an architecture for detecting vocal emotion based on patterns present in the sound spectrum generated by the voice of Brazilian Portuguese speakers. DEEP recognizes each emotion with a set of specialist Convolutional Neural Networks that receive as input features extracted from the sound spectrum. By specializing a network for each emotion, DEEP aims to increase recognition accuracy and to adapt to the different tones and voice conditions that occur in everyday life. Our results show that DEEP outperforms state-of-the-art techniques on the emotion recognition measures in all evaluated scenarios.
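
To make the specialist-network idea concrete, the sketch below builds one small binary CNN per emotion over spectral feature maps and picks the emotion whose specialist scores highest. This is a minimal illustration, not the authors' implementation: the emotion list, input shape, layer sizes, and the build_specialist/predict_emotion helpers are assumptions; only the use of Keras follows the paper's footnotes.

```python
# Minimal sketch: one binary CNN specialist per emotion over spectral
# features (e.g., an MFCC map). All shapes, layers, and names below are
# illustrative assumptions, not the DEEP authors' configuration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]  # assumed label set
INPUT_SHAPE = (40, 126, 1)  # e.g., 40 MFCC coefficients x 126 time frames

def build_specialist(input_shape=INPUT_SHAPE):
    """A small CNN that scores a single emotion (one-vs-rest)."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(utterance shows this emotion)
    ])

# One specialist per emotion; each would be trained one-vs-rest.
specialists = {emo: build_specialist() for emo in EMOTIONS}
for model in specialists.values():
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

def predict_emotion(features):
    """Return the emotion whose specialist gives the highest score."""
    batch = features[np.newaxis, ...]  # add batch dimension
    scores = {emo: float(m.predict(batch, verbose=0)[0, 0])
              for emo, m in specialists.items()}
    return max(scores, key=scores.get)

# Smoke test with random features standing in for a real spectral map:
dummy = np.random.rand(*INPUT_SHAPE).astype("float32")
print(predict_emotion(dummy))
```

Training and the actual feature extraction from the voice spectrum are omitted; the sketch only shows the ensemble-of-specialists structure the abstract describes.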



Data availability

Not applicable.

Code availability

Not applicable.

Notes

  1. Phonic properties of the speech chain that contribute to the interpretation of meaning and determine sentence rhythm.

  2. https://catalog.ldc.upenn.edu/LDC99S78.

  3. http://www-gth.die.upm.es/research/documentation/AI-76Emo-02.pdf.

  4. https://www.researchgate.net/publication/322876066_NNIME_The_NTHU-NTUA_Chinese_interactive_multimodal_emotion_corpus.

  5. http://emodb.bilderbar.info/start.html.

  6. https://sail.usc.edu/iemocap/.

  7. https://keras.io/.

  8. Hyperas: https://github.com/maxpumperla/hyperas (see the usage sketch after these notes).

  9. Keras: https://github.com/keras-team/keras.
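
Notes 8 and 9 name the libraries used for implementation and hyperparameter search. For readers who have not used Hyperas, the sketch below shows its general usage pattern: hyperparameters are marked with double-brace template expressions and searched with hyperopt's TPE algorithm. The synthetic data, network, and search space are illustrative assumptions, not the configuration used in the paper.

```python
# General Hyperas usage pattern: TPE search over a small Keras model.
# The data and search space here are placeholders, not the paper's setup.
import numpy as np
from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def data():
    # Placeholder for real spectral features (e.g., 40-dim MFCC vectors).
    x_train = np.random.rand(200, 40).astype("float32")
    y_train = np.random.randint(0, 2, size=(200, 1))
    x_test = np.random.rand(50, 40).astype("float32")
    y_test = np.random.randint(0, 2, size=(50, 1))
    return x_train, y_train, x_test, y_test

def create_model(x_train, y_train, x_test, y_test):
    model = Sequential()
    # Double-brace expressions are the hyperparameters Hyperas searches.
    model.add(Dense({{choice([32, 64, 128])}}, activation="relu",
                    input_shape=(40,)))
    model.add(Dropout({{uniform(0.0, 0.5)}}))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy",
                  optimizer={{choice(["rmsprop", "adam"])}},
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=3, batch_size=32, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    # Hyperopt minimizes the loss, so return negative accuracy.
    return {"loss": -acc, "status": STATUS_OK, "model": model}

if __name__ == "__main__":
    best_run, best_model = optim.minimize(model=create_model, data=data,
                                          algo=tpe.suggest, max_evals=5,
                                          trials=Trials())
    print("Best hyperparameters:", best_run)
```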


Acknowledgements

This research was funded by the FAPESP (São Paulo Research Foundation) grant #2021/06210-3.

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Geraldo P. Rocha Filho.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Filho, G.P.R., Meneguette, R.I., Mendonça, F.L.L.d. et al. Toward an emotion efficient architecture based on the sound spectrum from the voice of Portuguese speakers. Neural Comput & Applic 36, 19939–19950 (2024). https://doi.org/10.1007/s00521-024-10249-4
