
Toward an emotion efficient architecture based on the sound spectrum from the voice of Portuguese speakers

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

One of the main challenges in recognizing emotion from the voice is the speaker-specific character of an individual's sound spectrum, shaped by accent, speech rhythm, and regionalism, as well as the wide variability of spoken phrases. Despite many proposed emotion recognition models, improving the accuracy of emotion classification through specialization remains an open research question. Faced with these challenges, this work proposes DEEP (DEtection of voice Emotion in Portuguese language), an architecture for detecting vocal emotion based on patterns present in the sound spectrum generated by the voice of Brazilian Portuguese speakers. DEEP recognizes each emotion with a set of specialist Convolutional Neural Networks that receive as input features extracted from the sound spectrum. By specializing a network for each emotion, DEEP aims to increase recognition accuracy and to adapt to the different tones and voice conditions that occur in everyday life. Our results show that DEEP outperforms state-of-the-art techniques on the emotion recognition measures in all evaluated scenarios.
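
To make the specialist-network idea concrete, the sketch below builds one small binary CNN per emotion over spectral feature maps and picks the emotion whose specialist scores highest. This is a minimal illustration, not the authors' implementation: the emotion list, input shape, layer sizes, and the build_specialist/predict_emotion helpers are assumptions; only the use of Keras follows the paper's footnotes.

```python
# Minimal sketch: one binary CNN specialist per emotion over spectral
# features (e.g., an MFCC map). All shapes, layers, and names below are
# illustrative assumptions, not the DEEP authors' configuration.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]  # assumed label set
INPUT_SHAPE = (40, 126, 1)  # e.g., 40 MFCC coefficients x 126 time frames

def build_specialist(input_shape=INPUT_SHAPE):
    """A small CNN that scores a single emotion (one-vs-rest)."""
    return keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # P(utterance shows this emotion)
    ])

# One specialist per emotion; each would be trained one-vs-rest.
specialists = {emo: build_specialist() for emo in EMOTIONS}
for model in specialists.values():
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

def predict_emotion(features):
    """Return the emotion whose specialist gives the highest score."""
    batch = features[np.newaxis, ...]  # add batch dimension
    scores = {emo: float(m.predict(batch, verbose=0)[0, 0])
              for emo, m in specialists.items()}
    return max(scores, key=scores.get)

# Smoke test with random features standing in for a real spectral map:
dummy = np.random.rand(*INPUT_SHAPE).astype("float32")
print(predict_emotion(dummy))
```

Training and the actual feature extraction from the voice spectrum are omitted; the sketch only shows the ensemble-of-specialists structure the abstract describes.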



Data availability

Not applicable.

Code availability

Not applicable.

Notes

  1. Phonic properties of the speech chain that contribute to the interpretation of meaning and determine sentence rhythm.

  2. https://catalog.ldc.upenn.edu/LDC99S78.

  3. http://www-gth.die.upm.es/research/documentation/AI-76Emo-02.pdf.

  4. https://www.researchgate.net/publication/322876066_NNIME_The_NTHU-NTUA_Chinese_interactive_multimodal_emotion_corpus.

  5. http://emodb.bilderbar.info/start.html.

  6. https://sail.usc.edu/iemocap/.

  7. https://keras.io/.

  8. Hyperas: https://github.com/maxpumperla/hyperas (see the usage sketch after these notes).

  9. Keras: https://github.com/keras-team/keras.
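
Notes 8 and 9 name the libraries used for implementation and hyperparameter search. For readers who have not used Hyperas, the sketch below shows its general usage pattern: hyperparameters are marked with double-brace template expressions and searched with hyperopt's TPE algorithm. The synthetic data, network, and search space are illustrative assumptions, not the configuration used in the paper.

```python
# General Hyperas usage pattern: TPE search over a small Keras model.
# The data and search space here are placeholders, not the paper's setup.
import numpy as np
from hyperopt import Trials, STATUS_OK, tpe
from hyperas import optim
from hyperas.distributions import choice, uniform
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

def data():
    # Placeholder for real spectral features (e.g., 40-dim MFCC vectors).
    x_train = np.random.rand(200, 40).astype("float32")
    y_train = np.random.randint(0, 2, size=(200, 1))
    x_test = np.random.rand(50, 40).astype("float32")
    y_test = np.random.randint(0, 2, size=(50, 1))
    return x_train, y_train, x_test, y_test

def create_model(x_train, y_train, x_test, y_test):
    model = Sequential()
    # Double-brace expressions are the hyperparameters Hyperas searches.
    model.add(Dense({{choice([32, 64, 128])}}, activation="relu",
                    input_shape=(40,)))
    model.add(Dropout({{uniform(0.0, 0.5)}}))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss="binary_crossentropy",
                  optimizer={{choice(["rmsprop", "adam"])}},
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=3, batch_size=32, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    # Hyperopt minimizes the loss, so return negative accuracy.
    return {"loss": -acc, "status": STATUS_OK, "model": model}

if __name__ == "__main__":
    best_run, best_model = optim.minimize(model=create_model, data=data,
                                          algo=tpe.suggest, max_evals=5,
                                          trials=Trials())
    print("Best hyperparameters:", best_run)
```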


Acknowledgements

This research was funded by the FAPESP (São Paulo Research Foundation) grant #2021/06210-3.

Funding

Not applicable.

Author information

Corresponding author

Correspondence to Geraldo P. Rocha Filho.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Filho, G.P.R., Meneguette, R.I., Mendonça, F.L.L.d. et al. Toward an emotion efficient architecture based on the sound spectrum from the voice of Portuguese speakers. Neural Comput & Applic 36, 19939–19950 (2024). https://doi.org/10.1007/s00521-024-10249-4
