Abstract
Pitch frequency is a fundamental feature for speaker identification. However, when speech is recorded in a closed room, reverberation distorts the signal and corrupts this feature. The distortion reduces the effectiveness of speaker identification systems and also opens the door to deception by hackers who exploit the reverberation effects of closed rooms. Correcting the estimated pitch frequencies is therefore an essential step for reliable speaker identification. This paper presents a Hybrid Approach for Estimating Pitch Frequency (HAEPF) that integrates the Zero-Crossing Rate (ZCR) and Auto-Correlation Function (ACF) methods. The paper also models reverberant speech using comb filtering, showing how multiple reflections affect the accuracy of pitch frequency estimation. Several simulation experiments were conducted to assess pitch frequency estimation for speech signals with and without reverberation. Estimation errors were calculated for three reverberation scenarios (mild, moderate, and severe). The results indicate that the pitch frequency estimation error increases with the degree of reverberation, which is characterized by the comb filter order. The estimation accuracy of the proposed approach is measured in terms of Pitch Frequency Estimation Error (PFEE), Gross Pitch Error (GPE), and Octave Error (OER), and is compared with that of several established pitch frequency estimation methods. The proposed approach shows a notable improvement even in noisy environments, reducing PFEE by 43% and achieving GPE and OER of less than 0.3 and 0.12, respectively, at a Signal-to-Noise Ratio (SNR) of 0 dB.
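The building blocks named in the abstract can be illustrated with a minimal Python/NumPy sketch. It is not the authors' HAEPF: the feed-forward comb-filter form, the use of ZCR as a voicing gate ahead of an ACF peak search, the threshold values, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def comb_reverb(x, delay, gain, order):
    """Assumed reverberation model: y[n] = sum_{k=0..order} gain**k * x[n - k*delay].

    'order' is the number of reflections added to the direct path.
    """
    x = np.asarray(x, dtype=float)
    y = x.copy()
    for k in range(1, order + 1):
        shift = k * delay
        if shift >= len(x):
            break  # this reflection falls outside the frame
        y[shift:] += (gain ** k) * x[: len(x) - shift]
    return y

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def acf_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch from the largest autocorrelation peak within [fmin, fmax] Hz."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    # Keep only non-negative lags of the full autocorrelation.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(fs / fmax)
    hi = min(int(fs / fmin), len(frame) - 1)
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return fs / lag

def hybrid_pitch(frame, fs, zcr_threshold=0.3):
    """Illustrative ZCR + ACF combination (an assumption, not the paper's rule).

    A high ZCR is treated as unvoiced (returns 0.0); otherwise the ACF
    estimate is returned.
    """
    if zero_crossing_rate(frame) > zcr_threshold:
        return 0.0
    return acf_pitch(frame, fs)
```

For example, `hybrid_pitch(comb_reverb(frame, delay=37, gain=0.6, order=3), fs)` would estimate the pitch of a synthetically reverberated frame; the delay, gain, and order values here are placeholders, and raising `order` corresponds to the heavier reverberation scenarios described above.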













Data availability
Not applicable.
Acknowledgements
The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia, for funding this research work through project number ISP23-56.
Funding
This research was funded by the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia, grant number ISP23-56.
Author information
Contributions
All authors contributed to writing and reviewing this paper.
Ethics declarations
Conflicts of interest
The authors declare no conflicts of interest.
About this article
Cite this article
Hassan, E.S., Neyazi, B., Seddeq, H.S. et al. HAEPF: hybrid approach for estimating pitch frequency in the presence of reverberation. Multimed Tools Appl 83, 77489–77508 (2024). https://doi.org/10.1007/s11042-024-18231-x
DOI: https://doi.org/10.1007/s11042-024-18231-x


