Speech starter: noise-robust endpoint detection by using filled pauses

Kitayama, Koji; Goto, Masataka; Itou, Katunobu; Kobayashi, Tetsunori

doi:10.21437/Eurospeech.2003-396

In this paper we propose a speech interface function, called speech starter, that enables noise-robust endpoint (utterance) detection for speech recognition. When current speech recognizers are used in a noisy environment, a typical recognition error is caused by incorrect endpoints because their automatic detection is likely to be disturbed by non-stationary noises. The speech starter function enables a user to specify the beginning of each utterance by uttering a filler with a filled pause, which is used as a trigger to start speech-recognition processes. Since filled pauses can be detected robustly in a noisy environment, practical endpoint detection is achieved. Speech starter also offers the advantage of providing a hands-free speech interface and it is user-friendly because a speaker tends to utter filled pauses (e.g., "er...") at the beginning of utterances when hesitating in human-human communication. Experimental results from a 10-dB-SNR noisy environment show that the recognition error rate with speech starter was lower than with conventional endpoint-detection methods.
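The trigger idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes some upstream detector has already labelled each audio frame with two flags, `is_filled_pause` and `is_speech`. The function name `segment_utterances` and both thresholds are hypothetical, chosen only for illustration. The sketch shows that speech-like noise arriving before a filled pause never opens an utterance.

```python
from typing import List, Tuple


def segment_utterances(
    frames: List[Tuple[bool, bool]],
    min_filled_pause_frames: int = 3,      # hypothetical threshold
    max_trailing_silence_frames: int = 5,  # hypothetical threshold
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, for triggered utterances.

    Each frame is (is_filled_pause, is_speech); both flags are assumed to
    come from detectors that are outside the scope of this sketch.
    """
    segments: List[Tuple[int, int]] = []
    state = "idle"
    fp_run = 0        # consecutive filled-pause frames seen while idle
    silence_run = 0   # consecutive non-speech frames seen while active
    start = 0

    for i, (is_fp, is_speech) in enumerate(frames):
        if state == "idle":
            # Ignore everything except a sufficiently long filled pause.
            fp_run = fp_run + 1 if is_fp else 0
            if fp_run >= min_filled_pause_frames:
                state = "triggered"
        elif state == "triggered":
            # Wait for the filler to end; the utterance starts right after it.
            if not is_fp:
                state = "active"
                start = i
                silence_run = 0 if is_speech else 1
        else:  # state == "active"
            if is_speech:
                silence_run = 0
            else:
                silence_run += 1
                if silence_run >= max_trailing_silence_frames:
                    end = i - silence_run + 1
                    if end > start:  # skip empty segments
                        segments.append((start, end))
                    state = "idle"
                    fp_run = 0

    # Close an utterance that is still open when the input ends.
    if state == "active":
        end = len(frames) - silence_run
        if end > start:
            segments.append((start, end))
    return segments


# Frame 1 is a speech-like noise burst; frames 2-5 are the filled pause.
example = (
    [(False, False), (False, True)]
    + [(True, True)] * 4
    + [(False, True)] * 3
    + [(False, False)] * 2
    + [(False, True)]
)
result = segment_utterances(example, max_trailing_silence_frames=2)
```

Here the noise burst at frame 1 and the one at frame 11 are both ignored, because the segmenter only leaves the idle state after a filled pause. A real system would pass the audio in each returned range to the recognizer.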
