Two-layered audio-visual integration in voice activity detection and automatic speech recognition for robots

Yoshida, Takami; Nakadai, Kazuhiro

doi:10.21437/Interspeech.2010-716

Automatic Speech Recognition (ASR), which plays an important role in human-robot interaction, should be noise-robust because robots are expected to work in noisy environments. Audio-Visual (AV) integration is one of the key ideas for improving robustness in such environments. This paper proposes two-layered AV integration for an ASR system, which applies AV integration to both the Voice Activity Detection (VAD) and ASR decoding processes. We implemented a prototype ASR system based on the proposed two-layered AV integration and evaluated it in dynamically changing situations where audio and/or visual information can be noisy or missing. Preliminary results showed that the proposed method improves the robustness of the ASR system even in acoustically or visually contaminated situations.
