Detecting the acoustic events (AEs) that naturally occur in a meeting room can help describe the human and social activity that takes place there. When applied to spontaneous recordings, acoustic event detection (AED) from audio information alone yields a large number of errors, mostly due to temporal overlap of sounds. In this paper, a system that detects and recognizes AEs using both audio and video information is presented. A feature-level fusion strategy is used, and the HMM-GMM based system models each class separately, following a one-against-all training strategy. Experimental AED results on a new, rather spontaneous dataset are presented, showing the advantage of the proposed approach.
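To make the fusion and training scheme concrete, the following is a minimal sketch, not the paper's actual system: per-frame audio and video features are fused by concatenation (feature-level fusion), and each class's HMM-GMM is stood in for by a single diagonal Gaussian trained one-against-all. The class names, feature dimensions, and synthetic data are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(audio_feats, video_feats):
    """Feature-level fusion: concatenate per-frame audio and video vectors."""
    return np.concatenate([audio_feats, video_feats], axis=1)

class DiagGaussianModel:
    """Stand-in for one class's HMM-GMM: a single diagonal Gaussian."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.var = X.var(axis=0) + 1e-6  # floor variance for stability
        return self

    def loglik(self, X):
        """Mean per-frame log-likelihood of the frames in X."""
        z = (X - self.mean) ** 2 / self.var
        return -0.5 * (z + np.log(2 * np.pi * self.var)).sum(axis=1).mean()

# Synthetic data: 3 hypothetical AE classes; audio dim 13, video dim 4.
classes = ["steps", "door_slam", "keyboard"]
data = {}
for i, c in enumerate(classes):
    a = rng.normal(loc=2 * i, scale=1.0, size=(200, 13))
    v = rng.normal(loc=-2 * i, scale=1.0, size=(200, 4))
    data[c] = fuse(a, v)

# One-against-all training: each class gets its own model trained on its
# frames; the "rest" model for that class pools all other classes' frames.
models = {c: DiagGaussianModel().fit(X) for c, X in data.items()}
rest = {c: DiagGaussianModel().fit(
            np.vstack([X for k, X in data.items() if k != c]))
        for c in classes}

def classify(X):
    # Pick the class with the largest own-vs-rest log-likelihood ratio.
    scores = {c: models[c].loglik(X) - rest[c].loglik(X) for c in classes}
    return max(scores, key=scores.get)

print(classify(data["door_slam"]))
```

In the actual system each one-against-all model would be an HMM with GMM state emissions trained on fused audio-visual feature sequences; the single-Gaussian classifier above only illustrates the fusion and the class-vs-rest training split.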