MultiVox

A Benchmark for Evaluating Voice Assistants for Multimodal Interactions

What is MultiVox?

[Figure: MultiVox example]

We study the evaluation of multimodal voice assistants: how well omni models integrate spoken dialogue with visual cues to produce context-aware responses.

We introduce MultiVox, a comprehensive benchmark of 1,000 human-annotated speech dialogues paired with diverse visual content. Unlike existing vision-centric benchmarks, MultiVox evaluates a model's ability to understand fine-grained paralinguistic features, environmental acoustic context, and speaker persona together with visual signals.

Our evaluation reveals a significant gap: humans excel at these multimodal tasks, whereas current state-of-the-art models consistently struggle to produce contextually grounded responses that integrate both spoken and visual cues.

Examples