What is MultiVox?
We explore multimodal voice assistant evaluation: assessing how well omni models integrate spoken dialogues with visual cues for context-aware responses.
We introduce MultiVox, a comprehensive benchmark of 1,000 human-annotated speech dialogues paired with diverse visual content. Unlike existing vision-centric benchmarks, MultiVox evaluates models' ability to understand fine-grained paralinguistic features, environmental acoustic context, and speaker persona alongside visual signals.
Our evaluation reveals a significant gap: while humans excel at these multimodal tasks, current state-of-the-art models consistently struggle to integrate spoken and visual cues into contextually grounded responses, falling short of truly omni-modal understanding.