MultiVox is a benchmark for assessing how well omni-modal language models integrate audio and visual cues to give a contextual response.
We provide scripts to run Qwen 2.5 Omni using vLLM. Run the baseline with:
python3 src/baseline_qwen.py
We use GPT 4.1-mini to run evaluation. You can run it with the following script:
python3 src/evaluate.py
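
To chain the two steps above, a small driver along these lines might help. It is only a sketch: it assumes both scripts are launched from the repository root with no extra arguments, and that evaluation should be skipped when the baseline run fails. `build_commands` and `run_pipeline` are hypothetical helper names, not part of this repository.

```python
import subprocess
import sys

# Script paths as given in this README; assumed to be run from the repo root.
PIPELINE = ["src/baseline_qwen.py", "src/evaluate.py"]


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one argv list per script, in the order they should run."""
    return [[python, script] for script in scripts]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in turn and stop at the first non-zero exit code.

    `runner` defaults to subprocess.run; it can be swapped for a stub
    when checking the control flow without launching the real scripts.
    """
    for argv in commands:
        result = runner(argv)
        if result.returncode != 0:
            return result.returncode
    return 0
```

Calling `sys.exit(run_pipeline(build_commands()))` would then run the baseline followed by the evaluation, and return the failing step's exit code if either one fails.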