- [2025/07/22] Due to errors in a few clue annotations when converting frame indexes to timestamps, the previous benchmark leaderboard was incorrect. We have re-evaluated all models and updated the leaderboard.
To train AV-Reasoner, please follow these steps:
1. Install Ola-Omni dependencies
   Follow the official guide here: Ola
2. Install additional dependencies
   After setting up Ola-Omni, install the following dependencies to run the training script:
```
accelerate>=1.2.1
bitsandbytes>=0.43.0
einops>=0.8.0
datasets>=3.2.0
deepspeed==0.15.4
hf_transfer>=0.1.4
huggingface-hub[cli]>=0.19.2,<1.0
liger_kernel==0.5.2
packaging>=23.0
safetensors>=0.3.3
sentencepiece>=0.1.99
transformers
trl
torch>=2.5.1
pytest
parameterized>=0.9.0
black>=24.4.2
isort>=5.12.0
flake8>=6.0.0
math-verify
wandb>=0.19.1
pillow
```
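After installing, a quick stdlib-only check like the sketch below can confirm that the pinned packages are present and report their installed versions. The package names are taken from the list above; which subset to check is up to you, and this snippet is only an illustration, not part of the training pipeline.

```python
import importlib.metadata as md

# A few of the dependencies listed above (values are the pinned versions,
# or None where the list gives no pin).
required = {
    "accelerate": "1.2.1",
    "deepspeed": "0.15.4",
    "transformers": None,
}

def check(requirements):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in requirements:
        try:
            found[name] = md.version(name)
        except md.PackageNotFoundError:
            found[name] = None
    return found

print(check(required))
```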
If you only want to evaluate CG-AV-Counting, you only need to install the following dependencies:

```
numpy
scipy
scikit-learn
Pillow
requests
decord
```
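For orientation only: counting benchmarks are typically scored with error- and tolerance-based metrics. The sketch below implements two common ones, mean absolute error (MAE) and off-by-one (OBO) accuracy, using only numpy. The metric choices and definitions here are assumptions for illustration, not the official CG-AV-Counting evaluation script.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between predicted and ground-truth counts."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

def obo_accuracy(pred, gt):
    """Fraction of predictions within +/-1 of the ground-truth count."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt) <= 1))

pred = [3, 5, 10]
gt = [3, 7, 9]
print(mae(pred, gt))           # 1.0
print(obo_accuracy(pred, gt))  # 2 of 3 predictions within +/-1
```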
> **Note:** We recommend training on at least 4 A100 (80GB) GPUs; otherwise you may encounter CUDA out-of-memory errors.
```bash
cd train
bash train.sh
```
- Support VLMEvalKit and lmm-evals
- Release Evaluation Scripts for AV-Reasoner on Other Benchmarks
- Our codebase is built upon Open-R1-Video and Ola
```bibtex
@misc{lu2025avreasoner,
  title={AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs},
  author={Lidong Lu and Guo Chen and Zhiqi Li and Yicheng Liu and Tong Lu},
  year={2025},
  eprint={2506.05328},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.05328},
}

@misc{chen2024cgbench,
  title={CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding},
  author={Guo Chen and Yicheng Liu and Yifei Huang and Yuping He and Baoqi Pei and Jilan Xu and Yali Wang and Tong Lu and Limin Wang},
  year={2024},
  eprint={2412.12075},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.12075},
}
```
