
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs


📰 Updates

  • [2025/07/22]
    • A few clue annotations contained errors introduced when converting frame indexes to timestamps, which affected the previous benchmark leaderboard. We have re-evaluated all models and updated the leaderboard accordingly.

🛠️ Requirements and Installation

Training AV-Reasoner

To train AV-Reasoner, please follow these steps:

  1. Install Ola-Omni dependencies
    Follow the official installation guide in the Ola repository.

  2. Install additional dependencies
    After setting up Ola-Omni, install the following dependencies to run the training script:

accelerate>=1.2.1
bitsandbytes>=0.43.0
einops>=0.8.0
datasets>=3.2.0
deepspeed==0.15.4
hf_transfer>=0.1.4
huggingface-hub[cli]>=0.19.2,<1.0
liger_kernel==0.5.2
packaging>=23.0
safetensors>=0.3.3
sentencepiece>=0.1.99
transformers
trl
torch>=2.5.1
pytest
parameterized>=0.9.0
black>=24.4.2
isort>=5.12.0
flake8>=6.0.0
math-verify
wandb>=0.19.1
pillow
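As a convenience, the following minimal sketch (not part of the repository) reports which of the training dependencies above are missing or older than their stated minimums. It uses only the standard library. `check_minimums`, `release_tuple`, and `is_at_least` are hypothetical helpers, the `MINIMUMS` table copies only a subset of the list above, and the comparison looks only at the numeric release segment of each version, so for full PEP 440 semantics (pre-releases, epochs) `packaging.version.Version` is the more robust choice.

```python
# Hypothetical helper (not part of AV-Reasoner): report training dependencies
# that are missing or older than the minimum versions listed in the README.
from importlib.metadata import PackageNotFoundError, version

# Subset of the ">=" requirements above; extend as needed.
MINIMUMS = {
    "accelerate": "1.2.1",
    "datasets": "3.2.0",
    "einops": "0.8.0",
    "safetensors": "0.3.3",
    "torch": "2.5.1",
    "wandb": "0.19.1",
}


def release_tuple(v: str) -> tuple[int, ...]:
    """Leading numeric release segment, e.g. '2.5.1+cu121' -> (2, 5, 1).

    Simplification: pre/post/dev suffixes are ignored, so '1.2.1rc1' -> (1, 2, 1).
    """
    parts = []
    for piece in v.split("+")[0].split("."):  # drop local part like '+cu121'
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # segment such as 'dev1' has no leading number
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare release tuples after zero-padding, so '2.5' equals '2.5.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_minimums(minimums: dict[str, str]) -> dict[str, str]:
    """Map each missing or outdated distribution to a short problem description."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not is_at_least(installed, minimum):
            problems[name] = f"{installed} < {minimum}"
    return problems


if __name__ == "__main__":
    problems = check_minimums(MINIMUMS)
    for name, problem in problems.items():
        print(f"{name}: {problem}")
    if not problems:
        print("All checked minimums are satisfied.")
```

This only handles `>=` requirements; exact pins such as `deepspeed==0.15.4` and `liger_kernel==0.5.2` would need an equality check instead.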

Evaluating CG-AV-Counting

If you only want to evaluate on CG-AV-Counting, you need just the following dependencies:

numpy
scipy
scikit-learn
Pillow
requests
decord
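Two of these distributions are imported under different names (`scikit-learn` as `sklearn`, `Pillow` as `PIL`), which is a common source of confusion when checking an environment. The following minimal sketch, which is not part of the repository, checks that each module can be found without importing it; `EVAL_IMPORTS` and `missing_modules` are hypothetical names chosen for illustration.

```python
# Hypothetical helper (not part of AV-Reasoner): confirm the evaluation
# dependencies can be imported, mapping distribution names to module names.
import importlib.util

EVAL_IMPORTS = {
    "numpy": "numpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",  # distribution name differs from import name
    "Pillow": "PIL",            # distribution name differs from import name
    "requests": "requests",
    "decord": "decord",
}


def missing_modules(imports: dict[str, str]) -> list[str]:
    """Return the distribution names whose top-level module cannot be found."""
    return [
        dist
        for dist, module in imports.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_modules(EVAL_IMPORTS)
    print("missing:", ", ".join(missing) if missing else "none")
```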

🗝️ Training with GRPO

Note: We recommend training on at least 4 A100 (80GB) GPUs; with fewer GPUs, you may run into CUDA out-of-memory errors.

cd train
bash train.sh
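For background only: the core idea of GRPO (Group Relative Policy Optimization, introduced in DeepSeekMath) is to sample a group of responses per prompt, score each one, and standardize each reward against its own group instead of learning a value-function baseline. The sketch below illustrates just that advantage step. It is not the code that `train.sh` runs, AV-Reasoner's actual reward functions are not described here, and the reward values in the example are made up.

```python
# Illustrative sketch of GRPO's group-relative advantage; this is not the
# repository's training code, and the rewards below are made-up numbers.
from statistics import mean, stdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each sampled response's reward against its own group.

    GRPO samples G responses for one prompt and uses
    (r_i - mean(r)) / (std(r) + eps) as the advantage of response i.
    """
    if len(rewards) < 2:
        raise ValueError("GRPO needs at least two responses per prompt")
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical responses to one counting question: two earn reward 1.0.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this example the two rewarded responses get an advantage of roughly +0.87 and the others roughly -0.87. When every response in a group receives the same reward, all advantages are zero and that group contributes no policy-gradient signal.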

🛠️ TODO List

  • Support VLMEvalKit and lmms-eval
  • Release evaluation scripts for AV-Reasoner on other benchmarks

🥰 Acknowledgement

📑 Citation

@misc{lu2025avreasoner,
    title={AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs}, 
    author={Lidong Lu and Guo Chen and Zhiqi Li and Yicheng Liu and Tong Lu},
    year={2025},
    eprint={2506.05328},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.05328}, 
}

@misc{chen2024cgbench,
    title={CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding}, 
    author={Guo Chen and Yicheng Liu and Yifei Huang and Yuping He and Baoqi Pei and Jilan Xu and Yali Wang and Tong Lu and Limin Wang},
    year={2024},
    eprint={2412.12075},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
