qirui-chen/RGA3-release

Object-centric Video Question Answering with Visual Grounding and Referring

🏡 Project Page | 📄 Paper | 📦 VideoInfer Dataset | 🤗 RGA3 Checkpoints

News

  • [2025-07] We have released the paper, code, datasets, and checkpoints.

Environment

First, create a conda environment and install PyTorch according to your CUDA version.

conda create -n rga3 python=3.10.16 -y
conda activate rga3
conda install pytorch==2.5.1 torchvision==0.20.1 pytorch-cuda=12.4 -c pytorch -c nvidia

pip install --upgrade pip  # enable PEP 660 support 
pip install -r requirements.txt

pip install ninja
pip install flash-attn --no-build-isolation

Then install the SAM2 package. In our implementation, the core package versions are torch==2.5.1+cu124 and flash_attn==2.7.4.post1.

Then, install the CoTracker3 package. Afterwards, install the following system packages.

apt update && apt install -y openjdk-11-jdk zip

Troubleshooting: Since we adopt an early version of Qwen2.5-VL (transformers 4.49.0.dev0), some bfloat16 issues must be fixed manually, as described in this issue.

Demo

After downloading the checkpoints and installing the environment, you can launch a Gradio interface for inference.

python app.py --version /PATH/TO/UniGR-7B

demo

Prepare Datasets

You can find the training datasets and their corresponding sampling rates in run_torchrun.sh and utils/dataset.py.

  • For image segmentation datasets, please refer to LISA.
  • For video segmentation datasets, please refer to VideoLISA & ReVOS.
  • For region-level image question-answering datasets, please refer to ViP-LLaVA & Osprey.
  • For region-level video question-answering datasets, you can download from VideoInfer & VideoRefer-Bench.
  • For general question-answering datasets, you can download from LLaVA & LLaVA-Video.

Replace the absolute paths in the code with the actual paths on your machine.
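The sampling rates mentioned above control how often each dataset is drawn from during training. A minimal sketch of weighted dataset sampling is below; the dataset names and rates are placeholders for illustration, not the values used in run_torchrun.sh or utils/dataset.py.

```python
import random

def sample_dataset(datasets, rates, rng=random):
    """Pick a dataset according to sampling rates (weights need not sum to 1)."""
    return rng.choices(list(datasets), weights=rates, k=1)[0]

# Hypothetical mixture: names and rates are placeholders, not the repo's values.
datasets = ["image_seg", "video_seg", "region_qa", "general_qa"]
rates = [4, 4, 1, 1]

rng = random.Random(0)
counts = {d: 0 for d in datasets}
for _ in range(1000):
    counts[sample_dataset(datasets, rates, rng)] += 1
print(counts)
```

With weights 4:4:1:1, the two segmentation datasets are each drawn roughly four times as often as each QA dataset.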

VideoInfer Structure

The train/test splitting of VideoInfer follows ReVOS to avoid data leakage between segmentation and question-answering.

VideoInfer-Release
├── frames                        # all images of the train set and test set
├── visual_prompts                # fixed visual prompts for the test set
├── mask_dict.json                # mask dict (train set & test set)
├── train.json                    # QA pairs & masks for generating visual prompts (train set)
└── test.json                     # QA pairs & fixed visual prompts (test set)
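The annotation files are plain JSON, so they can be loaded directly. A minimal sketch is below; the field names ("video", "question", "answer") are illustrative assumptions, so check train.json / test.json for the actual schema.

```python
import json
import os
import tempfile

def load_split(path):
    """Load a VideoInfer annotation file (a JSON list of QA records)."""
    with open(path) as f:
        return json.load(f)

# Demo with a tiny mock annotation file standing in for train.json.
# The record fields here are hypothetical placeholders.
mock = [{"video": "0001", "question": "What is the object doing?", "answer": "Running."}]
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "train.json")
    with open(path, "w") as f:
        json.dump(mock, f)
    anns = load_split(path)
print(len(anns), anns[0]["video"])
```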

Training

Our original training is conducted on 2 nodes with 8×H800 (80 GB) GPUs each, for about one day.

bash run_torchrun.sh

After training, merge the LoRA weights:

bash merge.sh

Evaluation

You can check the details of each benchmark in the evaluation folder. Before executing the inference and evaluation commands, update the code with the actual dataset paths.

Video Segmentation

For example, to evaluate on MeViS:

cd RGA3-release

# Step 1
bash evaluation/mevis_val_u/run_inference_mevis.sh

# Step 2
bash evaluation/mevis_val_u/run_eval_mevis.sh

Troubleshooting: The inference script adapted from VideoLISA may skip some samples, so you may need to run Step 1 multiple times before running Step 2.
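To decide whether Step 1 needs another pass, it helps to check which samples still lack a prediction. The sketch below assumes one prediction file per sample id in an output directory; the actual output layout of run_inference_mevis.sh may differ.

```python
import os
import tempfile

def missing_samples(expected_ids, pred_dir, ext=".json"):
    """Return expected sample ids that have no prediction file yet.

    Assumes one `<id><ext>` file per completed sample in pred_dir
    (a hypothetical layout for illustration).
    """
    done = {os.path.splitext(f)[0] for f in os.listdir(pred_dir)}
    return [i for i in expected_ids if i not in done]

# Demo with a mock prediction directory: only vid_0 has been processed.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "vid_0.json"), "w").close()
    todo = missing_samples(["vid_0", "vid_1", "vid_2"], d)
print(todo)
```

If the returned list is non-empty, rerun Step 1; once it is empty, proceed to Step 2.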

VideoRefer-BenchQ

To evaluate RGA3 on VideoRefer-BenchQ, execute the following command; the computed accuracy will be printed.

bash evaluation/videorefer_bench/run_inference_videorefer.sh

VideoInfer

To evaluate RGA3 on the VideoInfer test split, execute the following command:

bash evaluation/videoinfer/run_inference_parallel.sh

This step runs inference and computes offline metrics such as BLEU-4, saving the predicted and ground-truth answers. Afterwards, to obtain the GPT-4 accuracy/score, refer to eval_gpt.ipynb, where we implement the evaluation via the OpenAI batch API. You can also re-implement it with your own API provider, keeping the original prompt and model version.
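For batch inference, the OpenAI Batch API expects a JSONL file where each line is one chat-completions request. The sketch below only builds such a file locally (no network calls); the model name and grading prompt here are placeholders, so keep the original prompt and model version from eval_gpt.ipynb.

```python
import json

def make_batch_requests(qa_pairs, model="gpt-4o-mini"):
    """Build OpenAI Batch API input JSONL lines from (question, prediction, ground truth) triples.

    The system/user prompts and model name are illustrative placeholders.
    """
    lines = []
    for i, (question, pred, gt) in enumerate(qa_pairs):
        body = {
            "model": model,
            "messages": [
                {"role": "system", "content": "Rate the predicted answer against the ground truth."},
                {"role": "user", "content": f"Question: {question}\nPrediction: {pred}\nGround truth: {gt}"},
            ],
        }
        lines.append(json.dumps({
            "custom_id": f"sample-{i}",   # used to match responses back to samples
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": body,
        }))
    return "\n".join(lines)

jsonl = make_batch_requests([("What is the cat doing?", "Sleeping.", "The cat is sleeping.")])
print(jsonl)
```

The resulting file is uploaded with purpose "batch" and submitted via the Batches endpoint; results are matched back to samples through `custom_id`.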

We also provide evaluation scripts for several baseline methods in the baselines folder.

Citation

If you find this paper or repository helpful, please cite:

@article{wang2025object,
  title={Object-centric Video Question Answering with Visual Grounding and Referring},
  author={Wang, Haochen and Chen, Qirui and Yan, Cilin and Cai, Jiayin and Jiang, Xiaolong and Hu, Yao and Xie, Weidi and Gavves, Stratis},
  journal={arXiv preprint arXiv:2507.19599},
  year={2025}
}

🫡 Acknowledgements

  • Our code is based on LISA & VideoLISA. The approach of adding language embeddings to SAM2 is credited to Sa2VA. The implementation of generating and processing visual prompts is based on ViP-LLaVA.

  • We also thank open-source projects such as Qwen2.5-VL, CoTracker3, and SAM2.

About

[ICCV 2025] Object-centric Video Question Answering with Visual Grounding and Referring
