Tianming Liang¹ Haichao Jiang¹ Yuting Yang¹ Chaolei Tan¹ Shuai Li² Wei-Shi Zheng¹ Jian-Fang Hu¹*
¹Sun Yat-sen University ²Shandong University
Long-RVOS is the first large-scale long-term referring video object segmentation benchmark, containing 2,000+ videos with an average duration exceeding 60 seconds.
The Long-RVOS dataset is available on HuggingFace Hub. Use our download script:
python scripts/download_dataset.py \
--repo_id iSEE-Laboratory/Long-RVOS \
    --output_dir data
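Alternatively, the dataset can be fetched with the huggingface_hub Python API. A minimal sketch, assuming the data is hosted as a dataset repo (the download script above remains the recommended route):

# Hypothetical alternative to scripts/download_dataset.py
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="iSEE-Laboratory/Long-RVOS",
    repo_type="dataset",   # assumption: hosted as a dataset repo
    local_dir="data",
)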
Or manually download from Google Drive and extract:

data/
├── long_rvos/
│   ├── train/
│   │   ├── JPEGImages/
│   │   ├── Annotations/
│   │   └── meta_expressions.json
│   ├── valid/
│   │   ├── JPEGImages/
│   │   ├── Annotations/
│   │   └── meta_expressions.json
│   └── test/
│       ├── JPEGImages/
│       ├── Annotations/
│       └── meta_expressions.json
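To get a feel for the annotations, the expression file can be inspected with a few lines of Python. A minimal sketch, assuming a Ref-YouTube-VOS-style schema (videos → expressions); check the file itself for the exact keys:

import json

# Assumed schema: {"videos": {video_id: {"expressions": {exp_id: {"exp": ...}}}}}
with open("data/long_rvos/valid/meta_expressions.json") as f:
    meta = json.load(f)

for video_id, video in list(meta["videos"].items())[:3]:
    for exp_id, exp in video["expressions"].items():
        print(video_id, exp_id, exp["exp"])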
# Clone the repo
git clone https://github.com/iSEE-Laboratory/Long_RVOS.git
cd Long_RVOS
# [Optional] Create a clean Conda environment
conda create -n long_rvos python=3.10 -y
conda activate long_rvos
# PyTorch
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
# MultiScaleDeformableAttention
cd models/GroundingDINO/ops
python setup.py build install
python test.py
cd ../../..
# Other dependencies
pip install -r requirements.txt
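With the environment set up, a quick optional check that PyTorch can see your GPUs (helpful before multi-GPU training):

# Optional sanity check for the PyTorch/CUDA setup.
import torch
print(torch.__version__)          # expect 2.5.1
print(torch.version.cuda)         # expect 12.4
print(torch.cuda.device_count())  # number of visible GPUs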
ReferMo uses SAM2 for mask propagation. Please install SAM2 following the official instructions:

cd sam2
pip install -e .
cd ..

Download SAM2 checkpoints and put them in sam2/checkpoints/:
cd sam2/checkpoints
bash download_ckpts.sh
cd ../..
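To confirm that SAM2 and its checkpoints load correctly, you can try building a video predictor. A minimal sketch; the config and checkpoint names below are assumptions, so match them to the files actually fetched by download_ckpts.sh:

# Optional check that SAM2 builds a video predictor from a downloaded checkpoint.
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",       # assumed config name
    "sam2/checkpoints/sam2.1_hiera_large.pt",   # assumed checkpoint path
)
print(type(predictor).__name__)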
Download pretrained GroundingDINO weights and put them in the pretrained directory:

mkdir pretrained
cd pretrained
wget https://github.com/longzw1997/Open-GroundingDino/releases/download/v0.1.0/gdinot-1.8m-odvg.pth # default
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
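The downloaded weights can be sanity-checked by loading them on CPU (illustrative only; the key layout differs between checkpoints):

# Optional: confirm a GroundingDINO checkpoint is readable.
import torch

ckpt = torch.load("pretrained/gdinot-1.8m-odvg.pth", map_location="cpu")
print(list(ckpt.keys()))  # usually includes a 'model' state dict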
If you need to extract motion frames from videos, use:
python scripts/extract_motion.py --data_dir data/long_rvos --output_dir motions

Or you can download our processed motions from Google Drive and extract:
motions/
├── train/
│   ├── motions/
│   └── frame_types.json
├── valid/
│   ├── motions/
│   └── frame_types.json
└── test/
    ├── motions/
    └── frame_types.json

Train ReferMo on Long-RVOS with:

python main.py -c configs/lrvos_swint.yaml -rm train -bs 2 -ng 8 --version refermo --epochs 6

Note: you can download our checkpoint from refermo_swint.pth and put it in the ckpt directory. For inference, run:
PYTHONPATH=. python eval/inference_lrvos_with_motion.py \
-ng 8 \
-ckpt ckpt/refermo_swint.pth \
--split valid \
--version refermo

📌 The results will be saved at output/long_rvos/{split}/{version}.
📌 We also provide a script eval/inference_lrvos.py for ReferDINO-style inference, which does not use motions.
After inference, evaluate the results:
bash run_eval.sh output/long_rvos/valid/refermo valid
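For intuition about what the evaluation compares, below is an illustrative per-frame IoU between a predicted and a ground-truth binary mask. This is not the metric code used by run_eval.sh; the PNG mask format and file names are assumptions:

import numpy as np
from PIL import Image

def mask_iou(pred_path, gt_path):
    # Binarize both masks and compute intersection-over-union.
    pred = np.array(Image.open(pred_path).convert("L")) > 0
    gt = np.array(Image.open(gt_path).convert("L")) > 0
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0

# Hypothetical file names, purely for illustration.
print(mask_iou("pred_00000.png", "gt_00000.png"))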
Our code is built upon ReferDINO, GroundingDINO, and SAM2. We sincerely appreciate these efforts.

If you find our work helpful for your research, please consider citing our paper:
@article{liang2025longrvos,
title={Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation},
author={Liang, Tianming and Jiang, Haichao and Yang, Yuting and Tan, Chaolei and Li, Shuai and Zheng, Wei-Shi and Hu, Jian-Fang},
journal={arXiv preprint arXiv:2505.12702},
year={2025}
}

This project is licensed under the MIT License. Please refer to the LICENSE file for details.
