This is the official PyTorch implementation of our paper:
Find First, Track Next: Decoupling Identification and Propagation in Referring Video Object Segmentation, ICCVW 2025
Suhwan Cho*, Seunghoon Lee*, Minhyeok Lee, Jungho Lee, Sangyoun Lee
Link: [ICCVW] [arXiv]
You can also explore other related works at awesome-video-object-segmentation.
Existing referring VOS methods typically fuse visual and textual features in a highly entangled manner, processing multi-modal information jointly. However, this entanglement often leads to challenges in resolving ambiguous target identification and maintaining consistent mask propagation across frames. To address these issues, we propose a decoupled framework that explicitly separates object identification from mask propagation. The key frame is adaptively selected based on segmentation confidence and vision-text alignment, establishing a reliable anchor for propagation.
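The adaptive key-frame selection described above can be sketched as a simple per-frame scoring rule. This is an illustrative, simplified sketch, not the repository's actual implementation: the function name, the linear combination, and the `alpha` weight are assumptions; the paper's scoring may differ.

```python
def select_key_frame(seg_confidences, alignment_scores, alpha=0.5):
    """Pick the anchor frame for mask propagation (illustrative sketch).

    seg_confidences:  per-frame segmentation confidence in [0, 1]
    alignment_scores: per-frame vision-text alignment (e.g., CLIP-style similarity)
    alpha:            hypothetical weight balancing the two cues
    """
    scores = [alpha * s + (1.0 - alpha) * a
              for s, a in zip(seg_confidences, alignment_scores)]
    # The frame with the highest combined score becomes the propagation anchor.
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, a frame with both a confident mask and a strong text match wins over a frame that is strong in only one cue.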
1. Download the datasets: Ref-YouTube-VOS, Ref-DAVIS17, MeViS.
2. Download the Alpha-CLIP weights and place them in the weights/ directory.
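After the two steps above, the project tree might look like the sketch below. The dataset folder names and the checkpoint filename are assumptions for illustration; match them to the paths expected by the repo's configs.

```
FindTrack/
├── weights/
│   └── alpha_clip.pth        # hypothetical filename for the Alpha-CLIP checkpoint
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis17/
│   └── mevis/
├── train_ytvos.py
├── train_mevis.py
├── run_ytvos.py
└── run_mevis.py
```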
FindTrack works well in a training-free manner, but fine-tuning on specific datasets can improve performance further.
To fine-tune on the Ref-YouTube-VOS dataset:
deepspeed --num_gpus 4 train_ytvos.py
To fine-tune on the MeViS dataset:
deepspeed --num_gpus 4 train_mevis.py
To run inference on the Ref-YouTube-VOS dataset:
python run_ytvos.py
To run inference on the MeViS dataset:
python run_mevis.py
Verify the following before running:
✅ Testing dataset selection
✅ GPU availability and configuration
✅ Pre-trained model path
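The checklist above can be automated with a small pre-flight helper. This is a hypothetical convenience script, not part of the FindTrack codebase; the `weights_path` argument stands in for whatever checkpoint path your config uses.

```python
import os

def preflight_check(weights_path, require_gpu=True):
    """Return a list of problems found before launching inference.

    Hypothetical helper: checks that the pre-trained checkpoint exists
    and (optionally) that PyTorch can see a CUDA device.
    """
    problems = []
    if not os.path.isfile(weights_path):
        problems.append(f"checkpoint not found: {weights_path}")
    if require_gpu:
        try:
            import torch  # imported lazily so the path check works without it
            if not torch.cuda.is_available():
                problems.append("no CUDA device visible to PyTorch")
        except ImportError:
            problems.append("PyTorch is not installed")
    return problems

# An empty list means all checks passed and inference can be launched.
```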
You can use the web demo with your own video!
Run the Gradio demo with:
python demo.py
Code and models are only available for non-commercial research purposes.
For questions or inquiries, feel free to contact:
E-mail: suhwanx@gmail.com