Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

The official implementation of the paper:

Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model

Ruixin Zhang, Jiaqing Fan, Yifan Liao, Qian Qiao, Fanzhang Li

Abstract

Referring Video Object Segmentation (RVOS) aims to segment specific objects in a video according to textual descriptions. We observe that recent RVOS approaches often place excessive emphasis on feature extraction and temporal modeling, while relatively neglecting the design of the segmentation head. In fact, there remains considerable room for improvement in segmentation head design. To address this, we propose a Temporal-Conditional Referring Video Object Segmentation model, which innovatively integrates existing segmentation methods to effectively enhance boundary segmentation capability. Furthermore, our model leverages a text-to-video diffusion model for feature extraction. On top of this, we remove the traditional noise prediction module to prevent the randomness of noise from degrading segmentation accuracy, thereby simplifying the model while improving performance. Finally, to overcome the limited feature extraction capability of the VAE, we design a Temporal Context Mask Refinement (TCMR) module, which significantly improves segmentation quality without introducing complex designs. We evaluate our method on four public RVOS benchmarks, where it consistently achieves state-of-the-art performance.

Setup for R-VOS

The R-VOS setup of our code follows ReferFormer, SgMg, and VD-IT.

First, clone the repository locally.

git clone https://github.com/buxiangzhiren/HCD
cd HCD
conda create -n hcd python=3.8
conda activate hcd

Then install PyTorch, torchvision, and the other necessary packages, as well as pycocotools. The command below uses the CUDA 12.1 wheels (cu121); choose the CUDA version that matches your device.

pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
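
After installation, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch, not part of the repository: the helper names `check_versions` and `cuda_available` are hypothetical, and the expected versions are copied from the pip command above.

```python
from importlib import metadata

# Versions pinned by the pip command above.
EXPECTED = {"torch": "2.4.0", "torchvision": "0.19.0", "torchaudio": "2.4.0"}


def check_versions(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Wheels from the PyTorch index carry a local suffix such as "+cu121",
        # so only the part before "+" is compared.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (have, ok)
    return report


def cuda_available():
    """Return True or False, or None if torch cannot be imported."""
    try:
        import torch
    except ImportError:
        return None
    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, (have, ok) in check_versions().items():
        print(f"{name}: installed={have} expected={EXPECTED[name]} ok={ok}")
    print("CUDA available:", cuda_available())
```

If you installed wheels for a different CUDA version, only the suffix after "+" changes, so the comparison above should still pass.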

Next, download the pretrained weights.

pip install huggingface_hub
huggingface-cli download --resume-download ali-vilab/text-to-video-ms-1.7b --local-dir ./weight
huggingface-cli download --resume-download laion/CLIP-ViT-H-14-laion2B-s32B-b79K --local-dir ./weight/clip
huggingface-cli download --resume-download roberta-base --local-dir ./weight/roberta
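
If you prefer to script the download, the same three repositories can be fetched from Python with `huggingface_hub.snapshot_download`. This is a sketch under the assumption that the target directories match the CLI commands above; the `WEIGHTS` list and the `download_weights` helper are illustrative names, not part of the repository.

```python
# (repo_id, local_dir) pairs copied from the CLI commands above.
WEIGHTS = [
    ("ali-vilab/text-to-video-ms-1.7b", "./weight"),
    ("laion/CLIP-ViT-H-14-laion2B-s32B-b79K", "./weight/clip"),
    ("roberta-base", "./weight/roberta"),
]


def download_weights(weights=WEIGHTS, dry_run=True):
    """Download each repo into its local_dir; with dry_run, only return the plan."""
    if not dry_run:
        # Imported lazily so the plan can be inspected without the package installed.
        from huggingface_hub import snapshot_download
    done = []
    for repo_id, local_dir in weights:
        if not dry_run:
            snapshot_download(repo_id=repo_id, local_dir=local_dir)
        done.append((repo_id, local_dir))
    return done


if __name__ == "__main__":
    # Prints the plan only; call download_weights(dry_run=False) to fetch the files.
    for repo_id, local_dir in download_weights(dry_run=True):
        print(f"{repo_id} -> {local_dir}")
```

The dry run is the default so that importing or running the script never starts a multi-gigabyte download by accident.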

Finally, compile CUDA operators.

cd models/ops
python setup.py build install
cd ../..
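
To check that the compiled operators ended up on the Python path, you can look for the extension module. The sketch below assumes the extension keeps the name `MultiScaleDeformableAttention` used by ReferFormer and Deformable DETR; this README does not state the name, so adjust `OPS_MODULE` if `models/ops/setup.py` says otherwise. The helper `ops_installed` is a hypothetical name.

```python
import importlib.util

# ASSUMPTION: extension name inherited from ReferFormer / Deformable DETR.
# Check models/ops/setup.py for the actual name used by this repository.
OPS_MODULE = "MultiScaleDeformableAttention"


def ops_installed(module_name=OPS_MODULE):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print(f"{OPS_MODULE} installed:", ops_installed())
```

This only checks that the module can be found; it does not load the extension or run any CUDA kernel.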

Please refer to data.md for data preparation.

Training and Evaluation

The training and evaluation scripts are in the scripts folder. Run the following commands to train and test, respectively:

sh ./scripts/dist_train_ytvos_hcd.sh

sh ./scripts/dist_test_ytvos_hcd.sh

Ref-Youtube-VOS & Ref-DAVIS17

A marked entry denotes results that we obtained by running the official code.

A2D-Sentences & JHMDB-Sentences

Acknowledgement

This repo is based on ReferFormer, VD-IT, and ModelScopeT2V. We thank the authors for their wonderful work.

Citation

@misc{2508.13584,
Author = {Ruixin Zhang and Jiaqing Fan and Yifan Liao and Qian Qiao and Fanzhang Li},
Title = {Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model},
Year = {2025},
Eprint = {arXiv:2508.13584},
}
