[2026.02.13] 🎨🎨 We added interesting new demos.
[2026.02.05] 🔥🔥 We updated the training code.
[2026.01.26] 🎉🎉 FlowRVS was accepted to ICLR 2026!
[2025.12.01] 🔥🔥 We updated the model weights and inference code.
FlowRVS replaces the cascaded ‘locate-then-segment’ paradigm (A) with a unified, end-to-end flow (B). This new paradigm avoids information bottlenecks, enabling superior handling of complex language and dynamic video (C) and achieving state-of-the-art performance (D).

✨ Key Features:
- FlowRVS reformulates RVOS as learning a continuous, text-conditioned flow that deforms a video’s spatio-temporal representation into its target mask.
- FlowRVS successfully transfers a powerful text-to-video generative model to the RVOS task through a suite of principled techniques.
- FlowRVS achieves new state-of-the-art (SOTA) results on key benchmarks.
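The core idea can be sketched with a rectified-flow objective: interpolate between the video latent and the mask latent, and regress the constant straight-line velocity. The snippet below is a minimal numpy stand-in — the shapes, the predictor, and the noise are all illustrative assumptions, not the repo's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for VAE-encoded latents (shapes are assumptions).
video_latent = rng.normal(size=(4, 8, 8))  # spatio-temporal clip representation
mask_latent = rng.normal(size=(4, 8, 8))   # target mask representation

# Rectified-flow interpolation at a random training time t in [0, 1]:
t = rng.uniform()
x_t = (1.0 - t) * video_latent + t * mask_latent

# The regression target is the constant velocity along the straight path.
target_velocity = mask_latent - video_latent

# A text-conditioned DiT would predict this velocity from (x_t, t, text);
# here a noisy stand-in prediction illustrates the MSE training loss.
predicted = target_velocity + 0.01 * rng.normal(size=x_t.shape)
loss = float(np.mean((predicted - target_velocity) ** 2))
```

In this view the model never sees a detection or grounding stage: the mask emerges by continuously deforming the clip's representation, conditioned on the referring text.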
We provide weights trained exclusively on the challenging MeViS dataset. Despite not seeing these domains during training, FlowRVS demonstrates remarkable zero-shot generalization across movies, sports, and internet memes. Have fun exploring!
- automan_result.mp4
- jams_curry_result.mp4
- saul_result.mp4
- the_aggressive_cat_output.mp4
- the_dying_cat_output.mp4
- Robustness against severe occlusions (shelf, paper roll, sausage) and significant non-rigid body deformation. The model tracks the target continuously even when partially hidden or undergoing extreme pose changes.
git clone https://github.com/xmz111/FlowRVS.git && cd FlowRVS
conda create -n flowrvs python=3.10 -y
conda activate flowrvs
pip install -r requirements.txt
pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./Wan2.1-T2V-1.3B-Diffusers
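As a quick sanity check after downloading, you can verify the base model is in place; the `model_index.json` check below assumes the standard Diffusers pipeline layout:

```python
from pathlib import Path

def check_pipeline_dir(base: Path) -> list[str]:
    """Return names of expected files missing from a Diffusers pipeline dir."""
    # Diffusers-format pipelines ship a model_index.json at the repo root.
    expected = ["model_index.json"]
    return [name for name in expected if not (base / name).exists()]

missing = check_pipeline_dir(Path("./Wan2.1-T2V-1.3B-Diffusers"))
if missing:
    print(f"Missing: {missing} -- re-run the huggingface-cli download")
else:
    print("Base model download looks complete.")
```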
The dataset can be found at: https://github.com/henghuiding/MeViS
After you have downloaded the dataset, its file structure should look like this:
- datasets/
  - MeViS/
    - valid/
      - JPEGImages/
      - meta_expressions.json
    - valid_u/
      - JPEGImages/
      - mask_dict.json
      - meta_expressions.json
pip install gdown
gdown https://drive.google.com/drive/folders/1MACaQ-O8seyMj-MBlycxRgCT08RVBZJp --folder -O dataset/MeViS/
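Once the data is in place, a short script can confirm the layout above and count annotations. The JSON schema assumed here — a top-level "videos" dict with per-video "expressions" — follows the Refer-YouTube-VOS/MeViS convention:

```python
import json
from pathlib import Path

def summarize_split(split_dir: Path) -> str:
    """Report video/expression counts for one MeViS split (schema assumed)."""
    meta = split_dir / "meta_expressions.json"
    if not meta.exists():
        return "meta_expressions.json missing"
    videos = json.loads(meta.read_text())["videos"]
    n_expr = sum(len(v["expressions"]) for v in videos.values())
    return f"{len(videos)} videos, {n_expr} expressions"

root = Path("datasets/MeViS")
for split in ("valid", "valid_u"):
    print(f"{split}: {summarize_split(root / split)}")
```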
Download the DiT and tuned VAE checkpoints from https://huggingface.co/xmz111/FlowRVS and place them as FlowRVS_dit_mevis.pth and tuned_vae.pth (matching the --dit_ckpt and --vae_ckpt flags used below).
Just run:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference_mevis.py --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth --output_dir=result --split=valid_u
Note that inference requires about 33 GB of GPU memory with the default settings.
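Conceptually, inference integrates the learned velocity field from the video latent toward the mask latent. Below is a minimal Euler-integration sketch with an idealized numpy stand-in for the DiT — an illustration of the sampling idea, not the repo's actual sampler:

```python
import numpy as np

rng = np.random.default_rng(0)
video_latent = rng.normal(size=(4, 8, 8))  # stand-in for the encoded clip
mask_latent = rng.normal(size=(4, 8, 8))   # target, used here only to fake
                                           # an ideal velocity predictor

def velocity(x, t):
    # Stand-in for the text-conditioned DiT: the ideal straight-line
    # velocity, so the integration can be checked end to end.
    return mask_latent - video_latent

steps = 10
x = video_latent.copy()
for i in range(steps):
    x = x + (1.0 / steps) * velocity(x, i / steps)  # Euler step along the flow
```

With the ideal velocity, the trajectory lands exactly on the mask latent after integrating over [0, 1]; a trained model approximates this field from the video and the referring expression.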
python inference_demo.py --input_path=video.mp4 --text_prompts "prompt_1" "prompt_2" --fps=12 --save_fig --output_dir=result --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth
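To run the demo over several clips, a small wrapper can assemble the same command line per (video, prompt) pair. The helper below only prints the commands as a dry run — pipe them to a shell or swap in subprocess.run to execute — and the job list is a placeholder:

```python
def build_cmd(video: str, prompt: str) -> list[str]:
    """Assemble an inference_demo.py invocation mirroring the command above."""
    return [
        "python", "inference_demo.py",
        f"--input_path={video}",
        "--text_prompts", prompt,
        "--fps=12", "--save_fig", "--output_dir=result",
        "--dit_ckpt=FlowRVS_dit_mevis.pth",
        "--vae_ckpt=tuned_vae.pth",
    ]

# Placeholder jobs; replace with your own clips and prompts.
jobs = [("video.mp4", "the cat on the shelf")]
for video, prompt in jobs:
    print(" ".join(build_cmd(video, prompt)))
```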
Use --dataset_file to select the training dataset (mevis, pretrain, or ytvos), and use --resume to load a checkpoint.
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --dataset_file=mevis --num_frames=17 --lr=5e-5 --output_dir=mevis_training
We referenced the following works and appreciate their contributions to the community.
If you find FlowRVS useful for your research or applications, please cite us:
@article{wang2025flowrvs,
title={Deforming Videos to Masks: Flow Matching for Referring Video Segmentation},
author={Wang, Zanyi and Jiang, Dengyang and Li, Liuzhuozheng and Dang, Sizhe and Li, Chengzu and Yang, Harry and Dai, Guang and Wang, Mengmeng and Wang, Jingdong},
journal={arXiv preprint arXiv:2510.06139},
year={2025}
}

