
Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

       

📢 News

[2026.02.13] 🎨🎨 We updated interesting demos.

[2026.02.05] 🔥🔥 We updated training code.

[2026.01.26] 🎉🎉FlowRVS was accepted by ICLR 2026!

[2025.12.01] 🔥🔥 We updated model weight and inference code.

🏄‍♂️ Overview


FlowRVS replaces the cascaded ‘locate-then-segment’ paradigm (A) with a unified, end-to-end flow (B). This new paradigm avoids information bottlenecks, enabling superior handling of complex language and dynamic video (C) and achieving state-of-the-art performance (D).

✨ Key Features:

  • FlowRVS reformulates RVOS as learning a continuous, text-conditioned flow that deforms a video’s spatio-temporal representation into its target mask.
  • FlowRVS successfully transfers a powerful text-to-video generative model to the RVOS task through a suite of principled techniques.
  • FlowRVS achieves new state-of-the-art (SOTA) results on key benchmarks.

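The core idea above — learning a text-conditioned flow that deforms a video representation into its mask — follows the standard flow-matching recipe. Below is a minimal NumPy sketch of one flow-matching training objective under a linear interpolation path; it is illustrative only (all names are made up here, and the repository's actual implementation is not this code):

```python
import numpy as np

def flow_matching_loss(video_latent, mask_latent, velocity_model, rng):
    """One flow-matching training step on a linear path.

    The sample x_t interpolates between the video latent (t=0) and the
    target mask latent (t=1); the model regresses the constant velocity
    (mask_latent - video_latent) that transports one into the other.
    """
    t = rng.uniform(0.0, 1.0)                         # random time in [0, 1]
    x_t = (1.0 - t) * video_latent + t * mask_latent  # point on the path
    target_v = mask_latent - video_latent             # ground-truth velocity
    pred_v = velocity_model(x_t, t)                   # model prediction
    return float(np.mean((pred_v - target_v) ** 2))   # MSE regression loss

# Toy check: a model that already outputs the true velocity has zero loss.
rng = np.random.default_rng(0)
v0 = rng.normal(size=(4, 4))      # stand-in for a video latent
v1 = rng.normal(size=(4, 4))      # stand-in for a mask latent
perfect = lambda x, t: v1 - v0
print(flow_matching_loss(v0, v1, perfect, rng))  # → 0.0
```

At inference time the learned velocity field is integrated from the video latent toward the mask latent, which is what replaces the cascaded locate-then-segment pipeline.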

🎬 Demos

We provide weights trained exclusively on the challenging MeViS dataset. Despite not seeing these domains during training, FlowRVS demonstrates remarkable zero-shot generalization across movies, sports, and internet memes. Have fun exploring!

automan_result.mp4

🦾 Ultraman

  • FPS: 12
  • Prompt: "the Ultraman", "the devil cat"
  • Note: Handles complex dynamic interactions (combat) and severe environmental interference (heavy smoke/fog). Observe the fine-grained boundary adherence on the cat's fur and the Ultraman's silhouette despite the chaos.

jams_curry_result.mp4

🏀⛹️‍♂️ Basketball

  • FPS: 12
  • Prompt: "the man wearing colorful shoes shoots the ball", "the man who is defending", "basketball"
  • Note: Successfully tracks small, fast-moving objects (the basketball) and articulates complex human motion. It distinguishes the shooter from the defender even during rapid crossover movements.

saul_result.mp4

⚖️ Better Call Saul

  • FPS: 8
  • Prompt: "angry man in the suit shouting at another man"
  • Note: Demonstrates robust long-term temporal consistency. The model maintains identity and accurate segmentation over extended sequences, resisting drift even as the camera zooms and subjects interact.

🐱 Cat Memes Segmentation

the_aggressive_cat_output.mp4
the_dying_cat_output.mp4
  • Note: Robustness against severe occlusions (shelf, paper roll, sausage) and significant non-rigid body deformation. The model tracks the target continuously even when partially hidden or undergoing extreme pose changes.

🛠️ Environment Setup

1. Create a conda environment

git clone https://github.com/xmz111/FlowRVS.git && cd FlowRVS
conda create -n flowrvs python=3.10 -y
conda activate flowrvs

2. Install dependencies

pip install -r requirements.txt

3. Prepare the Wan2.1 T2V model; its config is needed to construct the models, together with the T5 encoder.

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./Wan2.1-T2V-1.3B-Diffusers
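A quick sanity check after the download can catch an interrupted transfer early. The expected entries below assume the standard Diffusers checkpoint layout (a `model_index.json` plus component subfolders) and may need adjusting to the repo's actual contents:

```python
import os

# Expected top-level entries of a Diffusers-format checkpoint; this list
# assumes the standard layout and is not a guarantee of the exact repo contents.
EXPECTED = ["model_index.json", "text_encoder", "tokenizer", "transformer", "vae"]

def missing_entries(root):
    """Return the expected entries not present under `root`."""
    return [name for name in EXPECTED
            if not os.path.exists(os.path.join(root, name))]

gaps = missing_entries("./Wan2.1-T2V-1.3B-Diffusers")
if gaps:
    print("Incomplete download, missing:", gaps)
else:
    print("Wan2.1 checkpoint looks complete.")
```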

🍻 Inference

Inference on the MeViS valid and valid_u splits.

1. Prepare data

The dataset can be found at https://github.com/henghuiding/MeViS.
After downloading, the directory structure should look like this:

  • datasets
    • MeViS/
      • valid/
        • JPEGImages/
        • meta_expressions.json
      • valid_u/
        • JPEGImages/
        • mask_dict.json
        • meta_expressions.json
pip install gdown
gdown https://drive.google.com/drive/folders/1MACaQ-O8seyMj-MBlycxRgCT08RVBZJp --folder -O datasets/MeViS/
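The layout can then be verified against the tree above with a short script. The relative paths are copied verbatim from that tree (note the valid/ split lists no mask_dict.json):

```python
import os

# Relative paths copied from the directory tree above.
REQUIRED = [
    "valid/JPEGImages",
    "valid/meta_expressions.json",
    "valid_u/JPEGImages",
    "valid_u/mask_dict.json",
    "valid_u/meta_expressions.json",
]

def verify_mevis(root="datasets/MeViS"):
    """Return the required files/folders missing under the MeViS root."""
    return [p for p in REQUIRED if not os.path.exists(os.path.join(root, p))]

missing = verify_mevis()
print("MeViS layout OK" if not missing else f"Missing: {missing}")
```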

2. Download the DiT and tuned VAE checkpoints from https://huggingface.co/xmz111/FlowRVS and place them as FlowRVS_dit_mevis.pth and tuned_vae.pth.

3. Inference

Just run:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference_mevis.py --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth --output_dir=result --split=valid_u

Note that this will use about 33 GB of GPU memory with the default settings.

Inference on arbitrary videos.

python inference_demo.py --input_path=video.mp4  --text_prompts "prompt_1" "prompt_2"    --fps=12 --save_fig --output_dir=result  --dit_ckpt=FlowRVS_dit_mevis.pth  --vae_ckpt=tuned_vae.pth
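When processing several videos with different prompts, the demo invocation above can be scripted. This sketch only assembles the command line shown above; the video names and prompts in the batch are made-up examples, so swap in your own before uncommenting the launch:

```python
import subprocess  # used only by the commented-out launch line below

def demo_cmd(video, prompts, fps=12, output_dir="result"):
    """Build the inference_demo.py invocation shown above."""
    return [
        "python", "inference_demo.py",
        f"--input_path={video}",
        "--text_prompts", *prompts,
        f"--fps={fps}",
        "--save_fig",
        f"--output_dir={output_dir}",
        "--dit_ckpt=FlowRVS_dit_mevis.pth",
        "--vae_ckpt=tuned_vae.pth",
    ]

# Hypothetical batch: one (video, prompts) pair per run.
jobs = [("clip_a.mp4", ["the running dog"]),
        ("clip_b.mp4", ["the red car", "the cyclist"])]
for video, prompts in jobs:
    cmd = demo_cmd(video, prompts)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch
```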

🥂 Training

Use --dataset_file to select the training dataset (mevis, pretrain, ytvos), and --resume to load a checkpoint.

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2  main.py  --dataset_file=mevis --num_frames=17 --lr=5e-5 --output_dir=mevis_training 
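The training flags compose the same way when sweeping datasets or resuming runs. The helper below only builds the torchrun command line shown above (the checkpoint filename is hypothetical):

```python
def train_cmd(dataset, resume=None, num_frames=17, lr="5e-5", nproc=2):
    """Build the torchrun training invocation shown above."""
    cmd = [
        "torchrun", f"--nproc_per_node={nproc}", "main.py",
        f"--dataset_file={dataset}",
        f"--num_frames={num_frames}",
        f"--lr={lr}",
        f"--output_dir={dataset}_training",
    ]
    if resume:
        cmd.append(f"--resume={resume}")  # load an existing checkpoint
    return cmd

# Hypothetical checkpoint name for a resumed MeViS run.
print(" ".join(train_cmd("mevis", resume="checkpoint.pth")))
```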

💚 Acknowledgement

We referenced the following works, and appreciate their contributions to the community.

🔗 BibTeX

If you find our FlowRVS useful for your research and applications, please kindly cite us:

@article{wang2025flowrvs,
  title={Deforming Videos to Masks: Flow Matching for Referring Video Segmentation},
  author={Wang, Zanyi and Jiang, Dengyang and Li, Liuzhuozheng and Dang, Sizhe and Li, Chengzu and Yang, Harry and Dai, Guang and Wang, Mengmeng and Wang, Jingdong},
  journal={arXiv preprint arXiv:2510.06139}, 
  year={2025}
}
