
Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

       

📢 News

[2026.02.13] 🎨🎨 We updated interesting demos.

[2026.02.05] 🔥🔥 We updated training code.

[2026.01.26] 🎉🎉FlowRVS was accepted by ICLR 2026!

[2025.12.01] 🔥🔥 We updated model weight and inference code.

🏄‍♂️ Overview


FlowRVS replaces the cascaded ‘locate-then-segment’ paradigm (A) with a unified, end-to-end flow (B). This new paradigm avoids information bottlenecks, enabling superior handling of complex language and dynamic video (C) and achieving state-of-the-art performance (D).

✨ Key Features:

  • FlowRVS reformulates RVOS as learning a continuous, text-conditioned flow that deforms a video’s spatio-temporal representation into its target mask.
  • FlowRVS successfully transfers a powerful text-to-video generative model to the RVOS task through a suite of principled techniques.
  • FlowRVS achieves new state-of-the-art (SOTA) results on key benchmarks.

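The core idea above — learning a text-conditioned flow that deforms a video representation into its mask — follows the standard flow-matching recipe. Below is a minimal NumPy sketch of one flow-matching training objective under a linear interpolation path; it is illustrative only (all names are made up here, and the repository's actual implementation is not this code):

```python
import numpy as np

def flow_matching_loss(video_latent, mask_latent, velocity_model, rng):
    """One flow-matching training step on a linear path.

    The sample x_t interpolates between the video latent (t=0) and the
    target mask latent (t=1); the model regresses the constant velocity
    (mask_latent - video_latent) that transports one into the other.
    """
    t = rng.uniform(0.0, 1.0)                         # random time in [0, 1]
    x_t = (1.0 - t) * video_latent + t * mask_latent  # point on the path
    target_v = mask_latent - video_latent             # ground-truth velocity
    pred_v = velocity_model(x_t, t)                   # model prediction
    return float(np.mean((pred_v - target_v) ** 2))   # MSE regression loss

# Toy check: a model that already outputs the true velocity has zero loss.
rng = np.random.default_rng(0)
v0 = rng.normal(size=(4, 4))      # stand-in for a video latent
v1 = rng.normal(size=(4, 4))      # stand-in for a mask latent
perfect = lambda x, t: v1 - v0
print(flow_matching_loss(v0, v1, perfect, rng))  # → 0.0
```

At inference time the learned velocity field is integrated from the video latent toward the mask latent, which is what replaces the cascaded locate-then-segment pipeline.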

🎬 Demos

We provide weights trained exclusively on the challenging MeViS dataset. Despite not seeing these domains during training, FlowRVS demonstrates remarkable zero-shot generalization across movies, sports, and internet memes. Have fun exploring!

automan_result.mp4

🦾 Ultraman

  • FPS: 12
  • Prompt: "the Ultraman", "the devil cat"
  • Note: Handles complex dynamic interactions (combat) and severe environmental interference (heavy smoke/fog). Observe the fine-grained boundary adherence on the cat's fur and the Ultraman's silhouette despite the chaos.

jams_curry_result.mp4

🏀⛹️‍♂️ Basketball

  • FPS: 12
  • Prompt: "the man wearing colorful shoes shoots the ball", "the man who is defending", "basketball"
  • Note: Successfully tracks small, fast-moving objects (the basketball) and articulates complex human motion. It distinguishes the shooter from the defender even during rapid crossover movements.

saul_result.mp4

⚖️ Better Call Saul

  • FPS: 8
  • Prompt: "angry man in the suit shouting at another man"
  • Note: Demonstrates robust long-term temporal consistency. The model maintains identity and accurate segmentation over extended sequences, resisting drift even as the camera zooms and subjects interact.

🐱 Cat Memes Segmentation

the_aggressive_cat_output.mp4
the_dying_cat_output.mp4
  • Note: Robustness against severe occlusions (shelf, paper roll, sausage) and significant non-rigid body deformation. The model tracks the target continuously even when partially hidden or undergoing extreme pose changes.

🛠️ Environment Setup

1. Create a conda environment

git clone https://github.com/xmz111/FlowRVS.git && cd FlowRVS
conda create -n flowrvs python=3.10 -y
conda activate flowrvs

2. Install dependencies

pip install -r requirements.txt

3. Prepare the Wan2.1 T2V model; its config is needed to construct the models, together with the T5 encoder.

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./Wan2.1-T2V-1.3B-Diffusers
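A quick sanity check after the download can catch an interrupted transfer early. The expected entries below assume the standard Diffusers checkpoint layout (a `model_index.json` plus component subfolders) and may need adjusting to the repo's actual contents:

```python
import os

# Expected top-level entries of a Diffusers-format checkpoint; this list
# assumes the standard layout and is not a guarantee of the exact repo contents.
EXPECTED = ["model_index.json", "text_encoder", "tokenizer", "transformer", "vae"]

def missing_entries(root):
    """Return the expected entries not present under `root`."""
    return [name for name in EXPECTED
            if not os.path.exists(os.path.join(root, name))]

gaps = missing_entries("./Wan2.1-T2V-1.3B-Diffusers")
if gaps:
    print("Incomplete download, missing:", gaps)
else:
    print("Wan2.1 checkpoint looks complete.")
```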

🍻 Inference

Inference on the MeViS valid and valid_u splits.

1. Prepare data

The dataset can be found at https://github.com/henghuiding/MeViS.
After downloading, the directory structure should look like this:

  • datasets
    • MeViS/
      • valid/
        • JPEGImages/
        • meta_expressions.json
      • valid_u/
        • JPEGImages/
        • mask_dict.json
        • meta_expressions.json
pip install gdown
gdown https://drive.google.com/drive/folders/1MACaQ-O8seyMj-MBlycxRgCT08RVBZJp --folder -O datasets/MeViS/
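The layout can then be verified against the tree above with a short script. The relative paths are copied verbatim from that tree (note the valid/ split lists no mask_dict.json):

```python
import os

# Relative paths copied from the directory tree above.
REQUIRED = [
    "valid/JPEGImages",
    "valid/meta_expressions.json",
    "valid_u/JPEGImages",
    "valid_u/mask_dict.json",
    "valid_u/meta_expressions.json",
]

def verify_mevis(root="datasets/MeViS"):
    """Return the required files/folders missing under the MeViS root."""
    return [p for p in REQUIRED if not os.path.exists(os.path.join(root, p))]

missing = verify_mevis()
print("MeViS layout OK" if not missing else f"Missing: {missing}")
```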

2. Download the DiT and tuned VAE checkpoints from https://huggingface.co/xmz111/FlowRVS and place them as FlowRVS_dit_mevis.pth and tuned_vae.pth.

3. Inference

Just run:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference_mevis.py --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth --output_dir=result --split=valid_u

Note that this will use about 33 GB of GPU memory with the default settings.

Inference on arbitrary videos.

python inference_demo.py --input_path=video.mp4  --text_prompts "prompt_1" "prompt_2"    --fps=12 --save_fig --output_dir=result  --dit_ckpt=FlowRVS_dit_mevis.pth  --vae_ckpt=tuned_vae.pth
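When processing several videos with different prompts, the demo invocation above can be scripted. This sketch only assembles the command line shown above; the video names and prompts in the batch are made-up examples, so swap in your own before uncommenting the launch:

```python
import subprocess  # used only by the commented-out launch line below

def demo_cmd(video, prompts, fps=12, output_dir="result"):
    """Build the inference_demo.py invocation shown above."""
    return [
        "python", "inference_demo.py",
        f"--input_path={video}",
        "--text_prompts", *prompts,
        f"--fps={fps}",
        "--save_fig",
        f"--output_dir={output_dir}",
        "--dit_ckpt=FlowRVS_dit_mevis.pth",
        "--vae_ckpt=tuned_vae.pth",
    ]

# Hypothetical batch: one (video, prompts) pair per run.
jobs = [("clip_a.mp4", ["the running dog"]),
        ("clip_b.mp4", ["the red car", "the cyclist"])]
for video, prompts in jobs:
    cmd = demo_cmd(video, prompts)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually launch
```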

🥂 Training

Use --dataset_file to select the training dataset (mevis, pretrain, ytvos), and --resume to load a checkpoint.

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2  main.py  --dataset_file=mevis --num_frames=17 --lr=5e-5 --output_dir=mevis_training 
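The training flags compose the same way when sweeping datasets or resuming runs. The helper below only builds the torchrun command line shown above (the checkpoint filename is hypothetical):

```python
def train_cmd(dataset, resume=None, num_frames=17, lr="5e-5", nproc=2):
    """Build the torchrun training invocation shown above."""
    cmd = [
        "torchrun", f"--nproc_per_node={nproc}", "main.py",
        f"--dataset_file={dataset}",
        f"--num_frames={num_frames}",
        f"--lr={lr}",
        f"--output_dir={dataset}_training",
    ]
    if resume:
        cmd.append(f"--resume={resume}")  # load an existing checkpoint
    return cmd

# Hypothetical checkpoint name for a resumed MeViS run.
print(" ".join(train_cmd("mevis", resume="checkpoint.pth")))
```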

💚 Acknowledgement

We referenced the following works, and appreciate their contributions to the community.

🔗 BibTeX

If you find our FlowRVS useful for your research and applications, please kindly cite us:

@article{wang2025flowrvs,
  title={Deforming Videos to Masks: Flow Matching for Referring Video Segmentation},
  author={Wang, Zanyi and Jiang, Dengyang and Li, Liuzhuozheng and Dang, Sizhe and Li, Chengzu and Yang, Harry and Dai, Guang and Wang, Mengmeng and Wang, Jingdong},
  journal={arXiv preprint arXiv:2510.06139}, 
  year={2025}
}
