
Deforming Videos to Masks: Flow Matching for Referring Video Segmentation


📢 News

[2026.02.13] 🎨🎨 We added new demos.

[2026.02.05] 🔥🔥 We released the training code.

[2026.01.26] 🎉🎉 FlowRVS was accepted at ICLR 2026!

[2025.12.01] 🔥🔥 We released the model weights and inference code.

πŸ„β€β™‚οΈ Overview

Result

FlowRVS replaces the cascaded β€˜locate-then-segment’ paradigm (A) with a unified, end-to-end flow (B). This new paradigm avoids information bottlenecks, enabling superior handling of complex language and dynamic video (C) and achieving state-of-the-art performance (D).

✨ Key Features:

  • FlowRVS reformulates RVOS as learning a continuous, text-conditioned flow that deforms a video’s spatio-temporal representation into its target mask.
  • FlowRVS successfully transfers a powerful text-to-video generative model to the RVOS task through a suite of principled adaptation techniques.
  • FlowRVS achieves new state-of-the-art (SOTA) results on key benchmarks.
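To make the flow formulation concrete, here is a toy, purely illustrative 1-D sketch (not the actual FlowRVS model; all names are ours): along a straight-line path between a source state x0 (standing in for the video representation) and a target state x1 (the mask), the ground-truth velocity is constant, and Euler integration of the ODE transports x0 to x1.

```python
# Toy 1-D illustration of the flow-matching deformation idea (hypothetical,
# not FlowRVS code): along the straight path x_t = (1 - t) * x0 + t * x1,
# the ground-truth velocity field is the constant v = x1 - x0, and
# integrating dx/dt = v from t=0 (video state) to t=1 yields the mask state.

def ideal_velocity(x0: float, x1: float) -> float:
    # Ground-truth velocity under linear interpolation: constant in t.
    return x1 - x0

def integrate(x0: float, x1: float, steps: int = 10) -> float:
    """Euler-integrate dx/dt = v from t = 0 to t = 1."""
    x, dt = x0, 1.0 / steps
    for _ in range(steps):
        x += ideal_velocity(x0, x1) * dt
    return x

print(integrate(x0=0.25, x1=1.0))  # ≈ 1.0, the target state
```

In FlowRVS the velocity is instead predicted by a text-conditioned network over spatio-temporal video latents; here it is a closed-form scalar so the transport is exact.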


🎬 Demos

We provide weights trained exclusively on the challenging MeViS dataset. Despite not seeing these domains during training, FlowRVS demonstrates remarkable zero-shot generalization across movies, sports, and internet memes. Have fun exploring!

automan_result.mp4

🦾 Ultraman

  • FPS: 12
  • Prompt: "the Ultraman", "the devil cat"
  • Note: Handles complex dynamic interactions (combat) and severe environmental interference (heavy smoke/fog). Observe the fine-grained boundary adherence on the cat's fur and the Ultraman's silhouette despite the chaos.

jams_curry_result.mp4

πŸ€β›ΉοΈβ€β™‚οΈ Basketball

  • FPS: 12
  • Prompt: "the man wearing colorful shoes shoots the ball", "the man who is defending", "basketball"
  • Note: Successfully tracks small, fast-moving objects (the basketball) and articulates complex human motion. It distinguishes the shooter from the defender even during rapid crossover movements.

saul_result.mp4

⚖️ Better Call Saul

  • FPS: 8
  • Prompt: "angry man in the suit shouting at another man"
  • Note: Demonstrates robust long-term temporal consistency. The model maintains identity and accurate segmentation over extended sequences, resisting drift even as the camera zooms and subjects interact.

🐱 Cat Memes Segmentation

the_aggressive_cat_output.mp4
the_dying_cat_output.mp4
  • Note: Robust to severe occlusions (shelf, paper roll, sausage) and significant non-rigid body deformation. The model tracks the target continuously even when it is partially hidden or undergoing extreme pose changes.

🛠️ Environment Setup

1. Create a conda environment

git clone https://github.com/xmz111/FlowRVS.git && cd FlowRVS
conda create -n flowrvs python=3.10 -y
conda activate flowrvs

2. Install dependencies

pip install -r requirements.txt

3. Prepare the Wan2.1 T2V model; its config is needed to construct the models and the T5 text encoder.

pip install "huggingface_hub[cli]"
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B-Diffusers --local-dir ./Wan2.1-T2V-1.3B-Diffusers

🍻 Inference

Inference on MeViS val and val_u splits.

1. Prepare data

The dataset can be downloaded from https://github.com/henghuiding/MeViS.
After downloading, the dataset directory structure should look like this:

  • datasets
    • MeViS/
      • valid/
        • JPEGImages/
        • meta_expressions.json
      • valid_u/
        • JPEGImages/
        • mask_dict.json
        • meta_expressions.json

For example, you can fetch it with gdown:

pip install gdown
gdown https://drive.google.com/drive/folders/1MACaQ-O8seyMj-MBlycxRgCT08RVBZJp --folder -O dataset/MeViS/

2. Download the DiT and tuned-VAE checkpoints from https://huggingface.co/xmz111/FlowRVS and place them as FlowRVS_dit_mevis.pth and tuned_vae.pth.

3. Inference

Just run:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 inference_mevis.py --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth --output_dir=result --split=valid_u

Note that this command requires about 33 GB of GPU memory with the default settings.

Inference on arbitrary videos:

python inference_demo.py --input_path=video.mp4 --text_prompts "prompt_1" "prompt_2" --fps=12 --save_fig --output_dir=result --dit_ckpt=FlowRVS_dit_mevis.pth --vae_ckpt=tuned_vae.pth

🥂 Training

Use --dataset_file to select the training dataset (mevis, pretrain, ytvos), and --resume to resume from a checkpoint.

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --dataset_file=mevis --num_frames=17 --lr=5e-5 --output_dir=mevis_training
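The training run above optimizes a conditional flow-matching objective. A minimal, hypothetical sketch of that objective (toy scalars standing in for video/mask latents; all names are ours, not the repository's API):

```python
# Hypothetical sketch of the conditional flow-matching objective (toy 1-D,
# not FlowRVS code): sample t ~ U(0, 1), form the interpolated state x_t
# between the video latent x0 and the mask latent x1, and regress the
# model's predicted velocity onto the constant target x1 - x0.
import random

def flow_matching_loss(predict_velocity, x0: float, x1: float) -> float:
    t = random.random()              # sample a time on the path
    x_t = (1.0 - t) * x0 + t * x1    # point on the straight path
    target = x1 - x0                 # ground-truth velocity
    err = predict_velocity(x_t, t) - target
    return err * err                 # per-sample squared error

# A predictor that outputs the true velocity drives the loss to zero:
loss = flow_matching_loss(lambda x_t, t: 1.0, x0=0.0, x1=1.0)
print(loss)  # 0.0
```

In the actual training code the predictor is the text-conditioned DiT and the loss is averaged over batches of spatio-temporal latents; this scalar version only illustrates the regression target.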

💚 Acknowledgement

We referenced the following works and appreciate their contributions to the community.

🔗 BibTeX

If you find FlowRVS useful for your research and applications, please cite:

@article{wang2025flowrvs,
  title={Deforming Videos to Masks: Flow Matching for Referring Video Segmentation},
  author={Wang, Zanyi and Jiang, Dengyang and Li, Liuzhuozheng and Dang, Sizhe and Li, Chengzu and Yang, Harry and Dai, Guang and Wang, Mengmeng and Wang, Jingdong},
  journal={arXiv preprint arXiv:2510.06139}, 
  year={2025}
}
