MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
[June 26th, 2025] 🔥MPG-SAM 2 has been accepted by ICCV 2025! We have open-sourced the code and model.
[August 9th, 2025] Figure 2 in both the ICCV camera-ready version and earlier arXiv versions contains errors. Please refer to the latest version on arXiv for the correct diagram.
Our models can be obtained from this Hugging Face link: MPG-SAM 2 · HF. The tiny model uses BEiT-3-base, while the other models use BEiT-3-large. "ytvos" refers to the model trained on the Ref-Youtube-VOS dataset, and "mevis" refers to the model trained on the MeViS dataset.
git clone https://github.com/rongfu-dsb/MPG-SAM2.git
cd MPG-SAM2
conda create -n MPG-SAM2 python=3.8
conda activate MPG-SAM2
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip install 'git+https://github.com/facebookresearch/fvcore'
We mainly use the Ref-Youtube-VOS, Ref-DAVIS17, RefCOCO/+/g, and MeViS datasets. For the data preparation process, please refer to ReferFormer and MeViS. Once the data is prepared, replace ./your_data in the code with the path to your dataset.
├── your_data
│ ├── coco
│ │ ├── refcoco
│ │ ├── refcoco+
│ │ ├── refcocog
│ │ └── train2014
│ ├── ref-youtube-vos
│ │ ├── meta_expressions
│ │ ├── train
│ │ └── valid
│ ├── ref-davis
│ │ ├── DAVIS
│ │ ├── davis_text_annotations
│ │ ├── meta_expressions
│ │ ├── train
│ │ └── valid
│ ├── mevis-release
│ │ ├── train
│ │ ├── valid
│ │ └── valid_u
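A quick way to catch data-preparation mistakes before training is to verify that the directory tree above is in place. The small helper below is a sketch based solely on the layout shown; the `missing_dirs` function and the `./your_data` default are illustrative, not part of the repository.

```python
import os

# Expected top-level dataset layout, transcribed from the tree above.
# Adjust the root to wherever you pointed ./your_data in the code.
EXPECTED = {
    "coco": ["refcoco", "refcoco+", "refcocog", "train2014"],
    "ref-youtube-vos": ["meta_expressions", "train", "valid"],
    "ref-davis": ["DAVIS", "davis_text_annotations",
                  "meta_expressions", "train", "valid"],
    "mevis-release": ["train", "valid", "valid_u"],
}

def missing_dirs(root):
    """Return a list of expected sub-directories that are absent under root."""
    missing = []
    for dataset, subdirs in EXPECTED.items():
        for sub in subdirs:
            path = os.path.join(root, dataset, sub)
            if not os.path.isdir(path):
                missing.append(path)
    return missing

if __name__ == "__main__":
    gaps = missing_dirs("./your_data")
    if gaps:
        print("Missing directories:")
        for p in gaps:
            print("  " + p)
    else:
        print("Dataset layout looks complete.")
```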
In this link BEiT-3, download the weights for BEiT-3-large/BEiT-3-base and beit3.spm, and then replace --encoder_pretrained and --version accordingly. In this link SAM2, download the weights for sam2_hiera_large and replace --vision_pretrained.
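To make the flag replacement concrete, a minimal sketch of how the three flags might be wired up is shown below. The checkpoint filenames and the ./weights directory are placeholders; substitute the actual paths of the files you downloaded from the BEiT-3 and SAM2 links.

```shell
# Placeholder paths -- point these at your downloaded checkpoints.
ENCODER=./weights/beit3_large_patch16_224.pth   # BEiT-3-large weights (--encoder_pretrained)
TOKENIZER=./weights/beit3.spm                   # BEiT-3 sentencepiece model (--version)
SAM2=./weights/sam2_hiera_large.pt              # sam2_hiera_large weights (--vision_pretrained)

# Pass them wherever the training/inference scripts accept these flags, e.g.:
# ... --encoder_pretrained "$ENCODER" --version "$TOKENIZER" --vision_pretrained "$SAM2"
```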
bash pre_fine.sh
bash ./scripts/dist_test_davis.sh ./workdir_davis ./your_ytvos_weights --ngpu=1
The Ref-DAVIS17 dataset is not used for training; instead, the weights trained on the Ref-Youtube-VOS dataset are directly used for inference.
bash finetune_mevis.sh
The inference results on the Ref-Youtube-VOS and MeViS datasets should be submitted to the Ref-Youtube-VOS and MeViS evaluation servers, respectively, for metric evaluation.
We borrow some code from the following works and would like to express our gratitude to them: SAM2, EVF-SAM, ReferFormer, BEiT-3.
@article{rong2025mpg,
title={MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation},
author={Rong, Fu and Lan, Meng and Zhang, Qian and Zhang, Lefei},
journal={arXiv preprint arXiv:2501.13667},
year={2025}
}

