🎉 CVPR 2025 Highlight 🎉
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Welcome to the official repository for "SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation".
In this work, we build upon Segment Anything 2 (SAM2) and make it wiser by infusing natural language understanding and explicit temporal modeling.
🚀 No fine-tuning of SAM2 weights.
🧠 No reliance on external VLMs for multi-modal interaction.
📈 State-of-the-art performance across multiple benchmarks.
💡 Minimal overhead: just 4.9 M additional parameters!
📄 Read our paper on arXiv
🌍 Demo & Project Page
📢 [May 2025] Check out SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation, a unified framework powered by SAM2 that supports points, boxes, scribbles, and masks. No external models, no prompt-specific tweaks. 👉 Check out SANSA
📢 [June 2025] Try SAMWISE on your own data: we’ve added a simple script to run SAMWISE on videos or images using textual prompts. 👉 Try SAMWISE on Your Own Data.
SAMWISE (our model, not the hobbit) segments objects from The Lord of the Rings zero-shot: no extra training, just living up to its namesake! 🧙‍♂️✨
(Demo video: samwise_in_action.mp4)
Before running SAMWISE, set up your dataset: refer to data.md for detailed data preparation.
Once organized, the directory structure should look like this:
SAMWISE/
├── data/
│ ├── ref-youtube-vos/
│ ├── ref-davis/
│ ├── MeViS/
├── datasets/
├── models/
│ ├── sam2/
│ ├── samwise.py
│ ├── ...
...
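After preparing the data, you can quickly confirm the expected folders are in place. A minimal check based on the layout above (the paths mirror the tree; adjust if you only set up a subset of the datasets):

```python
# Quick sanity check that the dataset folders match the expected layout.
import os

for d in ["data/ref-youtube-vos", "data/ref-davis", "data/MeViS"]:
    print(d, "OK" if os.path.isdir(d) else "MISSING")
```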
The code has been tested with Python 3.10 and PyTorch 2.3.1 (with CUDA 11.8). To set up the environment using Conda, run:
conda create --name samwise python=3.10 -y
conda activate samwise
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
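To verify the environment matches the tested versions, you can run a quick check with plain PyTorch (no repo code involved):

```python
# Sanity-check the installed PyTorch / CUDA versions.
import torch

print(torch.__version__)          # expected: 2.3.1+cu118
print(torch.version.cuda)         # expected: 11.8
print(torch.cuda.is_available())  # should print True on a GPU machine
```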
Reproducing Our Results

Below, we provide the model weights to replicate the results of our paper.

| Dataset | Total Params | Trainable Params | J&F | Model | Zip |
|---|---|---|---|---|---|
| MeViS | 210 M | 4.9 M | 49.5 | Weights | Zip |
| MeViS - valid_u | 210 M | 4.9 M | 57.1 | Weights | - |
| Ref-Youtube-VOS | 210 M | 4.9 M | 69.2 | Weights | Zip |
| Ref-Davis | 210 M | 4.9 M | 70.6 | Weights | - |
To evaluate the model on the MeViS valid_u split, run the following command:
python3 inference_mevis.py --split valid_u --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For Ref-Davis, run the following command:
python3 inference_davis.py --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For MeViS and Ref-Youtube-VOS, upload the generated zip file to the corresponding online evaluation server (submission details are in the Training and Inference Guide).
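As a rough sketch of the packaging step (the folder names here are assumptions; the exact output layout and submission format are covered in the Training and Inference Guide):

```python
# Hypothetical packaging sketch: zip the predicted masks for upload.
# Assumes the inference script wrote per-video mask folders under
# results/Annotations; adjust to the layout your run actually produced.
import shutil

shutil.make_archive("submission", "zip", root_dir="results", base_dir="Annotations")
# -> submission.zip, ready to upload to the benchmark's evaluation server.
```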
We also evaluate SAMWISE on the Referring Image Segmentation (RIS) benchmarks.
| RefCOCO | RefCOCO+ | RefCOCOg | Model |
|---|---|---|---|
| 75.6 | 65.8 | 66.8 | Weights |
Run the following to evaluate on RIS:
python3 main_pretrain.py --eval --resume=[/path/to/model_weight] --name_exp [name_exp] --disable_pred_obj_score
For step-by-step instructions on training and inference, please refer to the Training and Inference Guide.
This document includes all necessary details on:
- Training SAMWISE on different datasets
- Running inference and evaluating performance
- Submitting results to online benchmarks
We provide a simple script to run SAMWISE on your own inputs using natural language prompts.
Supported input types:
- A single image (.jpg, .png, .jpeg)
- A video (.mp4)
- A folder of consecutive video frames (e.g., frame_00001.png, frame_00002.png, ...); see the frame-extraction sketch below
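If you start from a video but want to run on a frame folder, here is a minimal extraction sketch (it assumes OpenCV is available; paths and naming are just examples matching the pattern above):

```python
# Hypothetical helper: split an .mp4 into consecutive numbered frames.
import os
import cv2

os.makedirs("demo_sequence", exist_ok=True)
cap = cv2.VideoCapture("assets/example_video.mp4")
idx = 1
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(f"demo_sequence/frame_{idx:05d}.png", frame)
    idx += 1
cap.release()
```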
Run the script:
python inference_demo.py --input_path <your_input> --text_prompts <text_prompt 1> <text_prompt 2>
Examples:
# On a single image
python inference_demo.py --input_path assets/example_image.jpg --text_prompts "the dog who is jumping" "the dog on the left" "the person with a yellow jacket"
# On a video
python inference_demo.py --input_path assets/example_video.mp4 --text_prompts "the horse jumping" "the person riding the horse"
# On a folder of consecutive frames
python inference_demo.py --input_path demo_sequence --text_prompts "the horse jumping" "the person riding the horse"
Output:
- Image input: demo_output/<text_prompt>/example_image.png
- Video or sequence of frames:
  - Segmented frames: demo_output/<text_prompt>/frame_*.png
  - Segmented video: demo_output/<text_prompt>.mp4 (see the sketch below for rebuilding the video yourself)
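The demo already writes demo_output/<text_prompt>.mp4, so the following is only needed for custom post-processing, e.g. a different frame rate. A sketch with OpenCV (the prompt folder and paths are examples, not fixed names):

```python
# Hypothetical post-processing: rebuild a video from the segmented frames
# at a custom fps. Adjust the prompt folder to match your own run.
import glob
import cv2

frames = sorted(glob.glob("demo_output/the horse jumping/frame_*.png"))
h, w = cv2.imread(frames[0]).shape[:2]
writer = cv2.VideoWriter("custom_output.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 24, (w, h))
for path in frames:
    writer.write(cv2.imread(path))
writer.release()
```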
We build upon the amazing work from SAM2.

If you find SAMWISE useful, please consider citing:
@InProceedings{cuttano2025samwise,
author = {Cuttano, Claudia and Trivigno, Gabriele and Rosi, Gabriele and Masone, Carlo and Averta, Giuseppe},
title = {SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {3395-3405}
}