🎉 CVPR 2025 Highlight 🎉
Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta
Welcome to the official repository for "SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation".
In this work, we build upon Segment Anything 2 (SAM2) and make it wiser by infusing natural language understanding and explicit temporal modeling.
🚀 No fine-tuning of SAM2 weights.
🧠 No reliance on external VLMs for multi-modal interaction.
📈 State-of-the-art performance across multiple benchmarks.
💡 Minimal overhead: just 4.9 M additional parameters!
📄 Read our paper on arXiv
🌍 Demo & Project Page
📢 [May 2025] Check out SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation, a unified framework powered by SAM2 that supports points, boxes, scribbles, and masks. No external models, no prompt-specific tweaks. 👉 Check out SANSA
📢 [June 2025] Try SAMWISE on your own data: we’ve added a simple script to run SAMWISE on videos or images using textual prompts. 👉 Try SAMWISE on Your Own Data.
SAMWISE (our model, not the hobbit) segments objects from The Lord of the Rings zero-shot: no extra training, just living up to its namesake! 🧙‍♂️✨
(Demo video: samwise_in_action.mp4)
Before running SAMWISE, set up your dataset: refer to data.md for detailed data preparation.
Once organized, the directory structure should look like this:
SAMWISE/
├── data/
│ ├── ref-youtube-vos/
│ ├── ref-davis/
│ ├── MeViS/
├── datasets/
├── models/
│ ├── sam2/
│ ├── samwise.py
│ ├── ...
...
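After preparing the data, you can quickly confirm the expected folders are in place. A minimal check based on the layout above (the paths mirror the tree; adjust if you only set up a subset of the datasets):

```python
# Quick sanity check that the dataset folders match the expected layout.
import os

for d in ["data/ref-youtube-vos", "data/ref-davis", "data/MeViS"]:
    print(d, "OK" if os.path.isdir(d) else "MISSING")
```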
The code has been tested with Python 3.10 and PyTorch 2.3.1 (with CUDA 11.8). To set up the environment using Conda, run:
conda create --name samwise python=3.10 -y
conda activate samwise
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
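To verify the environment matches the tested versions, you can run a quick check with plain PyTorch (no repo code involved):

```python
# Sanity-check the installed PyTorch / CUDA versions.
import torch

print(torch.__version__)          # expected: 2.3.1+cu118
print(torch.version.cuda)         # expected: 11.8
print(torch.cuda.is_available())  # should print True on a GPU machine
```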
Reproducing Our Results

Below, we provide the model weights to replicate the results of our paper.

| Dataset | Total Params | Trainable Params | J&F | Model | Zip |
|---|---|---|---|---|---|
| MeViS | 210 M | 4.9 M | 49.5 | Weights | Zip |
| MeViS - valid_u | 210 M | 4.9 M | 57.1 | Weights | - |
| Ref-Youtube-VOS | 210 M | 4.9 M | 69.2 | Weights | Zip |
| Ref-Davis | 210 M | 4.9 M | 70.6 | Weights | - |
To evaluate the model on the MeViS valid_u split, run the following command:
python3 inference_mevis.py --split valid_u --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For Ref-Davis, run the following command:
python3 inference_davis.py --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head
For MeViS and Ref-Youtube-VOS, upload the generated zip file to the corresponding online evaluation server (submission details are in the Training and Inference Guide).
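As a rough sketch of the packaging step (the folder names here are assumptions; the exact output layout and submission format are covered in the Training and Inference Guide):

```python
# Hypothetical packaging sketch: zip the predicted masks for upload.
# Assumes the inference script wrote per-video mask folders under
# results/Annotations; adjust to the layout your run actually produced.
import shutil

shutil.make_archive("submission", "zip", root_dir="results", base_dir="Annotations")
# -> submission.zip, ready to upload to the benchmark's evaluation server.
```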
We also evaluate SAMWISE on the Referring Image Segmentation (RIS) benchmarks.
| RefCOCO | RefCOCO+ | RefCOCOg | Model |
|---|---|---|---|
| 75.6 | 65.8 | 66.8 | Weights |
Run the following to evaluate on RIS:
python3 main_pretrain.py --eval --resume=[/path/to/model_weight] --name_exp [name_exp] --disable_pred_obj_score
For step-by-step instructions on training and inference, please refer to the Training and Inference Guide.
This document includes all necessary details on:
- Training SAMWISE on different datasets
- Running inference and evaluating performance
- Submitting results to online benchmarks
We provide a simple script to run SAMWISE on your own inputs using natural language prompts.
Supported input types:
- A single image (.jpg, .png, .jpeg)
- A video (.mp4)
- A folder of consecutive video frames (e.g., frame_00001.png, frame_00002.png, ...); see the frame-extraction sketch below
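If you start from a video but want to run on a frame folder, here is a minimal extraction sketch (it assumes OpenCV is available; paths and naming are just examples matching the pattern above):

```python
# Hypothetical helper: split an .mp4 into consecutive numbered frames.
import os
import cv2

os.makedirs("demo_sequence", exist_ok=True)
cap = cv2.VideoCapture("assets/example_video.mp4")
idx = 1
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imwrite(f"demo_sequence/frame_{idx:05d}.png", frame)
    idx += 1
cap.release()
```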
Run the script:
python inference_demo.py --input_path <your_input> --text_prompts <text_prompt 1> <text_prompt 2>
Examples:
# On a single image
python inference_demo.py --input_path assets/example_image.jpg --text_prompts "the dog who is jumping" "the dog on the left" "the person with a yellow jacket"
# On a video
python inference_demo.py --input_path assets/example_video.mp4 --text_prompts "the horse jumping" "the person riding the horse"
# On a folder of consecutive frames
python inference_demo.py --input_path demo_sequence --text_prompts "the horse jumping" "the person riding the horse"
Output:
- Image input: demo_output/<text_prompt>/example_image.png
- Video or sequence of frames:
  - Segmented frames: demo_output/<text_prompt>/frame_*.png
  - Segmented video: demo_output/<text_prompt>.mp4 (see the sketch below for rebuilding the video yourself)
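The demo already writes demo_output/<text_prompt>.mp4, so the following is only needed for custom post-processing, e.g. a different frame rate. A sketch with OpenCV (the prompt folder and paths are examples, not fixed names):

```python
# Hypothetical post-processing: rebuild a video from the segmented frames
# at a custom fps. Adjust the prompt folder to match your own run.
import glob
import cv2

frames = sorted(glob.glob("demo_output/the horse jumping/frame_*.png"))
h, w = cv2.imread(frames[0]).shape[:2]
writer = cv2.VideoWriter("custom_output.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 24, (w, h))
for path in frames:
    writer.write(cv2.imread(path))
writer.release()
```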
We build upon the amazing work from SAM2.

If you find SAMWISE useful, please consider citing:
@InProceedings{cuttano2025samwise,
author = {Cuttano, Claudia and Trivigno, Gabriele and Rosi, Gabriele and Masone, Carlo and Averta, Giuseppe},
title = {SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {3395-3405}
}