
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation [CVPR 2025]

🎉 CVPR 2025 Highlight 🎉

Claudia Cuttano, Gabriele Trivigno, Gabriele Rosi, Carlo Masone, Giuseppe Averta

Welcome to the official repository for "SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation".

In this work, we build upon Segment Anything 2 (SAM2) and make it wiser by infusing natural language understanding and explicit temporal modeling.
🚀 No fine-tuning of SAM2 weights.
🧠 No reliance on external VLMs for multi-modal interaction.
📈 State-of-the-art performance across multiple benchmarks.
💡 Minimal overhead: just 4.9 M additional parameters!

📄 Read our paper on arXiv
🌍 Demo & Project Page

📢 [May 2025] Check out SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation — a unified framework powered by SAM2, supporting points, boxes, scribbles, and masks. No external models, no prompt-specific tweaks. 👉 Check out SANSA

📢 [June 2025] Try SAMWISE on your own data: we’ve added a simple script to run SAMWISE on videos or images using textual prompts. 👉 Try SAMWISE on Your Own Data.

👀 SAMWISE in Action

SAMWISE (our model, not the hobbit) segments objects from The Lord of the Rings in a zero-shot setting: no extra training, just living up to its namesake! 🧙‍♂️✨

(Demo video: samwise_in_action.mp4)

📊 Data Preparation

Before running SAMWISE, set up your datasets: refer to data.md for detailed preparation instructions.
Once organized, the directory structure should look like this:

SAMWISE/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
│   ├── MeViS/
├── datasets/
├── models/
│   ├── sam2/
│   ├── samwise.py
│   ├── ...
...
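To catch path mistakes early, here is a minimal sanity check for the layout above; the dataset directory names come from the tree, while the script itself is just a sketch and assumes it is run from the SAMWISE repository root:

# check_data_layout.py: verify the expected dataset directories exist.
from pathlib import Path

EXPECTED = ["data/ref-youtube-vos", "data/ref-davis", "data/MeViS"]

missing = [d for d in EXPECTED if not Path(d).is_dir()]
if missing:
    print("Missing dataset directories:", ", ".join(missing))
else:
    print("All expected dataset directories are in place.")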

⚙️ Environment Setup

The code has been tested with Python 3.10 and PyTorch 2.3.1 (with CUDA 11.8). To set up the environment using Conda, run:

conda create --name samwise python=3.10 -y
conda activate samwise
pip install torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
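Before launching experiments, it is worth confirming that the pinned PyTorch build actually sees your GPU; a quick check using only stock PyTorch (nothing SAMWISE-specific):

import torch, torchvision

print("torch:", torch.__version__)              # expect 2.3.1+cu118
print("torchvision:", torchvision.__version__)  # expect 0.18.1+cu118
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))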

🎥 Referring Video Object Segmentation (RVOS)

Reproducing Our Results: Below we provide the model weights needed to replicate the results reported in our paper.

| Dataset | Total Parameters | Trainable Params | J&F | Model | Zip |
|---|---|---|---|---|---|
| MeViS | 210 M | 4.9 M | 49.5 | Weights | Zip |
| MeViS (valid_u) | 210 M | 4.9 M | 57.1 | Weights | - |
| Ref-Youtube-VOS | 210 M | 4.9 M | 69.2 | Weights | Zip |
| Ref-Davis | 210 M | 4.9 M | 70.6 | Weights | - |

To evaluate the model on the MeViS valid_u split, run the following command:

python3 inference_mevis.py --split valid_u --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head

For Ref-Davis, run the following command:

python3 inference_davis.py --resume=[/path/to/model_weight] --name_exp [name_exp] --HSA --use_cme_head

For MeViS and Ref-Youtube-VOS, upload the generated zip file of predictions to the corresponding online evaluation server; see the Training and Inference Guide below for details on submitting to the benchmarks.
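The servers expect a single zip archive of predicted masks. The folder produced by the inference scripts depends on your --name_exp, so the path below is a placeholder to adapt, not the scripts' actual output name; a minimal packaging sketch:

# Bundle predicted masks into submission.zip for the evaluation server.
import shutil

pred_dir = "output/name_exp/Annotations"  # hypothetical path: adjust to your run
shutil.make_archive("submission", "zip", pred_dir)
print("Wrote submission.zip")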

🖼️ Referring Image Segmentation (RIS)

We also test SAMWISE on the Referring Image Segmentation (RIS) benchmark.

| RefCOCO | RefCOCO+ | RefCOCOg | Model |
|---|---|---|---|
| 75.6 | 65.8 | 66.8 | Weights |

Run the following to evaluate on RIS:

python3 main_pretrain.py --eval --resume=[/path/to/model_weight] --name_exp [name_exp] --disable_pred_obj_score 
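The RIS numbers above are IoU scores; as a reference for how such metrics are computed, here is a minimal sketch of overall IoU (cumulative intersection over cumulative union across a split) on binary numpy masks. This is a generic illustration, not the repository's evaluation code:

import numpy as np

def overall_iou(preds, gts):
    """Cumulative intersection / cumulative union over all (pred, gt) mask pairs."""
    inter = union = 0
    for p, g in zip(preds, gts):
        p, g = p.astype(bool), g.astype(bool)
        inter += np.logical_and(p, g).sum()
        union += np.logical_or(p, g).sum()
    return inter / max(union, 1)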

🚀 Training and Inference

For step-by-step instructions on training and inference, please refer to the Training and Inference Guide.

This document includes all necessary details on:

  • Training SAMWISE on different datasets
  • Running inference and evaluating performance
  • Submitting results to online benchmarks

▶️ Try SAMWISE on Your Own Data

We provide a simple script to run SAMWISE on your own inputs using natural language prompts.
Supported input types:

  • A single image (.jpg, .png, .jpeg)
  • A video (.mp4)
  • A folder of consecutive video frames (e.g., frame_00001.png, frame_00002.png, ...)

Run the script:

python inference_demo.py --input_path <your_input> --text_prompts <text_prompt 1> <text_prompt 2>

Examples:

# On a single image
python inference_demo.py --input_path assets/example_image.jpg --text_prompts "the dog who is jumping" "the dog on the left" "the person with a yellow jacket"

# On a video
python inference_demo.py --input_path assets/example_video.mp4 --text_prompts "the horse jumping" "the person riding the horse"

# On a folder of consecutive frames
python inference_demo.py --input_path demo_sequence --text_prompts "the horse jumping" "the person riding the horse"
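If you want the folder-of-frames input (for example, to subsample or pre-process a clip first), here is a small helper that dumps a video into the frame_00001.png naming shown above. It uses OpenCV (opencv-python), which may be an extra dependency beyond requirements.txt:

# Dump a video into frame_00001.png, frame_00002.png, ... for inference_demo.py.
import os
import cv2

def video_to_frames(video_path, out_dir="demo_sequence"):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        cv2.imwrite(os.path.join(out_dir, f"frame_{idx:05d}.png"), frame)
    cap.release()
    print(f"Wrote {idx} frames to {out_dir}/")

video_to_frames("assets/example_video.mp4")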

Output:

  • Image input:
    • demo_output/<text_prompt>/example_image.png
  • Video or sequence of frames:
    • Segmented frames: demo_output/<text_prompt>/frame_*.png
    • Segmented video: demo_output/<text_prompt>.mp4

🔗 Acknowledgements

We build upon the amazing work from SAM2.

Citation

@InProceedings{cuttano2025samwise,
    author    = {Cuttano, Claudia and Trivigno, Gabriele and Rosi, Gabriele and Masone, Carlo and Averta, Giuseppe},
    title     = {SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3395-3405}
}
