
InterRVOS: Interaction-Aware Referring Video Object Segmentation

Woojeong Jin, Seongchan Kim, Jaeho Lee, Seungryong Kim
KAIST AI
†: Corresponding Author

ArXiv 2025

📢 News

  • Upcoming: InterRVOS-127K dataset and ReVIOSa checkpoints
  • Upcoming: Data annotation pipeline
  • Released: Training code, inference & evaluation code
  • Released: InterRVOS on ArXiv and Project Page

🎯 Release Progress

  • Model checkpoints
  • InterRVOS-127K dataset (Training & Evaluation)
  • Data annotation pipeline code
  • Inference & evaluation code
  • Training code

Overview

This repository contains the code for the paper InterRVOS: Interaction-aware Referring Video Object Segmentation.

In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on modeling object interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. Please refer to the project page for detailed visualization results.

Model Download

‼️ We release the pretrained ReVIOSa-1B and ReVIOSa-4B models on Hugging Face 🤗: ReVIOSa-1B and ReVIOSa-4B

🚀 Quick Start

import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os

# load the model and tokenizer
path = "wooj0216/ReVIOSa-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

video_folder = "/PATH/TO/VIDEO_FOLDER"
# sort so that frames are processed in temporal order
# (assumes frame filenames sort chronologically)
frame_names = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, name) for name in frame_names]
text_prompts = "<image>Please segment the child reaching out to the man."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]
masks = return_dict['prediction_masks']
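In Sa2VA-style interfaces, `prediction_masks` is typically a list with one entry per segmented object, each an array of per-frame binary masks shaped `(num_frames, H, W)`. A minimal sketch for saving such masks as PNGs, assuming that layout (the helper name `save_masks` and the dummy input are illustrative, not part of the released API):

```python
import os
import numpy as np
from PIL import Image

def save_masks(masks, out_dir):
    """Save per-object, per-frame binary masks as grayscale PNGs.

    masks: list of arrays shaped (num_frames, H, W), boolean or {0, 1}.
    """
    os.makedirs(out_dir, exist_ok=True)
    for obj_idx, obj_masks in enumerate(masks):
        for frame_idx, frame_mask in enumerate(np.asarray(obj_masks)):
            # scale {0, 1} to {0, 255} so the mask is visible as an image
            img = Image.fromarray(frame_mask.astype(np.uint8) * 255)
            img.save(os.path.join(out_dir, f"obj{obj_idx}_frame{frame_idx:05d}.png"))

# demo with a dummy mask standing in for return_dict['prediction_masks']
dummy = [np.zeros((2, 4, 4), dtype=bool)]
dummy[0][0, 1:3, 1:3] = True
save_masks(dummy, "./mask_out")
```

The saved PNGs can then be overlaid on the input frames with any standard image library for visualization.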

Dataset

‼️ We release our InterRVOS-127K dataset on Hugging Face 🤗: wooj0216/InterRVOS-127K

Model Training & Inference

Instructions for training, inference, and evaluation are provided in ReVIOSa/README.md.

Data Annotation

Our automatic data-annotation pipeline is provided in the data_annotation directory.

Acknowledgement

This project is based on Sa2VA. Many thanks to the authors for their great works!

References

If you find this repository useful, please consider citing the following paper:

@misc{jin2025interrvosinteractionawarereferringvideo,
    title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
    author={Woojeong Jin and Seongchan Kim and Jaeho Lee and Seungryong Kim},
    year={2025},
    eprint={2506.02356},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.02356},
}

About

Official implementation of "InterRVOS: Interaction-aware Referring Video Object Segmentation".
