- Upcoming: InterRVOS-127K dataset and ReVIOSa checkpoints
- Upcoming: Data annotation pipeline
- Released: Training code, inference & evaluation code
- Released: InterRVOS on ArXiv and Project Page
- Model checkpoints
- InterRVOS-127K dataset (Training & Evaluation)
- Data annotation pipeline code
- Inference & evaluation code
- Training code
This repository contains the code for the paper InterRVOS: Interaction-aware Referring Video Object Segmentation.
In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. Please refer to the project page for detailed visualization results.
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os
# load the model and tokenizer
path = "wooj0216/ReVIOSa-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
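video_folder = "/PATH/TO/VIDEO_FOLDER"
# os.listdir returns names in arbitrary order, so sort them to keep the frames in temporal order
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]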
text_prompts = "<image>Please segment the child reaching out to man."
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]
masks = return_dict['prediction_masks']
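Frame order matters for video input, and a directory listing does not guarantee any particular order. A plain lexicographic sort is enough for zero-padded names such as 00001.jpg, but it would place 10.jpg before 2.jpg. The following is a minimal sketch of a natural-sort helper for such cases. The names `natural_key`, `list_frames`, and `IMAGE_EXTENSIONS` are hypothetical and not part of this repository, and the extension filter is an assumption about how the frames are stored.

```python
import os
import re

# Assumed set of frame file extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def natural_key(name):
    """Sort key that compares digit runs numerically, so '2.jpg' < '10.jpg'."""
    # re.split with a capturing group alternates non-digit and digit chunks,
    # so every odd index holds a run of digits.
    parts = re.split(r"(\d+)", name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def list_frames(video_folder):
    """Return the image paths in video_folder in natural (frame) order."""
    names = [
        n for n in os.listdir(video_folder)
        if os.path.splitext(n)[1].lower() in IMAGE_EXTENSIONS
    ]
    names.sort(key=natural_key)
    return [os.path.join(video_folder, n) for n in names]
```

With this helper, `images_paths = list_frames(video_folder)` could stand in for the listing step above; non-image files in the folder are skipped.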
Instructions for training, inference, and evaluation are provided in ReVIOSa/README.md.
Our automatic data annotation pipeline is provided in data_annotation.
This project is based on Sa2VA. Many thanks to the authors for their great work!
If you find this repository useful, please consider citing the following paper:
@misc{jin2025interrvosinteractionawarereferringvideo,
title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
author={Woojeong Jin and Seongchan Kim and Jaeho Lee and Seungryong Kim},
year={2025},
eprint={2506.02356},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2506.02356},
}