InterRVOS: Interaction-Aware Referring Video Object Segmentation

Woojeong Jin Seongchan Kim Jaeho Lee Seungryong Kim†
KAIST AI
†: Corresponding Author

ArXiv 2025

📢 News

Upcoming: InterRVOS-127K dataset and ReVIOSa checkpoints
Upcoming: Data annotation pipeline
Released: Training code, inference & evaluation code
Released: InterRVOS on ArXiv and Project Page

🎯 Release Progress

Overview

This repository contains the code for the paper InterRVOS: Interaction-aware Referring Video Object Segmentation.

In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a new task centered on modeling interactions between objects. The model must segment the actor and the target of an interaction separately, reflecting their asymmetric roles. Please refer to the project page for detailed visualization results.

Model Download

‼️ We release the pretrained ReVIOSa-1B and ReVIOSa-4B models on Hugging Face 🤗: ReVIOSa-1B and ReVIOSa-4B

🚀 Quick Start

import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image
import numpy as np
import os

# load the model and tokenizer
path = "wooj0216/ReVIOSa-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

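# folder containing the video frames as individual image files
video_folder = "/PATH/TO/VIDEO_FOLDER"
# sort the names so frames are passed in a stable order (os.listdir order is arbitrary)
images_paths = sorted(os.listdir(video_folder))
images_paths = [os.path.join(video_folder, image_name) for image_name in images_paths]
text_prompts = "<image>Please segment the child reaching out to man."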
input_dict = {
    'video': images_paths,
    'text': text_prompts,
    'past_text': '',
    'mask_prompts': None,
    'tokenizer': tokenizer,
}
return_dict = model.predict_forward(**input_dict)
answer = return_dict["prediction"]
masks = return_dict['prediction_masks']
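
The snippet above takes every entry in the folder as a frame, so the listing order and any stray non-image files matter. The following is a minimal sketch of a helper that filters to image files and sorts them in natural (numeric) order. The name `list_frames`, the helper `natural_key`, and the extension set are illustrative assumptions and are not part of the ReVIOSa code.

```python
import os
import re

# Assumed set of frame file formats; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def natural_key(name):
    """Sort key that compares digit runs as integers, so 'frame_10' follows 'frame_2'."""
    parts = re.split(r"(\d+)", name)
    # re.split with a capture group alternates text (even index) and digits (odd index).
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def list_frames(video_folder):
    """Return full paths of the image files in video_folder, in natural order."""
    names = [
        name for name in os.listdir(video_folder)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    ]
    return [os.path.join(video_folder, name) for name in sorted(names, key=natural_key)]
```

With this helper, `images_paths = list_frames(video_folder)` could stand in for the manual listing, assuming the frame file names encode temporal order.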

Dataset

‼️ We release our InterRVOS-127K dataset on Hugging Face 🤗: wooj0216/InterRVOS-127K

Model Training & Inference

Instructions for training, inference, and evaluation are provided in ReVIOSa/README.md.

Data Annotation

Our automatic data-annotation pipeline is provided in the data_annotation directory.

Acknowledgement

This project is based on Sa2VA. Many thanks to the authors for their great work!

References

If you find this repository useful, please consider citing the following paper:

@misc{jin2025interrvosinteractionawarereferringvideo,
    title={InterRVOS: Interaction-aware Referring Video Object Segmentation},
    author={Woojeong Jin and Seongchan Kim and Jaeho Lee and Seungryong Kim},
    year={2025},
    eprint={2506.02356},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2506.02356},
}
