This repository contains the code for the paper OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework [ECCV'2024].
[2024/07] X-Prompt: Our new work X-Prompt: Multimodal Visual Prompt for Video Object Segmentation has been accepted by [ACMMM'2024]. This work proposes a novel multimodal VOS approach, leveraging OneVOS as the RGB-VOS foundation model and incorporating Multi-modal Adaptation Experts to integrate additional modality-specific knowledge. The proposed method achieves SOTA performance across 4 benchmarks.
Our trained models, benchmark scores, and pre-computed results reproduced by this project can be found in MODEL_ZOO.md.
```shell
conda create -n vos python=3.9 -y
conda activate vos
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
git clone https://github.com/ClementPinard/Pytorch-Correlation-extension.git
cd Pytorch-Correlation-extension
python setup.py install
cd -
```

We follow the same data preparation steps as AOT, covering the Static image datasets, DAVIS, and YouTube-VOS. In addition, we use the more complicated MOSE dataset and the long-term VOS dataset LVOS V1 for training and testing.
```
├── OneVOS
├── datasets
│   ├── Static
│   │   └── ...
│   ├── DAVIS
│   │   └── ...
│   ├── YOUTUBE-VOS
│   │   └── ...
│   ├── MOSE
│   │   └── ...
│   ├── LVOS
│   │   └── ...
│   ├── ...
```

We initialize OneVOS using the weights of ConvMAE-Base as the backbone. You can download the pretrained weights directly from pretrain_backbone and place them in OneVOS/pretrain_weights/.
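A quick sanity check of the layout above can be scripted before launching training. The snippet below is a hedged sketch: the `check_layout` helper and its root-path argument are illustrative, not part of the OneVOS codebase.

```python
from pathlib import Path

# Expected top-level dataset folders, as listed in the tree above.
EXPECTED = ["Static", "DAVIS", "YOUTUBE-VOS", "MOSE", "LVOS"]

def check_layout(root: str) -> list[str]:
    """Return the names of expected dataset folders missing under <root>/datasets."""
    base = Path(root) / "datasets"
    return [name for name in EXPECTED if not (base / name).is_dir()]
```

For example, `check_layout("OneVOS")` returns an empty list when every dataset folder is in place, and the names of any missing ones otherwise.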
Stages:

- `PRE`: the pre-training stage with static images.
- `PRE_YTB_DAV`: the main-training stage with YouTube-VOS and DAVIS.
- `PRE_YTB_DAV_MOSE`: the main-training stage with YouTube-VOS, DAVIS, and MOSE.
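As a hedged illustration of the curriculum above, the stage-to-datasets relationship can be written down as a small table. This mapping is inferred from the stage descriptions in this README and is not code from the repository.

```python
# Illustrative only: maps each training stage (names from this README)
# to the datasets it trains on. Not part of the OneVOS codebase.
STAGE_DATASETS = {
    "PRE": ["Static"],                                     # pre-training on static images
    "PRE_YTB_DAV": ["YouTube-VOS", "DAVIS"],               # main training
    "PRE_YTB_DAV_MOSE": ["YouTube-VOS", "DAVIS", "MOSE"],  # main training incl. MOSE
}

def datasets_for(stage: str) -> list[str]:
    """Look up which datasets a given training stage uses."""
    return STAGE_DATASETS[stage]
```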
Example training scripts are provided in train_examples_pre_ytb_dav.sh and train_examples_pre_ytb_dav_mose.sh.
Example inference scripts are provided in eval_examples_pre_ytb_dav.sh and eval_examples_pre_ytb_dav_mose.sh.
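A typical run chains a training script with its matching evaluation script. The driver below is a hypothetical sketch: the script names are the ones shipped in this repo, but the wrapper itself (and its `RUN` toggle) is an assumption, not repository code. With `RUN` unset it only prints the plan.

```shell
#!/usr/bin/env bash
# Hypothetical driver: prints the planned commands, and only executes the
# repo's example scripts when RUN=1 is set in the environment.
set -euo pipefail
for s in train_examples_pre_ytb_dav.sh eval_examples_pre_ytb_dav.sh; do
  echo "plan: bash ${s}"
  if [ "${RUN:-0}" = "1" ]; then bash "${s}"; fi
done
```

Invoke it as `RUN=1 bash run_all.sh` (a hypothetical filename) once the datasets and pretrained backbone are in place.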
If you find this repository useful, please consider giving a star and citation:
```bibtex
@inproceedings{li2025onevos,
  title={{OneVOS}: Unifying Video Object Segmentation with All-in-One Transformer Framework},
  author={Li, Wanyun and Guo, Pinxue and Zhou, Xinyu and Hong, Lingyi and He, Yangji and Zheng, Xiangyu and Zhang, Wei and Zhang, Wenqiang},
  booktitle={European Conference on Computer Vision},
  pages={20--40},
  year={2025},
  organization={Springer}
}
```
