(ACM MM 25) Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation
Official code repository for our ACM MM 2025 paper:
"Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation"
Xiangyu Zheng, Songcheng He, Wanyu Li, Xiaoqiang Li, Wei Zhang [Paper Link]
This repository provides the official implementation of our ACM MM 2025 paper, "Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation" [Paper Link].
In this work, we propose HMHI-Net, a novel method for Unsupervised Video Object Segmentation (UVOS) that leverages shallow features for memory. The method features:
- A novel Hierarchical Memory Architecture that simultaneously incorporates shallow- and high-level features into memory, providing UVOS with both pixel-level details and semantic richness stored in the memory banks.
- A Heterogeneous Mutual Refinement Mechanism that performs interaction across the two memory banks, through the pixel-guided local alignment module (PLAM) and the semantic-guided global integration module (SGIM), respectively.
- HMHI-Net achieves state-of-the-art results on common UVOS and VSOD benchmarks, with 89.8% J&F on DAVIS-16, 86.9% J on FBMS, and 76.2% J on YouTube-Objects.
(a) Overall pipeline of HMHI-Net. (b) Memory readout mechanism to refine the current frame. (c) Pixel-guided local alignment module. (d) Semantic-guided global integration module. (e) Memory update mechanism with the reference encoder.
| Demo1 | Demo2 |
|---|---|
| ![]() | ![]() |
| Car-roundabout_Davis16 | Dog_Davis16 |

| Demo3 | Demo4 |
|---|---|
| ![]() | ![]() |
| Drift-straight_Davis16 | Parkour_Davis16 |
pip install -r requirements.txt

Thanks to [Calledit] for providing a more detailed environment installation script!
#!/bin/bash
conda create -n env_name python=3.10
conda activate env_name
pip install torch numpy opencv-python timm mmcv bytecode IPython tensorboard scikit-image
git clone https://github.com/luo3300612/Visualizer
cd Visualizer/
python setup.py install
cd ..
mkdir -p checkpoint/pretrained/mit/
wget -O checkpoint/pretrained/mit/mit_b1.pth https://download.openmmlab.com/mmsegmentation/v0.5/segformer/segformer_mit-b1_512x512_160k_ade20k/segformer_mit-b1_512x512_160k_ade20k_20220620_112037-c3f39e00.pth
pip install gdown
gdown --id 1OG_Dla9f-sBuoi3Q6mF55Au3rU-Fc9Sg -O checkpoint/infermodel.pth
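Before moving on, it can be worth sanity-checking that the downloads above produced non-empty files. A minimal sketch (`check_file` is a hypothetical helper, not part of this repository; paths match the commands above):

```shell
# Hypothetical helper: report whether a file exists and is non-empty.
check_file() {
  if [ -s "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or empty: $1"
  fi
}

check_file checkpoint/pretrained/mit/mit_b1.pth
check_file checkpoint/infermodel.pth
```

If a file is reported missing or empty, re-run the corresponding `wget`/`gdown` command before training or inference.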
mkdir -p Your_eval_data_path/FBMS2SEG_byvideo/frame/val
| Dataset | Download Link |
|---|---|
| YouTube-VOS | Download |
| DAVIS-16 | Download |
| FBMS | Download |
| Youtube-Objects | Download |
| DAVSOD | Download |
| ViSal | Download |
Following previous UVOS works, optical flow maps for both training and inference data are generated with [RAFT].
Please ensure the data files are organized as follows:
data/
βββ DAVIS-16/
βββ Images/
| βββ train/
| | βββ video_name1/
| | βββ video_name2/
| | ...
| βββ val/
| βββ video_name1/
| βββ video_name2/
| ...
βββ Annotations/
| βββ train/
| | βββ video_name1/
| | βββ video_name2/
| | ...
| βββ val/
| βββ video_name1/
| βββ video_name2/
| ...
βββ Flows/
| βββ train/
| | βββ video_name1/
| | βββ video_name2/
| | ...
| βββ val/
| βββ video_name1/
| βββ video_name2/
| ...
βββ Youtube-VOS/
βββ Images/
...
βββ Annotations/
...
βββ Flows/
...
...

Download the pretrained models and save them in './checkpoint/pretrained/' for model training.
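As a concrete starting point, the expected skeleton can be created with a short shell loop (a sketch based on the layout above; video folders are then filled in per dataset):

```shell
# Create the expected top-level skeleton for DAVIS-16
# (directory names taken from the layout above).
for sub in Images Annotations Flows; do
  for split in train val; do
    mkdir -p "data/DAVIS-16/${sub}/${split}"
  done
done
find data/DAVIS-16 -type d | sort
```

The same pattern applies to Youtube-VOS and the other datasets; each video's frames, annotations, and flow maps go into a folder named after the video under the matching split.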
We adopt SegFormer models pretrained on ImageNet-1K.
| Pretrained Model | Model Link |
|---|---|
| SegFormer (NeurIPS 21) | Mit_b0 - Mit_b5 or GoogleDrive |
| Swin-Transformer (ICCV 21) | Swin-T - Swin-B |
| Task | Download Link |
|---|---|
| UVOS Checkpoints | DAVIS-16, FBMS, Youtube-Objects |
| VSOD Checkpoints | DAVIS-16, DAVSOD, FBMS, ViSal |
# Certain config values in the file may require modification to suit your local setup.
bash scripts/train.sh

Load the best-performing checkpoint on the corresponding dataset from the training stage and start fine-tuning.
# Certain config values in the file may require modification to suit your local setup.
bash scripts/finetune.sh

# Certain config values in the file may require modification to suit your local setup.
bash scripts/infer.sh

# Certain config values in the file may require modification to suit your local setup.
# For UVOS tasks
python utils/val_zvos.py
# For VSOD tasks
python utils/val_vsod.py

This repository is built upon [Isomer] and [SAM], originally proposed in:
- "Isomer: Isomerous Transformer for Zero-Shot Video Object Segmentation", Yichen Yuan, Yifan Wang, Lijun Wang, Xiaoqi Zhao, Huchuan Lu, Yu Wang, Weibo Su, Lei Zhang, ICCV, 2023. [Paper]
- "Segment Anything", Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick, arXiv, 2023. [Paper]

We reuse parts of their codebase, including:

- The data loading pipeline
- Model initialization logic
- Training routines
- Module formulation
The model is licensed under the Apache 2.0 license.
@inproceedings{Zheng2025mm,
title = {Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation},
  author = {Zheng, Xiangyu and He, Songcheng and Li, Wanyu and Li, Xiaoqiang and Zhang, Wei},
booktitle = {Proceedings of the ACM International Conference on Multimedia (ACM MM)},
year = {2025}
}



