- We propose a novel framework, SAM2-LOVE, which is the first to leverage SAM2 for pixel-wise understanding in language-aided audio-visual scenes (LAVS) by designing a multimodal fusion module.
- We develop novel token propagation and accumulation strategies to improve the spatio-temporal comprehension of the promptable token.
- Extensive experiments on the Ref-AVS dataset demonstrate the superiority of our method, and ablation studies highlight the simplicity and effectiveness of its modules.
Our work builds primarily on EVF-SAM, SAM2, and Ref-AVS. We are sincerely grateful for their excellent work.
If you find our paper and code helpful for your research, please consider starring our repository ⭐ and citing our work ✍️.
```bibtex
@inproceedings{wang2025sam2,
  title={SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes},
  author={Wang, Yuji and Xu, Haoran and Liu, Yong and Li, Jiaze and Tang, Yansong},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={28932--28941},
  year={2025}
}
```
