Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²

¹Southeast University; ²Baidu VIS
MomentSeg is a unified MLLM for pixel-level vision–language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning image/video segmentation, video temporal grounding, and image/video question answering.
2025.10.12 🔥 Our paper and video demo have been released.
- Paper and Video Demo
- Model Weights and Inference Instructions — Coming soon
- Training Code and Detailed Documentation — To be released in a later phase
Demo 1
Input Video (Source: Internet)

Instruction: "Please segment the monkey that is scratching its ear."
Demo 2
Input Video (Source: Internet)

Instruction: "Please segment the person standing in the center wearing blue clothes."
🖼️ Image-level Segmentation
(Referring Image Segmentation & Reasoning Segmentation)
| Benchmark | Evaluation Results (3B/7B) |
|---|---|
| RefCOCO (RES) | val: 82.1/82.6 testA: 83.7/85.1 testB: 79.2/80.2 |
| RefCOCO+ (RES) | val: 76.9/78.2 testA: 81.1/81.9 testB: 71.8/71.3 |
| RefCOCOg (RES) | val(U): 78.8/80.1 test(U): 79.2/80.1 |
| ReasonSeg | val: 62.0/63.3 test: 64.3/65.5 |
| GCG | val: 67.0/67.8 test: 65.9/67.9 |
🎬 Video-level Segmentation
(Referring Video Object Segmentation)
| Benchmark | Evaluation Results (3B/7B) |
|---|---|
| ReVOS | J: 59.7/61.9 F: 64.4/66.1 J&F: 62.1/64.0 |
| ReasonVOS | J: 58.2/59.2 F: 65.3/66.1 J&F: 61.7/62.7 |
| MeViS (val_u) | J: 58.1/58.7 F: 65.9/66.5 J&F: 62.0/62.6 |
| MeViS (val) | J: 51.7/53.9 F: 58.0/60.2 J&F: 54.8/57.1 |
| Ref-YouTube-VOS | J: 69.8/70.1 F: 74.3/74.5 J&F: 72.0/72.3 |
| Ref-DAVIS17 | J: 72.2/73.2 F: 80.6/81.7 J&F: 76.4/77.4 |
| Ref-SAV | J: 79.2/80.1 F: 80.6/81.4 J&F: 79.9/80.8 |
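For readers unfamiliar with the columns above: in the usual video object segmentation convention, J is region similarity (mask IoU), F is contour accuracy, and J&F is their arithmetic mean. The sketch below only illustrates that relationship; the function name `j_and_f` is hypothetical and is not taken from the MomentSeg code base.

```python
def j_and_f(j: float, f: float) -> float:
    """Arithmetic mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


# ReVOS, 3B column from the table above: J = 59.7, F = 64.4
score = j_and_f(59.7, 64.4)
print(f"J&F = {score:.2f}")
```

Because the table reports rounded values, recomputing J&F from the rounded J and F can differ from the reported J&F in the last digit.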
⏱️ Temporal Sentence Grounding
(Temporal Sentence Grounding)
| Benchmark | Evaluation Results (3B) |
|---|---|
| Charades-STA | R@0.3: 76.1 R@0.5: 58.2 mIoU: 50.0 |
| ActivityNet-Grounding | R@0.3: 65.6 R@0.5: 45.6 mIoU: 45.1 |
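The metrics in this table follow the common temporal grounding convention: R@m is the percentage of queries whose predicted segment overlaps the ground-truth segment with temporal IoU of at least m, and mIoU is the mean temporal IoU over all queries. The following is a minimal sketch of those definitions, not the project's evaluation script; the names `temporal_iou`, `recall_at`, and `mean_iou` are hypothetical, and some toolkits use a strict `>` instead of `>=` at the threshold.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at(ious: List[float], threshold: float) -> float:
    """Percentage of queries with IoU >= threshold (assumed convention)."""
    return 100.0 * sum(iou >= threshold for iou in ious) / len(ious)


def mean_iou(ious: List[float]) -> float:
    """Mean temporal IoU, expressed as a percentage."""
    return 100.0 * sum(ious) / len(ious)


# Toy predictions and ground truth, purely illustrative.
preds = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]
gts = [(0.0, 10.0), (5.0, 15.0), (6.0, 8.0)]
ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
print(recall_at(ious, 0.3), recall_at(ious, 0.5), mean_iou(ious))
```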
| Model Name | Base MLLM | HF Link |
|---|---|---|
| MomentSeg-3B | Qwen2.5-VL-3B | [🤗 link] |
| MomentSeg-7B | Qwen2.5-VL-7B | [🤗 link] |
Please cite our paper if you find this project helpful.
@misc{momentseg,
title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding},
author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
year={2025},
eprint={2510.09274},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.09274},
}


