GitHub - Dmmm1997/MomentSeg: MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²

¹Southeast University; ²Baidu VIS

MomentSeg is a unified MLLM for pixel-level vision–language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning image/video segmentation, video temporal grounding, and image/video question answering.

🔥 News

2025.10.12 🔥 Our paper and video demo have been released.

🕒 Open-Source Plan

Paper and Video Demo: released
Model Weights and Inference Instructions: coming soon
Training Code and Detailed Documentation: to be released in a later phase

🎥 Demo

Demo 1

Input Video (Source: Internet):

Instruction: "Please segment the monkey that is scratching its ear."

Demo 2

Input Video (Source: Internet):

Instruction: "Please segment the person standing in the center wearing blue clothes."

🏆 Performance

🖼️ Image-level Segmentation

(Referring Image Segmentation & Reasoning Segmentation)

Benchmark	Evaluation Results (3B/7B)
RefCOCO (RES)	`val: 82.1/82.6` `testA: 83.7/85.1` `testB: 79.2/80.2`
RefCOCO+ (RES)	`val: 76.9/78.2` `testA: 81.1/81.9` `testB: 71.8/71.3`
RefCOCOg (RES)	`val(U): 78.8/80.1` `test(U): 79.2/80.1`
ReasonSeg	`val: 62.0/63.3` `test: 64.3/65.5`
GCG	`val: 67.0/67.8` `test: 65.9/67.9`

🎬 Video-level Segmentation

(Referring Video Object Segmentation)

Benchmark	Evaluation Results (3B/7B)
ReVOS	`J: 59.7/61.9` `F: 64.4/66.1` `J&F: 62.1/64.0`
ReasonVOS	`J: 58.2/59.2` `F: 65.3/66.1` `J&F: 61.7/62.7`
MeViS (val_u)	`J: 58.1/58.7` `F: 65.9/66.5` `J&F: 62.0/62.6`
MeViS (val)	`J: 51.7/53.9` `F: 58.0/60.2` `J&F: 54.8/57.1`
Ref-YouTube-VOS	`J: 69.8/70.1` `F: 74.3/74.5` `J&F: 72.0/72.3`
Ref-DAVIS17	`J: 72.2/73.2` `F: 80.6/81.7` `J&F: 76.4/77.4`
Ref-SAV	`J: 79.2/80.1` `F: 80.6/81.4` `J&F: 79.9/80.8`
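
For readers unfamiliar with the columns above: in the DAVIS-style convention, J is region similarity (the Jaccard index, i.e. mask IoU), F is contour accuracy (a boundary F-measure), and J&F is their mean. The following is a minimal sketch of J and of the J&F average only. It is not the evaluation code of this repository (which has not been released), and the function names `region_similarity` and `j_and_f` are hypothetical names chosen for illustration.

```python
from typing import Sequence

# A binary mask as nested sequences of 0/1 values (illustrative type alias).
Mask = Sequence[Sequence[int]]


def region_similarity(pred: Mask, gt: Mask) -> float:
    """Jaccard index (J) between two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j: float, f: float) -> float:
    """Mean of region similarity (J) and contour accuracy (F)."""
    return (j + f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [1, 0]]
    print(f"J = {region_similarity(pred, gt):.3f}")
    # ReVOS, 7B column of the table above: J = 61.9, F = 66.1.
    print(f"J&F = {j_and_f(61.9, 66.1):.1f}")
```

Averaging the rounded J and F values from the table may differ from the reported J&F in the last digit, presumably because the reported numbers are averaged before rounding.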

⏱️ Temporal Sentence Grounding

(Temporal Sentence Grounding)

Benchmark	Evaluation Results (3B)
Charades-STA	`R@0.3: 76.1` `R@0.5: 58.2` `mIoU: 50.0`
ActivityNet-Grounding	`R@0.3: 65.6` `R@0.5: 45.6` `mIoU: 45.1`
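
The columns above follow the usual temporal grounding convention: R@m is the percentage of queries whose predicted segment has temporal IoU of at least m with the ground-truth segment, and mIoU is the mean temporal IoU over all queries. A minimal sketch under that assumption is given below. It is not the repository's evaluation script (not yet released), and `temporal_iou` and `grounding_metrics` are hypothetical names used only for illustration.

```python
from typing import Dict, Sequence, Tuple

# A segment is (start_seconds, end_seconds).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: Sequence[Segment],
    gts: Sequence[Segment],
    thresholds: Sequence[float] = (0.3, 0.5),
) -> Dict[str, float]:
    """Return R@m for each threshold and mIoU, all as percentages."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    results = {
        f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds
    }
    results["mIoU"] = 100.0 * sum(ious) / n
    return results


if __name__ == "__main__":
    # Toy predictions and ground truths, made up for illustration only.
    preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
    gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
    for name, value in grounding_metrics(preds, gts).items():
        print(f"{name}: {value:.1f}")
```

The sketch scores one predicted segment per query (top-1), which is the common setting for these benchmarks. Whether the reported numbers use exactly this protocol is an assumption.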

🤖 Model Zoo (TODO)

Model Name	Base MLLM	HF Link
MomentSeg-3B	Qwen2.5-VL-3B	[🤗 link]
MomentSeg-7B	Qwen2.5-VL-7B	[🤗 link]

🚀 Training (TODO)

📊 Evaluation (TODO)

📖 Citation

Please kindly cite our paper if you find this project helpful.

@misc{momentseg,
      title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding}, 
      author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
      year={2025},
      eprint={2510.09274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09274}, 
}
