MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding

Ming Dai¹, Sen Yang², Boqiang Duan², Wankou Yang¹, Jingdong Wang²

¹Southeast University; ²Baidu VIS


Demo Animation


MomentSeg is a unified MLLM for pixel-level vision–language understanding, designed with a moment-centric sampling strategy to better capture fine-grained semantics in video. It flexibly supports a range of tasks, including referring and reasoning image/video segmentation, video temporal grounding, and image/video question answering.
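This README does not spell out how the moment-centric sampling works, so the following is only an illustrative sketch of the general idea, not the MomentSeg implementation. It assumes a relevant moment (a frame range) is already known, spends most of a fixed frame budget inside that range, and spreads the remainder over the rest of the video for context. The function name `sample_frame_indices` and the parameters `budget` and `dense_ratio` are hypothetical.

```python
def sample_frame_indices(num_frames, moment, budget=16, dense_ratio=0.75):
    """Pick up to `budget` frame indices, most of them inside `moment`.

    `moment` is an inclusive (start, end) frame range. Hypothetical helper
    for illustration only; the actual MomentSeg procedure may differ.
    """
    start, end = moment
    if not (0 <= start <= end < num_frames):
        raise ValueError("moment must lie inside the video")

    def evenly_spaced(candidates, k):
        # Choose up to k items spread evenly over `candidates`.
        if k <= 0 or not candidates:
            return []
        if k >= len(candidates):
            return list(candidates)
        step = len(candidates) / k
        return [candidates[int(i * step + step / 2)] for i in range(k)]

    inside = list(range(start, end + 1))
    outside = [i for i in range(num_frames) if i < start or i > end]

    # Dense sampling inside the moment, sparse sampling elsewhere.
    n_dense = min(len(inside), round(budget * dense_ratio))
    n_sparse = min(len(outside), budget - n_dense)
    picked = evenly_spaced(inside, n_dense) + evenly_spaced(outside, n_sparse)
    return sorted(picked)


# A 100-frame clip whose relevant moment spans frames 40..59.
indices = sample_frame_indices(100, (40, 59))
print(indices)
```

With the default settings, 12 of the 16 sampled frames fall inside the moment and 4 are spread over the remaining frames.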

🔥 News

  • 2025.10.12 🔥 Our paper and video demo have been released.

🕒 Open-Source Plan

  • Paper and Video Demo
  • Model Weights and Inference Instructions — Coming soon
  • Training Code and Detailed Documentation — To be released in a later phase

🎥 Demo

Demo 1 Input Video (Source: Internet):

Instruction: "Please segment the monkey that is scratching its ear."

Demo 2 Input Video (Source: Internet):

Instruction: "Please segment the person standing in the center wearing blue clothes."

🏆 Performance

🖼️ Image-level Segmentation

(Referring Image Segmentation & Reasoning Segmentation)

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| RefCOCO (RES) | val: 82.1/82.6, testA: 83.7/85.1, testB: 79.2/80.2 |
| RefCOCO+ (RES) | val: 76.9/78.2, testA: 81.1/81.9, testB: 71.8/71.3 |
| RefCOCOg (RES) | val(U): 78.8/80.1, test(U): 79.2/80.1 |
| ReasonSeg | val: 62.0/63.3, test: 64.3/65.5 |
| GCG | val: 67.0/67.8, test: 65.9/67.9 |

🎬 Video-level Segmentation

(Referring Video Object Segmentation)

| Benchmark | Evaluation Results (3B/7B) |
| --- | --- |
| ReVOS | J: 59.7/61.9, F: 64.4/66.1, J&F: 62.1/64.0 |
| ReasonVOS | J: 58.2/59.2, F: 65.3/66.1, J&F: 61.7/62.7 |
| MeViS (val_u) | J: 58.1/58.7, F: 65.9/66.5, J&F: 62.0/62.6 |
| MeViS (val) | J: 51.7/53.9, F: 58.0/60.2, J&F: 54.8/57.1 |
| Ref-YouTube-VOS | J: 69.8/70.1, F: 74.3/74.5, J&F: 72.0/72.3 |
| Ref-DAVIS17 | J: 72.2/73.2, F: 80.6/81.7, J&F: 76.4/77.4 |
| Ref-SAV | J: 79.2/80.1, F: 80.6/81.4, J&F: 79.9/80.8 |
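
For readers unfamiliar with the column names: in video object segmentation, J is usually the region similarity (mask intersection-over-union), F is the boundary accuracy, and J&F is their arithmetic mean. The sketch below shows J on a toy mask pair and the J&F average; it is not the official evaluation code, and the helper names `region_similarity` and `j_and_f` are made up for this example. F is taken as a given number because computing it requires boundary matching.

```python
def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks (lists of 0/1 rows)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity and boundary accuracy."""
    return (j + f) / 2


# Toy 2x3 masks: 3 pixels overlap and 4 pixels are in the union.
pred = [[0, 1, 1],
        [0, 1, 1]]
gt = [[0, 0, 1],
      [0, 1, 1]]
print(region_similarity(pred, gt))    # 0.75
print(round(j_and_f(58.1, 65.9), 1))  # 62.0, the MeViS (val_u) 3B entry
```

Some reported J&F values differ by 0.1 from the mean of the rounded J and F shown in the same row, which is consistent with averaging before rounding.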
⏱️ Temporal Sentence Grounding

(Temporal Sentence Grounding)

| Benchmark | Evaluation Results (3B) |
| --- | --- |
| Charades-STA | R@0.3: 76.1, R@0.5: 58.2, mIoU: 50.0 |
| ActivityNet-Grounding | R@0.3: 65.6, R@0.5: 45.6, mIoU: 45.1 |
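
As a reading aid for these metrics: R@m is commonly defined as the percentage of queries whose predicted time span overlaps the ground-truth span with a temporal IoU of at least m, and mIoU is the mean temporal IoU over all queries. The sketch below follows that common definition with one prediction per query; it is not the repository's evaluation script, the function names are hypothetical, and the spans are toy values.

```python
def temporal_iou(pred, gt):
    """IoU of two time spans given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds, gts, thresholds=(0.3, 0.5)):
    """Return R@m for each threshold and mIoU, all as percentages."""
    if not preds or len(preds) != len(gts):
        raise ValueError("need one prediction per ground-truth span")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    metrics = {
        f"R@{m}": 100.0 * sum(iou >= m for iou in ious) / n
        for m in thresholds
    }
    metrics["mIoU"] = 100.0 * sum(ious) / n
    return metrics


# Two toy queries: the first overlaps with IoU 0.5, the second misses entirely.
preds = [(2.0, 8.0), (0.0, 4.0)]
gts = [(4.0, 10.0), (10.0, 12.0)]
print(grounding_metrics(preds, gts))
# {'R@0.3': 50.0, 'R@0.5': 50.0, 'mIoU': 25.0}
```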

🤖 Model Zoo (TODO)

| Model Name | Base MLLM | HF Link |
| --- | --- | --- |
| MomentSeg-3B | Qwen2.5-VL-3B | [🤗 link] |
| MomentSeg-7B | Qwen2.5-VL-7B | [🤗 link] |

🚀 Training (TODO)

📊 Evaluation (TODO)

📖 Citation

Please kindly cite our paper if you find this project helpful.

@misc{momentseg,
      title={MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding}, 
      author={Ming Dai and Sen Yang and Boqiang Duan and Wankou Yang and Jingdong Wang},
      year={2025},
      eprint={2510.09274},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09274}, 
}
