We propose DisCo, a visual encapsulation method designed to yield semantically distinct and temporally coherent visual representations for Video Multi-modal Large Language Models (MLLMs). [Paper]
We use Python 3.10 and PyTorch 2.0.1 for our code.

```bash
conda create -n disco python=3.10
cd disco && pip install -r requirements.txt
```

Please follow the steps below to prepare the data for training and evaluation.
[Stage 1] vision-text alignment: We use 900K video dense captions from ShareGPTVideo and 23K image dense captions from LLaVA. We also extracted instances for the VCD module for each caption.
- Captions and extracted instances can be downloaded from here (under `caption_data`).
- Images/videos can be fetched from the data pages of ShareGPTVideo and LLaVA.
After downloading the data, set each of the following paths in `available_corpus.py`:
- `anno_path`: the path of the caption annotation file for this dataset
- `data_root`: the directory of the videos for this dataset
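A minimal sketch of what the corresponding Stage-1 entries in `available_corpus.py` might look like; the dataset keys and any fields other than `anno_path`/`data_root` are placeholders for illustration:

```python
# Hypothetical excerpt from available_corpus.py (illustrative only).
# Only anno_path and data_root are required by the steps above; the key names
# and the media_type field are placeholders.
available_corpus = {
    "sharegptvideo_caption": dict(
        anno_path="/path/to/sharegptvideo/caption_data.json",  # captions + extracted VCD instances
        data_root="/path/to/sharegptvideo/videos",             # video directory
        media_type="video",
    ),
    "llava_caption": dict(
        anno_path="/path/to/llava/caption_data.json",
        data_root="/path/to/llava/images",
        media_type="image",
    ),
}
```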
[Stage 2] instruction tuning: We follow the data recipe used in InternVideo2. Please refer to this page for details on how to download the datasets. Specifically, you only need to get these datasets:
- LLaVA reasoning & caption
- VideoChat caption
- ShareGPTVideo
- CLEVRER
- PerceptionTest
- STAR
- EgoQA
- NextQA

After downloading the data, set the `anno_path` and `data_root` for each dataset included in `available_corpus["videochat2_stage2_sh"]` in `it.py`, as sketched below.
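As a rough illustration of that edit, one entry might be filled in as below; the dataset names and entry structure are assumptions, and only the corpus key, `anno_path`, and `data_root` come from the instructions above. Stage 3 follows the same pattern with `available_corpus["videochat2_instruction_2024_0723_f16_post"]`.

```python
# Hypothetical sketch of the stage-2 corpus list in it.py (structure assumed).
available_corpus = {}  # defined earlier in it.py
available_corpus["videochat2_stage2_sh"] = [
    dict(
        anno_path="/path/to/clevrer/annotations.json",  # instruction/QA annotations
        data_root="/path/to/clevrer/videos",            # raw videos
    ),
    # ... one entry per dataset listed above (LLaVA, VideoChat caption, ShareGPTVideo, ...)
]
```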
[Stage 3] post-training (for InternVideo2-HD only): We adhere to the post-training stage of InternVideo2-HD. Please refer to this page for details. Specifically, you only need to get these datasets:
- LLaVA reasoning
- MiniGPT-4 caption
- ShareGPTVideo
- CLEVRER
- PerceptionTest
- STAR
- EgoQA
- NextQA

After downloading the data, set the `anno_path` and `data_root` for each dataset included in `available_corpus["videochat2_instruction_2024_0723_f16_post"]` in `it.py`.
We evaluate on the following benchmarks: MVBench, VideoMME, STAR, PerceptionTest, EgoSchema, and MLVU. You can download the corresponding video and QA data from their data repositories.
To train DisCo, you need to prepare the following pretrained models:
- Download Mistral-7B-Instruct-v0.3 and set `LLM_PATH` in the training scripts under `scripts/pt` to your own Mistral path.
- Download the pretrained models for InternVideo2 or InternVideo2-HD, and set `PRETRAINED_MODEL` in the training scripts under `scripts/pt` to your own model path.
If you want to evaluate the DisCo model directly, you can download the trained checkpoints from here. You will also need Mistral-7B-Instruct-v0.3.
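If convenient, Mistral-7B-Instruct-v0.3 can be fetched with `huggingface_hub`; this is only a convenience sketch (the local directory is a placeholder, and the gated repo may require accepting the license and logging in first):

```python
# Optional: download Mistral-7B-Instruct-v0.3 locally with huggingface_hub.
# Point LLM_PATH (and llm.pretrained_llm_path) at local_dir afterwards.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    local_dir="/path/to/Mistral-7B-Instruct-v0.3",  # placeholder path
)
```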
We provide the training scripts for both InternVideo2 and InternVideo2-HD.
- Stage 1: vision-text alignment
  - First, go to the config file for stage 1 and set `llm.pretrained_llm_path` to the path of your Mistral model (see the sketch after this list).
  - Then, run the corresponding script:
    ```bash
    # InternVideo2
    bash scripts/pt/internvideo2_stage1.sh
    # InternVideo2-HD
    bash scripts/pt/internvideo2_hd_stage1.sh
    ```
- Stage 2: instruction tuning
  - First, go to the config file for stage 2 and set `llm.pretrained_llm_path` to the path of your Mistral model.
  - Then, run the corresponding script, setting `PRETRAINED_MODEL` to the checkpoint path of stage 1:
    ```bash
    # InternVideo2
    bash scripts/pt/internvideo2_stage2.sh
    # InternVideo2-HD
    bash scripts/pt/internvideo2_hd_stage2_sft.sh
    ```
- Stage 3: HD tuning (for InternVideo2-HD only)
  - Run the following script, setting `PRETRAINED_MODEL` to the checkpoint path of stage 2:
    ```bash
    # InternVideo2-HD
    bash scripts/pt/internvideo2_hd_stage2_post.sh
    ```
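For orientation, the `llm.pretrained_llm_path` field referenced in stages 1 and 2 lives in the stage config files; below is a minimal sketch of that section, assuming a dict-style config (everything other than `pretrained_llm_path` is a placeholder):

```python
# Hypothetical fragment of a stage config file (not the actual config).
# Only llm.pretrained_llm_path is referenced in the steps above; adjust the path.
llm = dict(
    pretrained_llm_path="/path/to/Mistral-7B-Instruct-v0.3",  # local Mistral checkpoint
)
```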
We provide evaluation scripts for the following benchmarks:
- MVBench, VideoMME, STAR, PerceptionTest, EgoSchema, MLVU
Here is an example of evaluation on MVBench:
- Go to the MVBench evaluation script and set `CKPT_PATH` to the path of the DisCo checkpoint and `LLM_PATH` to the path of your Mistral model.
- Go to `evaluation/eval_mvbench.py` and set `data_list` and `data_dir` to the path of your own MVBench data (downloaded following Evaluation Data); a sketch follows below.
- Run `bash scripts/eval/eval_mvbench.sh`.
You can evaluate on other benchmarks in this way as well.
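As a rough illustration of the second step (the actual structure of `data_list` in `evaluation/eval_mvbench.py` may differ; the task name and file layout below are placeholders, so adapt them to the real file):

```python
# Hypothetical sketch of the MVBench data settings in evaluation/eval_mvbench.py.
# Only the paths need to point at your local MVBench download; the exact
# dict structure should follow whatever the file already uses.
data_dir = "/path/to/MVBench"  # root directory of the downloaded MVBench data

data_list = {
    # task name: (annotation json, media sub-directory)
    "Action Sequence": ("json/action_sequence.json", "video/star/Charades_v1_480/"),
    # ... remaining MVBench tasks
}
```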
```bibtex
@misc{zhao2025discodistinctcoherentvisual,
      title={DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs},
      author={Jiahe Zhao and Rongkun Zheng and Yi Wang and Helin Wang and Hengshuang Zhao},
      year={2025},
      eprint={2507.10302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.10302},
}
```

The code is adapted from the video MLLM codebase developed by Chenting Wang. Thanks for his contributions!


