Overview of the DIG Framework. An LLM first classifies the input query as either global or localized. Global queries trigger uniform sampling across the entire video. Conversely, localized queries utilize CAFS and reward assignment to generate a reward distribution; this distribution is used to construct a refined video for targeted uniform sampling. The selected frames are subsequently processed by the LMM for final inference.
- [2026-02-21] 🎉 Exciting news! Our paper has been accepted to CVPR 2026!
Set up a clean environment to avoid conflicts.
```bash
# Create and activate conda environment
conda create -n dig python=3.10 -y
conda activate dig

# Clone the repository
git clone git@github.com:Jialuo-Li/DIG.git
cd DIG

# Install dependencies
bash scripts/install.sh
```

Download the supported benchmarks and organize them in the `data/` directory.
| Dataset | Link | Description |
|---|---|---|
| MLVU | Hugging Face | Multi-Task Long Video Understanding |
| LongVideoBench | Hugging Face | Long-context video QA |
| VideoMME | Hugging Face | Comprehensive video evaluation |
Directory Structure:
```
data/
├── mlvu/
│   ├── 1.mp4
│   └── ...
├── longvideobench/
│   └── videos/
│       ├── 1.mp4
│       └── ...
└── videomme/
    └── data/
        ├── 1.mp4
        └── ...
```
We provide pre-computed query types, r-frame indices, and reward values from Qwen2.5-VL-7B/32B and Qwen3-VL-8B in the rewards/ directory. This allows you to directly evaluate DIG's performance.
1. Set the Target Model

```bash
export MODEL_NAME=Qwen/Qwen2.5-VL-7B-Instruct
# Supported: Qwen/Qwen2.5-VL-32B-Instruct, Qwen/Qwen3-VL-8B-Instruct
```

2. Video Refinement (Key Frame Selection)

Extract the optimal keyframes based on the pre-computed rewards.
```bash
# Usage: bash scripts/video_refinement.sh <dataset>
bash scripts/video_refinement.sh mlvu  # Options: longvideobench, videomme
```

3. Run Evaluation

Evaluate using the lmms-eval framework.

```bash
# Usage: bash scripts/eval/qwen25vl.sh <dataset> <method>
# For Qwen2.5-VL
bash scripts/eval/qwen25vl.sh mlvu DIG

# For Qwen3-VL
bash scripts/eval/qwen3vl.sh mlvu DIG
```

Supported Methods:
- `DIG`: Uses the DIG pipeline.
- `UNI`: Uses standard uniform sampling.
If you wish to run the entire DIG process from scratch, please follow these steps.
The LLM analyzes the user query and classifies it as Global or Localized.
```bash
# 1. Launch the LLM Server
export MODEL_NAME=Qwen/Qwen3-Next-80B-A3B-Instruct
bash scripts/launch_llm.sh

# 2. Run Identification
bash scripts/query_identification.sh mlvu  # Options: longvideobench, videomme
```

We use DINOv2 to extract features and select representative "r-frames" from the video.
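The actual selection criterion lives in `pipeline/cafs.py`; as a rough illustration of content-aware selection, a greedy farthest-point sketch over frame features (assumed here to be L2-normalized DINOv2 embeddings) could look like this. The function name and the farthest-point strategy are assumptions for exposition, not the repository's exact algorithm.

```python
import numpy as np

def select_r_frames(features: np.ndarray, k: int) -> list[int]:
    """Greedily pick k mutually dissimilar frames (farthest-point sketch).

    features: (num_frames, dim) array of L2-normalized frame embeddings.
    """
    # Start from the first frame, then repeatedly add the frame that is
    # least similar (lowest max cosine similarity) to everything selected.
    selected = [0]
    sims = features @ features[0]  # max similarity to the selected set
    for _ in range(k - 1):
        nxt = int(np.argmin(sims))
        selected.append(nxt)
        sims = np.maximum(sims, features @ features[nxt])
    return selected
```

The intuition is the same either way: r-frames should cover the video's visual content with minimal redundancy, so near-duplicate frames are collapsed into a single representative.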
```bash
bash scripts/cafs.sh mlvu  # Options: longvideobench, videomme
```

The LMM is used to score the r-frames based on their relevance to the user query.
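The shape of this scoring step can be sketched as below. The prompt wording and the 0–10 reply format are illustrative assumptions, not the repository's actual templates (those live in `utils.py`); the sketch only shows building a per-frame scoring prompt and robustly parsing a numeric reward from the model's reply.

```python
import re

def build_reward_prompt(query: str) -> str:
    """Prompt asking the LMM to rate one r-frame's relevance (hypothetical wording)."""
    return (
        "You are given a single video frame and a question.\n"
        f"Question: {query}\n"
        "On a scale of 0 to 10, how relevant is this frame to answering "
        "the question? Reply with a single number."
    )

def parse_reward(reply: str) -> float:
    """Extract the numeric score from the model reply, clamped to [0, 10]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 10.0)
```

In the real pipeline, each r-frame would be sent alongside this prompt to the launched LMM server, and the parsed scores become the reward values stored under `rewards/`.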
```bash
# 1. Launch the LMM Server (e.g., Qwen2.5-VL-7B)
export MODEL_NAME=Qwen/Qwen2.5-VL-7B-Instruct
bash scripts/launch_mllm.sh

# 2. Assign Rewards
bash scripts/reward_assignment.sh mlvu  # Options: longvideobench, videomme
# Results are saved to the 'rewards/' directory.
```

Use the generated rewards to construct the final frame input for inference (see the "Inference" section above for this step).
```
DIG/
├── data/                        # Dataset storage
├── lmms-eval/                   # Evaluation framework
├── pipeline/                    # Core DIG implementation
│   ├── cafs.py                  # Content-Aware Frame Selection
│   ├── query_identification.py  # Global vs. Localized classification
│   ├── reward_assignment.py     # Frame relevance scoring
│   └── video_refinement.py      # Final frame selection
├── rewards/                     # Pre-computed metadata & rewards
├── scripts/                     # Execution scripts
│   ├── eval/                    # Evaluation launchers
│   ├── launch_llm.sh            # vLLM server for Query Identification
│   └── launch_mllm.sh           # vLLM server for Reward Assignment
├── utils.py                     # Dataset loader & prompt templates
└── requirements.txt             # Python dependencies
```
If you find DIG useful for your research or projects, we would greatly appreciate it if you could cite our work:
```bibtex
@misc{li2025dividegroundadaptingframe,
      title={Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding},
      author={Jialuo Li and Bin Li and Jiahao Li and Yan Lu},
      year={2025},
      eprint={2512.04000},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.04000},
}
```