
DIG: Adapting Frame Selection to Query Types for Long-Form Video Understanding


Jialuo Li (1,2,*), Bin Li (2), Jiahao Li (2), Yan Lu (2)
(1) Tsinghua University  (2) Microsoft Research Asia  (*) Work done during Jialuo's internship at MSRA

Overview of the DIG Framework. An LLM first classifies the input query as either global or localized. Global queries trigger uniform sampling across the entire video, while localized queries use CAFS and reward assignment to generate a reward distribution, which is then used to construct a refined video for targeted uniform sampling. The selected frames are passed to the LMM for final inference.


📰 News

  • [2026-02-21] 🎉 Exciting news! Our paper has been accepted to CVPR 2026!

🚀 Quick Start

1. Installation

Set up a clean environment to avoid conflicts.

# Create and activate conda environment
conda create -n dig python=3.10 -y
conda activate dig

# Clone the repository
git clone git@github.com:Jialuo-Li/DIG.git
cd DIG

# Install dependencies
bash scripts/install.sh

2. Data Preparation

Download the supported benchmarks and organize them in the data/ directory.

Dataset         Link          Description
MLVU            Hugging Face  Multi-Task Long Video Understanding
LongVideoBench  Hugging Face  Long-context video QA
VideoMME        Hugging Face  Comprehensive video evaluation

Directory Structure:

data/
├── mlvu/
│   ├── 1.mp4
│   └── ...
├── longvideobench/
│   └── videos/
│       ├── 1.mp4
│       └── ...
└── videomme/
    └── data/
        ├── 1.mp4
        └── ...
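Before launching any script, it can save a failed run to confirm the layout above is in place. A minimal, hypothetical checker (the helper name and directory list are assumptions drawn from the tree above, not part of the repo):

```python
from pathlib import Path
import os, tempfile

# Expected video directories under the data root (matching the tree above).
EXPECTED_DIRS = ["mlvu", "longvideobench/videos", "videomme/data"]

def missing_dirs(root="data"):
    """Return the expected dataset directories that are absent under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]

# Example: a throwaway root where only MLVU has been downloaded.
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "mlvu"))
absent = missing_dirs(tmp)  # the two datasets still to prepare
```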

3. Inference

We provide pre-computed query types, r-frame indices, and reward values from Qwen2.5-VL-7B/32B and Qwen3-VL-8B in the rewards/ directory. This allows you to directly evaluate DIG's performance.

1. Set the Target Model

export MODEL_NAME=Qwen/Qwen2.5-VL-7B-Instruct 
# Supported: Qwen/Qwen2.5-VL-32B-Instruct, Qwen/Qwen3-VL-8B-Instruct

2. Video Refinement (Keyframe Selection)

Extract the optimal keyframes based on the pre-computed rewards.

# Usage: bash scripts/video_refinement.sh <dataset>
bash scripts/video_refinement.sh mlvu # Options: longvideobench, videomme

3. Run Evaluation

Evaluate using the lmms-eval framework.

# Usage: bash scripts/eval/qwen25vl.sh <dataset> <method>

# For Qwen2.5-VL
bash scripts/eval/qwen25vl.sh mlvu DIG 

# For Qwen3-VL
bash scripts/eval/qwen3vl.sh mlvu DIG 

Supported Methods:

  • DIG: Uses the full DIG pipeline.
  • UNI: Uses standard uniform sampling.
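The difference between the two methods can be sketched roughly as follows. This is a hypothetical illustration, not DIG's actual code: the function name, the above-average-reward threshold, and the per-frame reward shape are all assumptions.

```python
import numpy as np

def select_frames(query_type, total_frames, budget, rewards=None):
    """Sketch of the two branches: UNI (and global queries) sample
    uniformly over the whole video; localized queries under DIG keep
    only frames with above-average reward (the "refined video") and
    sample uniformly within that set."""
    if query_type == "global" or rewards is None:
        # Evenly spaced indices across the full video.
        return np.linspace(0, total_frames - 1, budget).astype(int)
    rewards = np.asarray(rewards, dtype=float)
    refined = np.flatnonzero(rewards >= rewards.mean())  # refined video
    picks = np.linspace(0, len(refined) - 1, budget).astype(int)
    return refined[picks]

uni = select_frames("global", total_frames=1000, budget=8)
dig = select_frames("localized", 1000, 8, rewards=np.random.rand(1000))
```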

🛠️ Full Pipeline Workflow

If you wish to run the entire DIG process from scratch, please follow these steps.

Step 1: Query Identification

An LLM classifies the user query as either Global or Localized.

# 1. Launch the LLM Server
export MODEL_NAME=Qwen/Qwen3-Next-80B-A3B-Instruct
bash scripts/launch_llm.sh

# 2. Run Identification
bash scripts/query_identification.sh mlvu # Options: longvideobench, videomme
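The classifier boils down to a single prompt-and-parse round trip against the LLM server. A hypothetical sketch of that exchange (the prompt wording and helper names below are assumptions; the repository's actual prompt templates live in utils.py):

```python
# Hypothetical prompt for the Global-vs-Localized classifier.
PROMPT = (
    "Decide whether answering the following video question requires the "
    "entire video (GLOBAL: summaries, overall plot, counting across scenes) "
    "or only a specific moment (LOCALIZED).\n"
    "Question: {query}\n"
    "Answer with exactly one word: GLOBAL or LOCALIZED."
)

def build_prompt(query: str) -> str:
    return PROMPT.format(query=query)

def parse_label(response: str) -> str:
    # Tolerate extra text around the label in the LLM's reply.
    return "localized" if "LOCALIZED" in response.upper() else "global"

prompt = build_prompt("When does the chef add the salt?")
```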

Step 2: Content-Aware Frame Selection (CAFS)

We use DINOv2 to extract features and select representative "r-frames" from the video.

bash scripts/cafs.sh mlvu # Options: longvideobench, videomme
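Selecting representative frames from per-frame embeddings is, at heart, a diversity-selection problem. The sketch below uses greedy farthest-point sampling as a generic stand-in; it is not the exact rule in pipeline/cafs.py, and random vectors replace real DINOv2 features.

```python
import numpy as np

def select_r_frames(features, k):
    """Greedy farthest-point selection of k representative frames.

    `features` is an (n_frames, dim) array of per-frame embeddings.
    Each step picks the frame farthest from all frames chosen so far,
    spreading the selection across the video's visual content.
    """
    features = np.asarray(features, dtype=float)
    chosen = [0]  # seed with the first frame
    # Distance from every frame to its nearest already-chosen frame.
    dists = np.linalg.norm(features - features[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))  # the most novel remaining frame
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(features - features[nxt], axis=1))
    return sorted(chosen)

rng = np.random.default_rng(0)
r_frames = select_r_frames(rng.normal(size=(200, 32)), k=16)
```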

Step 3: Reward Assignment

The LMM scores the r-frames by their relevance to the user query.

# 1. Launch the LMM Server (e.g., Qwen2.5-VL-7B)
export MODEL_NAME=Qwen/Qwen2.5-VL-7B-Instruct
bash scripts/launch_mllm.sh

# 2. Assign Rewards
bash scripts/reward_assignment.sh mlvu # Options: longvideobench, videomme
# Results are saved to the 'rewards/' directory.
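Downstream, the raw relevance scores are treated as a reward distribution over r-frames. The transform actually used is in pipeline/reward_assignment.py; a temperature-scaled softmax is one plausible stand-in:

```python
import numpy as np

def reward_distribution(scores, temperature=1.0):
    """Turn raw per-r-frame relevance scores into a probability
    distribution. Lower temperature sharpens the distribution,
    concentrating the sampling budget on the top-scoring frames."""
    s = np.asarray(scores, dtype=float) / temperature
    s -= s.max()          # subtract max for numerical stability
    w = np.exp(s)
    return w / w.sum()

dist = reward_distribution([0.1, 0.9, 0.3, 0.7], temperature=0.5)
```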

Step 4: Video Refinement

Use the generated rewards to construct the final frame input for inference (see the "Inference" section above).
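One plausible reading of this refinement step is a proportional budget split: r-frames with higher rewards contribute more frames to the final input. The real logic is in pipeline/video_refinement.py; in this assumed sketch, repeating each r-frame index stands in for sampling frames from its surrounding segment.

```python
import numpy as np

def allocate_budget(r_frames, rewards, budget):
    """Split the frame budget across r-frames in proportion to reward."""
    rewards = np.asarray(rewards, dtype=float)
    # Integer share of the budget for each r-frame.
    counts = np.floor(rewards / rewards.sum() * budget).astype(int)
    # Hand any leftover budget to the highest-reward r-frames.
    for i in np.argsort(-rewards)[: budget - counts.sum()]:
        counts[i] += 1
    out = []
    for idx, c in zip(r_frames, counts):
        out.extend([idx] * int(c))
    return out

frames = allocate_budget([10, 250, 620], rewards=[0.1, 0.6, 0.3], budget=8)
```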


📂 Project Structure

DIG/
├── data/                   # Dataset storage
├── lmms-eval/              # Evaluation framework
├── pipeline/               # Core DIG implementation
│   ├── cafs.py             # Content-Aware Frame Selection
│   ├── query_identification.py # Global vs. Localized classification
│   ├── reward_assignment.py    # Frame relevance scoring
│   └── video_refinement.py     # Final frame selection 
├── rewards/                # Pre-computed metadata & rewards
├── scripts/                # Execution scripts
│   ├── eval/               # Evaluation launchers
│   ├── launch_llm.sh       # vLLM Server for Query Identification
│   └── launch_mllm.sh      # vLLM Server for Reward Assignment
├── utils.py                # Dataset loader & Prompt templates
└── requirements.txt        # Python dependencies

🤝 Citation

If you find DIG useful for your research or projects, please consider citing our work:

@misc{li2025dividegroundadaptingframe,
      title={Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding}, 
      author={Jialuo Li and Bin Li and Jiahao Li and Yan Lu},
      year={2025},
      eprint={2512.04000},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.04000}, 
}
