VideoDetective is a plug-and-play inference framework for long-video understanding that integrates extrinsic query relevance with intrinsic video structure.
Keywords: long video understanding, video question answering, multimodal large language models
By modeling the video as a Spatio-Temporal Affinity Graph, it performs an iterative Hypothesis-Verification-Refinement loop that propagates relevance signals from sparse observations to the entire video. This lets the model "See Less but Know More": it accurately localizes the critical clues needed for complex reasoning under a limited context budget.
This repository contains a runnable demo script: scripts/test_run.py.
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video’s intrinsic structure and varying relevance across segments. Motivated by this, VideoDetective jointly leverages the query and intrinsic inter-segment correlations to model a query-relevance distribution over the entire video under a limited observation budget (“See Less but Know More”).
- Visual–temporal affinity graph: divide a video into segments and represent them as a visual–temporal affinity graph built from visual similarity and temporal proximity.
- Hypothesis–Verification–Refinement loop: estimate the relevance of observed segments to the query and propagate it to unseen segments, yielding a global relevance distribution that guides localization of the most critical segments for final answering from sparse observations.
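The two components above can be sketched roughly as follows. This is an illustrative toy implementation, not the paper's exact formulation: the cosine-similarity affinity, Gaussian temporal kernel width `sigma_t`, damping factor `alpha`, and clamping of observed scores are all assumptions.

```python
import math

def affinity(features, times, sigma_t=2.0):
    """Build a dense affinity matrix combining visual cosine similarity
    with temporal proximity (Gaussian kernel). Illustrative sketch only."""
    n = len(features)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            ni = math.sqrt(sum(a * a for a in features[i]))
            nj = math.sqrt(sum(b * b for b in features[j]))
            vis = dot / (ni * nj + 1e-8)                      # visual similarity
            tmp = math.exp(-((times[i] - times[j]) ** 2)
                           / (2 * sigma_t ** 2))              # temporal proximity
            A[i][j] = vis * tmp
    return A

def propagate(A, observed, steps=10, alpha=0.8):
    """Spread relevance from observed segments to unseen ones,
    label-propagation style: observed scores stay clamped each step,
    unseen scores move toward the affinity-weighted neighbour average."""
    n = len(A)
    r = [observed.get(i, 0.0) for i in range(n)]
    for _ in range(steps):
        new = []
        for i in range(n):
            if i in observed:
                new.append(observed[i])  # keep verified evidence fixed
                continue
            w = sum(A[i])
            agg = sum(A[i][j] * r[j] for j in range(n)) / (w + 1e-8)
            new.append(alpha * agg + (1 - alpha) * r[i])
        r = new
    return r
```

Under this sketch, a segment that looks similar to (and is temporally near) an observed relevant segment inherits a high score, while visually and temporally distant segments stay near zero, giving the global relevance distribution used for localization.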
VideoDetective consistently achieves substantial gains across a wide range of mainstream MLLMs.
- Python: recommended 3.9+
- ffmpeg: required for audio extraction if ASR is enabled
- macOS:
  ```shell
  brew install ffmpeg
  ```
- Ubuntu/Debian:
  ```shell
  sudo apt install ffmpeg
  ```
- macOS:
  ```shell
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```
- Copy the template:
  ```shell
  cp .env.example .env
  ```
- Fill in at least the VLM API settings (OpenAI-compatible):
  - `VIDEODETECTIVE_API_KEY`
  - `VIDEODETECTIVE_BASE_URL` (e.g., a DashScope / OpenAI / OpenRouter compatible base URL)
  - `VIDEODETECTIVE_VLM_MODEL` (e.g., `qwen3-vl-8b-instruct`)
Others:
- Text LLM (planner / query decomposition; falls back to the VLM settings if not set):
  `VIDEODETECTIVE_LLM_MODEL`, `VIDEODETECTIVE_LLM_API_KEY`, `VIDEODETECTIVE_LLM_BASE_URL`
- Pipeline:
  `VIDEODETECTIVE_MAX_FRAMES_PER_CALL`, `ENABLE_MULTI_ROUTE_RECALL`, `USE_VLM_RELEVANCE`, `INCLUDE_ANSWER_EVIDENCE`
- ASR:
  `VIDEODETECTIVE_ENABLE_ASR`, `VIDEODETECTIVE_WHISPER_MODEL`, `VIDEODETECTIVE_ASR_DEVICE`
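A minimal `.env` might look like the sketch below. The variable names come from the lists above; the key, endpoint, and values are placeholders you must replace with your own.

```dotenv
# Required: OpenAI-compatible VLM endpoint (placeholder values)
VIDEODETECTIVE_API_KEY=sk-your-key-here
VIDEODETECTIVE_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
VIDEODETECTIVE_VLM_MODEL=qwen3-vl-8b-instruct

# Optional: ASR is off unless explicitly enabled
VIDEODETECTIVE_ENABLE_ASR=false
```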
Notes:
- Environment loading is implemented in `config/settings.py` and reads `.env` from the project root.
- `src/agent/llm_client.py` supports custom auth headers for some OpenAI-compatible proxies via `VIDEODETECTIVE_AUTH_HEADER_NAME` and `VIDEODETECTIVE_AUTH_PREFIX`.
```shell
python scripts/test_run.py \
    --video_path /path/to/video.mp4 \
    --question "What is the man doing?" \
    --options "A. Running, B. Walking, C. Sitting, D. Standing" \
    --output_dir output \
    --max_steps 10 \
    --total_budget 32
```
For each run, you should get:
- Full results: `output/<video_id>_results.json` (includes the prediction and the full processing information)
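To consume the results file programmatically, something like the helper below works; note that the key name `"prediction"` is an assumption based on the description above, so check your own `<video_id>_results.json` for the exact schema produced by `scripts/test_run.py`.

```python
import json
from pathlib import Path

def load_prediction(results_path):
    """Return the predicted answer stored in a results JSON file.

    The "prediction" key is an assumed name; inspect the actual
    output of scripts/test_run.py for the real schema.
    """
    data = json.loads(Path(results_path).read_text())
    return data.get("prediction")
```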
Minimal usage in Python:
```python
from src.pipeline import VideoDetective

detective = VideoDetective(verbose=True)
result = detective.solve(
    video_path="/path/to/video.mp4",
    query="Question text. Options: A. ..., B. ..., C. ..., D. ...",
    max_steps=10,
    total_budget=32,
)
print(result.answer)
# result.debug_info contains debugging artifacts such as the belief history.
```

If you find this work useful, please consider citing:
```bibtex
@misc{yang2026videodetective,
  title   = {VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding},
  author  = {Yang, Ruoliu and Wu, Chu and Shan, Caifeng and He, Ran and Fu, Chaoyou},
  journal = {arXiv preprint arXiv:2603.22285},
  year    = {2026}
}
```

