VideoDetective is a plug-and-play inference framework for long-video understanding that integrates extrinsic query relevance with intrinsic video structure.
Keywords: long video understanding, video question answering, multimodal large language models
By modeling the video as a Spatio-Temporal Affinity Graph, it performs an iterative Hypothesis-Verification-Refinement loop that propagates relevance signals from sparse observations to the entire video. This lets the model "See Less but Know More": it accurately localizes the critical clues needed for complex reasoning under a limited context budget.
This repository contains a runnable demo script: scripts/test_run.py.
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video’s intrinsic structure and varying relevance across segments. Motivated by this, VideoDetective jointly leverages the query and intrinsic inter-segment correlations to model a query-relevance distribution over the entire video under a limited observation budget (“See Less but Know More”).
- Visual–temporal affinity graph: divide a video into segments and represent them as a visual–temporal affinity graph built from visual similarity and temporal proximity.
- Hypothesis–Verification–Refinement loop: estimate the relevance of observed segments to the query and propagate it to unseen segments, yielding a global relevance distribution that guides localization of the most critical segments for final answering from sparse observations.
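The two components above can be sketched roughly as follows. This is an illustrative toy implementation, not the paper's exact formulation: the cosine-similarity affinity, Gaussian temporal kernel width `sigma_t`, damping factor `alpha`, and clamping of observed scores are all assumptions.

```python
import math

def affinity(features, times, sigma_t=2.0):
    """Build a dense affinity matrix combining visual cosine similarity
    with temporal proximity (Gaussian kernel). Illustrative sketch only."""
    n = len(features)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            ni = math.sqrt(sum(a * a for a in features[i]))
            nj = math.sqrt(sum(b * b for b in features[j]))
            vis = dot / (ni * nj + 1e-8)                      # visual similarity
            tmp = math.exp(-((times[i] - times[j]) ** 2)
                           / (2 * sigma_t ** 2))              # temporal proximity
            A[i][j] = vis * tmp
    return A

def propagate(A, observed, steps=10, alpha=0.8):
    """Spread relevance from observed segments to unseen ones,
    label-propagation style: observed scores stay clamped each step,
    unseen scores move toward the affinity-weighted neighbour average."""
    n = len(A)
    r = [observed.get(i, 0.0) for i in range(n)]
    for _ in range(steps):
        new = []
        for i in range(n):
            if i in observed:
                new.append(observed[i])  # keep verified evidence fixed
                continue
            w = sum(A[i])
            agg = sum(A[i][j] * r[j] for j in range(n)) / (w + 1e-8)
            new.append(alpha * agg + (1 - alpha) * r[i])
        r = new
    return r
```

Under this sketch, a segment that looks similar to (and is temporally near) an observed relevant segment inherits a high score, while visually and temporally distant segments stay near zero, giving the global relevance distribution used for localization.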
VideoDetective consistently achieves substantial gains across a wide range of mainstream MLLMs.
- Python: recommended 3.9+
- ffmpeg: required for audio extraction if ASR is enabled
- macOS:
  ```shell
  brew install ffmpeg
  ```
- Ubuntu/Debian:
  ```shell
  sudo apt install ffmpeg
  ```
- macOS:
  ```shell
  python -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
  ```
- Copy the template:
  ```shell
  cp .env.example .env
  ```
- Fill in at least the VLM API settings (OpenAI-compatible):
  - `VIDEODETECTIVE_API_KEY`
  - `VIDEODETECTIVE_BASE_URL` (e.g., a DashScope / OpenAI / OpenRouter compatible base URL)
  - `VIDEODETECTIVE_VLM_MODEL` (e.g., `qwen3-vl-8b-instruct`)
Others:
- Text LLM (planner / query decomposition; falls back to the VLM settings if not set):
  `VIDEODETECTIVE_LLM_MODEL`, `VIDEODETECTIVE_LLM_API_KEY`, `VIDEODETECTIVE_LLM_BASE_URL`
- Pipeline:
  `VIDEODETECTIVE_MAX_FRAMES_PER_CALL`, `ENABLE_MULTI_ROUTE_RECALL`, `USE_VLM_RELEVANCE`, `INCLUDE_ANSWER_EVIDENCE`
- ASR:
  `VIDEODETECTIVE_ENABLE_ASR`, `VIDEODETECTIVE_WHISPER_MODEL`, `VIDEODETECTIVE_ASR_DEVICE`
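A minimal `.env` might look like the sketch below. The variable names come from the lists above; the key, endpoint, and values are placeholders you must replace with your own.

```dotenv
# Required: OpenAI-compatible VLM endpoint (placeholder values)
VIDEODETECTIVE_API_KEY=sk-your-key-here
VIDEODETECTIVE_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
VIDEODETECTIVE_VLM_MODEL=qwen3-vl-8b-instruct

# Optional: ASR is off unless explicitly enabled
VIDEODETECTIVE_ENABLE_ASR=false
```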
Notes:
- Environment loading is implemented in `config/settings.py` and reads `.env` from the project root.
- `src/agent/llm_client.py` supports custom auth headers for some OpenAI-compatible proxies via `VIDEODETECTIVE_AUTH_HEADER_NAME` and `VIDEODETECTIVE_AUTH_PREFIX`.
```shell
python scripts/test_run.py \
    --video_path /path/to/video.mp4 \
    --question "What is the man doing?" \
    --options "A. Running, B. Walking, C. Sitting, D. Standing" \
    --output_dir output \
    --max_steps 10 \
    --total_budget 32
```
For each run, you should get:
- Full results: `output/<video_id>_results.json` (includes the prediction and the full processing information)
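To consume the results file programmatically, something like the helper below works; note that the key name `"prediction"` is an assumption based on the description above, so check your own `<video_id>_results.json` for the exact schema produced by `scripts/test_run.py`.

```python
import json
from pathlib import Path

def load_prediction(results_path):
    """Return the predicted answer stored in a results JSON file.

    The "prediction" key is an assumed name; inspect the actual
    output of scripts/test_run.py for the real schema.
    """
    data = json.loads(Path(results_path).read_text())
    return data.get("prediction")
```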
Minimal usage in Python:
```python
from src.pipeline import VideoDetective

detective = VideoDetective(verbose=True)
result = detective.solve(
    video_path="/path/to/video.mp4",
    query="Question text. Options: A. ..., B. ..., C. ..., D. ...",
    max_steps=10,
    total_budget=32,
)
print(result.answer)
# result.debug_info contains debugging artifacts such as the belief history.
```

If you find this work useful, please consider citing:
```bibtex
@misc{yang2026videodetective,
  title   = {VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding},
  author  = {Yang, Ruoliu and Wu, Chu and Shan, Caifeng and He, Ran and Fu, Chaoyou},
  journal = {arXiv preprint arXiv:2603.22285},
  year    = {2026}
}
```

