TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

Authors

  • Shicheng Li, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Lei Li, The University of Hong Kong
  • Kun Ouyang, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Shuhuai Ren, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Yuanxin Liu, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Yuanxing Zhang, Kling Team
  • Fuzheng Zhang, Kuaishou Technology
  • Lingpeng Kong, The University of Hong Kong
  • Qi Liu, The University of Hong Kong
  • Xu Sun, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

DOI:

https://doi.org/10.1609/aaai.v40i8.37565

Abstract

Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the training data and over-reliance on the next-token prediction paradigm, which together result in an absence of temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference Learning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To counter the scarcity of temporal information in data, we introduce an automated pipeline that constructs temporality-intensive preference pairs in three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning scheme that progressively increases perturbation difficulty to maximize data efficiency, and the application of preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
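As a concrete anchor for the training signal named above, the standard DPO objective is sketched below. Pairing the response on the clean video as the preferred answer $y_w$ and the response on the perturbed video as the dispreferred answer $y_l$ is our reading of the pipeline described in this abstract, not notation taken from the paper itself.

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Here $x$ is the video–question input, $\pi_\theta$ the model being tuned, $\pi_{\mathrm{ref}}$ a frozen reference copy, $\sigma$ the logistic function, and $\beta$ a coefficient controlling preference sharpness. Under Progressive Pre-SFT Alignment, this objective is applied before instruction tuning, with the preference data $\mathcal{D}$ scheduled from easy to hard perturbations.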

Published

2026-03-14

How to Cite

Li, S., Li, L., Ouyang, K., Ren, S., Liu, Y., Zhang, Y., … Sun, X. (2026). TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6378–6386. https://doi.org/10.1609/aaai.v40i8.37565

Section

AAAI Technical Track on Computer Vision V