TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

Authors

  • Shicheng Li, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Lei Li, The University of Hong Kong
  • Kun Ouyang, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Shuhuai Ren, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Yuanxin Liu, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
  • Yuanxing Zhang, Kling Team
  • Fuzheng Zhang, Kuaishou Technology
  • Lingpeng Kong, The University of Hong Kong
  • Qi Liu, The University of Hong Kong
  • Xu Sun, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University

DOI:

https://doi.org/10.1609/aaai.v40i8.37565

Abstract

Video Large Language Models (Video LLMs) have achieved significant success by adopting the paradigm of large-scale pre-training followed by supervised fine-tuning (SFT). However, existing approaches struggle with temporal reasoning due to weak temporal correspondence in the training data and over-reliance on the next-token prediction paradigm, which together result in an absence of temporal supervision. To address these limitations, we propose TEMPLE (TEMporal Preference Learning), a systematic framework that enhances temporal reasoning capabilities through Direct Preference Optimization (DPO). To counter the scarcity of temporal information in data, we introduce an automated pipeline that constructs temporality-intensive preference pairs in three steps: selecting temporally rich videos, designing video-specific perturbation strategies, and evaluating model responses on clean and perturbed inputs. Complementing this data pipeline, we provide additional supervision signals via preference learning and propose a novel Progressive Pre-SFT Alignment strategy featuring two key innovations: a curriculum learning scheme that progressively increases perturbation difficulty to maximize data efficiency, and the application of preference optimization before instruction tuning to incentivize fundamental temporal alignment. Extensive experiments demonstrate that our approach consistently improves Video LLM performance across multiple benchmarks with a relatively small set of self-generated DPO data. Our findings highlight TEMPLE as a scalable and efficient complement to SFT-based methods, paving the way for developing reliable Video LLMs.
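As a concrete anchor for the training signal named above, the standard DPO objective is sketched below. Pairing the response on the clean video as the preferred answer $y_w$ and the response on the perturbed video as the dispreferred answer $y_l$ is our reading of the pipeline described in this abstract, not notation taken from the paper itself.

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

Here $x$ is the video–question input, $\pi_\theta$ the model being tuned, $\pi_{\mathrm{ref}}$ a frozen reference copy, $\sigma$ the logistic function, and $\beta$ a coefficient controlling preference sharpness. Under Progressive Pre-SFT Alignment, this objective is applied before instruction tuning, with the preference data $\mathcal{D}$ scheduled from easy to hard perturbations.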

Published

2026-03-14

How to Cite

Li, S., Li, L., Ouyang, K., Ren, S., Liu, Y., Zhang, Y., … Sun, X. (2026). TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment. Proceedings of the AAAI Conference on Artificial Intelligence, 40(8), 6378–6386. https://doi.org/10.1609/aaai.v40i8.37565

Section

AAAI Technical Track on Computer Vision V