
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs (ICCV 2025)

arXiv | Project Page

TL;DR

STTM is a training-free spatio-temporal token merging method that supports KV-cache reuse. It operates in two steps: (1) spatial merging based on a quadtree structure, and (2) temporal merging of the resulting multi-granular spatial tokens.
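
The snippet below is only a minimal, single-level sketch of this two-step idea in plain PyTorch, not the paper's actual algorithm or this repository's implementation: it merges 2×2 patch cells whose tokens are mutually similar (a quadtree-style spatial step) and then drops current-frame tokens that are near-duplicates of previous-frame tokens (a temporal step). All thresholds, shapes, and function names are hypothetical.

# A minimal, single-level illustration of the two-step idea; NOT the paper's
# exact algorithm (which merges tokens at multiple quadtree granularities).
import torch
import torch.nn.functional as F

def spatial_merge(frame, cell=2, sim_thr=0.9):
    # frame: (H, W, D) patch tokens of one video frame.
    H, W, D = frame.shape
    out = []
    for y in range(0, H, cell):
        for x in range(0, W, cell):
            block = frame[y:y + cell, x:x + cell].reshape(-1, D)
            sims = F.cosine_similarity(block.unsqueeze(0), block.unsqueeze(1), dim=-1)
            if sims.min() >= sim_thr:
                out.append(block.mean(dim=0, keepdim=True))  # homogeneous cell -> one coarse token
            else:
                out.append(block)                            # heterogeneous cell -> keep fine tokens
    return torch.cat(out, dim=0)

def temporal_merge(prev_tokens, cur_tokens, sim_thr=0.9):
    # Drop current-frame tokens that are near-duplicates of previous-frame tokens.
    sims = F.cosine_similarity(cur_tokens.unsqueeze(1), prev_tokens.unsqueeze(0), dim=-1)
    return cur_tokens[sims.max(dim=1).values < sim_thr]

# Toy usage: two frames, each a 4x4 grid of 8-dim patch tokens.
frames = torch.randn(2, 4, 4, 8)
prev = spatial_merge(frames[0])
cur = temporal_merge(prev, spatial_merge(frames[1]))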

STTM is validated on three model families: LLaVA-Video-7B/72B, LLaVA-OneVision-7B, and Qwen2VL-7B. Evaluation is conducted across six video QA benchmarks:

  • NIAH: VNBench
  • Long videos: Video-MME; LongVideoBench; MLVU
  • Short videos: NExT-QA; EgoSchema

Update

  • 🗓️ Coming in August 2025: Token merging demo code will be released - stay tuned!
  • 🗓️ July 26, 2025: Code is now available!
  • 🗓️ June 26, 2025: STTM is accepted to ICCV 2025

Environment Setup

git clone https://github.com/HYUNJS/STTM.git
cd STTM

## Option 1: conda (virtualenv was used in the original experiments).
conda create -n sttm python=3.10 -y
conda activate sttm

pip install -e ".[train]" --extra-index-url https://download.pytorch.org/whl/cu121  # for cu121 - default is cu124
pip install flash-attn==2.7.3 --no-build-isolation # compatible version with torch==2.5.1
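
As an optional sanity check (not part of the repository), the pinned builds can be verified from Python before running any experiments:

# Verify the installed torch / flash-attn builds and GPU visibility.
import torch

print("torch:", torch.__version__)             # expected 2.5.1 per the pins above
print("CUDA build:", torch.version.cuda)       # cu121 or cu124 depending on the index URL used
print("GPU available:", torch.cuda.is_available())

import flash_attn
print("flash-attn:", flash_attn.__version__)   # expected 2.7.3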

🗂️ Dataset Setup

Please place the model checkpoints in the ./ckpts/ folder.

The datasets are organized as follows:

datasets/
├── egoschema/
├── longvideobench/
├── mlvu/
├── nextqa/
├── videomme/
└── vnbench/
    ├── annotations/
    ├── videos/ (Optional) for feature extraction and visualization
    └── preprocess_data/
        ├── {model_name}/
        │   └── {frame_sampling_name}/
        │       ├── features/
        │       │   └── {vid}.pt
        │       └── metadata/
        │           └── {vid}.pkl
        └── llava-video-7b-qwen2-video-only/
            └── F-180_fps-1/
                ├── features/
                │   └── 10109006686_cnt_edit1.pt
                └── metadata/
                    └── 10109006686_cnt_edit1.pkl
  • Each benchmark (e.g., egoschema, longvideobench, etc.) has its own folder.
  • videos/: Raw video files (can be removed after feature extraction).
  • annotations/: Annotation files for the benchmark (some are reformatted). The reformatted annotations are provided in the sttm_annotations/ folder; please copy them here when setting up the datasets.
  • preprocess_data/: Stores preprocessed features and metadata.
  • Model-specific preprocessed data is stored in the {model_name}/ folder; llava-video-7b-qwen2-video-only/ is an example of such a model directory.
  • {frame_sampling_name}/: Name of frame sampling strategy used for feature extraction (e.g., F-128_fps-1 or F-180_fps-1).
  • features/: Extracted video features ({vid}.pt).
  • metadata/: Associated metadata ({vid}.pkl). See the loading sketch below.
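
The sketch below shows one way such preprocessed files could be loaded, using the example paths from the tree above; the exact contents of the .pt and .pkl files are defined by the feature-extraction scripts (llava/eval/video_feat_{model_name}.py), so treat the variable names here as assumptions.

# Hypothetical loading sketch; the feature/metadata schema comes from the
# extractor scripts, not from this snippet.
import pickle
from pathlib import Path

import torch

root = Path("datasets/vnbench/preprocess_data/llava-video-7b-qwen2-video-only/F-180_fps-1")
vid = "10109006686_cnt_edit1"

features = torch.load(root / "features" / f"{vid}.pt", map_location="cpu")
with open(root / "metadata" / f"{vid}.pkl", "rb") as f:
    metadata = pickle.load(f)

print(type(features), type(metadata))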

To help you get started easily, we provide preprocessed feature data for Video-MME and VNBench on HuggingFace. Each dataset includes multiple frame sampling setups (e.g., F-64_fps-1, F-128_fps-1). Please use the Hugging Face Hub API to selectively download only the configurations you need.
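
For example, snapshot_download from huggingface_hub can filter files by path pattern. The repository ID and the pattern below are placeholders (this README does not spell them out), so substitute the actual dataset repo and the frame-sampling folder you need:

# Hypothetical selective download via the Hugging Face Hub API.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="REPO_ID",                                # placeholder: the STTM preprocessed-data repo
    repo_type="dataset",
    allow_patterns=["videomme/**/F-128_fps-1/**"],    # assumes the repo mirrors the local layout
    local_dir="datasets",
)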

📁 File Structure

The project is organized into modular components for token merging, model adaptation, and evaluation. Below is a brief overview of the key directories and scripts:

  • token_merging_utils/
    Core implementations of the token merging algorithms.

  • token_merging_monkey_patch/
    Monkey patch files for injecting token merging into intermediate LLM layers of LLaVA-Video and LLaVA-OneVision models.

  • token_merging_qwen2vl_monkey_patch/
    Monkey patch files tailored for the Qwen2VL model.

  • llava/eval/video_feat_{model_name}.py
    Video feature extractor script.
    ➤ Example: video_feat_llavavideo.py

  • llava/eval/eval_vidqa_by_feat_{model_name}.py
    Video QA evaluation using pre-extracted features.

  • llava/eval/eval_vidqa_by_video_{model_name}.py
    Video QA evaluation directly from raw video input.

  • llava/eval/metric_{dataset_name}.py
    Metric computation scripts specific to each dataset.
    ➤ Example: metric_vnbench.py, metric_videomme.py

🏃‍♂️ How to Run

🔹 Frame Extraction

To extract video frames and features, refer to the following script:

  • scripts/eval/run_feat_extr.sh – Example commands for running feature extraction.

🔹 Reproducible Evaluation

For reproducible results, we provide a --reproduce flag that sets a fixed random seed and enables deterministic CUDA operations.

  • scripts/eval/run_vidqa.sh – Contains example commands for video QA evaluation with reproducibility enabled. The basic running format is:
CUDA_VISIBLE_DEVICES=${device} \
python llava/eval/eval_vidqa_by_feat_{model_name}.py \
  --reproduce \
  ${<data_loader_cfg>} \
  ${<model_cfg>} \
  ${<token_reduction_cfg>}
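
As a rough illustration of what such a reproducibility flag typically entails (the repository's actual --reproduce implementation may differ in details):

# Illustrative only: a common fixed-seed, deterministic-CUDA setup.
import os
import random

import numpy as np
import torch

def set_reproducible(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"    # needed by some deterministic kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True, warn_only=True)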

Citation

If you find this project helpful for your research or applications, please cite our paper:

@article{hyun2025multi,
  title={Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs},
  author={Hyun, Jeongseok and Hwang, Sukjun and Han, Su Ho and Kim, Taeoh and Lee, Inwoong and Wee, Dongyoon and Lee, Joon-Young and Kim, Seon Joo and Shim, Minho},
  journal={arXiv preprint arXiv:2507.07990},
  year={2025}
}

Acknowledgement 🙏

We would like to thank the authors of the following projects for their valuable contributions, which our work builds upon or references:

  • LLaVA-NeXT: We use its codebase for the LLaVA architecture, including the llava-video and llava-onevision models.
  • ToMe, DyCoke, and FrameFusion: These codebases are used as references for our baseline experiments.
