🤗 Hugging Face | 📑 Paper
- We investigate different design choices for extending the context window of existing VLMs to 128K tokens while maintaining comparable performance on short visual tasks.
- We conduct a comprehensive analysis of the decision-making process to validate the effectiveness of our recipes. Technically, we newly propose M-RoPE++ and hybrid-resolution training to enhance model performance during training and inference.
- On existing long VLM benchmarks, GIRAFFE achieves state-of-the-art performance among open-source long VLMs of similar scale and is competitive with commercial models.
Our model extends Qwen2-VL. For detailed information about the base model, please refer to their repository.
Install the required dependencies:
```bash
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate
pip install "qwen-vl-utils[decord]"
```

To enable the M-RoPE++ and hybrid-resolution features, you have two options:
Option 1: Replace the following files in your local installation (a copy sketch follows below):
- Replace `modeling_qwen2_vl.py` in your local transformers installation with our `models/modeling_qwen2_vl.py`
- Replace `vision_process.py` in your local qwen-vl-utils installation with our `models/vision_process.py`
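
A minimal copy sketch for Option 1. The in-package paths (`models/qwen2_vl/modeling_qwen2_vl.py` inside transformers and `vision_process.py` inside qwen-vl-utils) are assumptions based on the upstream package layouts; verify them against your installation before overwriting:

```python
import shutil
from pathlib import Path

import qwen_vl_utils
import transformers

# Overwrite the stock modeling file in the installed transformers package
# (assumed path; check your transformers version).
tf_target = Path(transformers.__file__).parent / "models" / "qwen2_vl" / "modeling_qwen2_vl.py"
shutil.copy("models/modeling_qwen2_vl.py", tf_target)

# Overwrite the vision preprocessing file in the installed qwen-vl-utils package
# (assumed path; check your qwen-vl-utils version).
utils_target = Path(qwen_vl_utils.__file__).parent / "vision_process.py"
shutil.copy("models/vision_process.py", utils_target)
```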
Option 2: Import our patch files before using the model:
For M-RoPE++:

```python
from mrope_plus_monkey_patch import enable_mrope_plus

# Enable M-RoPE++ before importing the model classes
enable_mrope_plus()

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
```

For hybrid-resolution:

```python
from hybrid_res_monkey_patch import enable_hybrid_resolution

# Enable hybrid-resolution before importing the vision utilities
enable_hybrid_resolution()

from qwen_vl_utils import process_vision_info
```
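
With either option in place, inference follows the standard Qwen2-VL API. A minimal sketch for a long video input, where the extended 128K context matters; the checkpoint id and video path below are placeholders (substitute the released GIRAFFE weights):

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder checkpoint id; substitute the released GIRAFFE weights.
MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A long video input; the path is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/long_video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the generated answer is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```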
If you find our work useful, please cite:

```bibtex
@misc{li2024giraffedesignchoicesextending,
      title={GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models},
      author={Mukai Li and Lei Li and Shansan Gong and Qi Liu},
      year={2024},
      eprint={2412.12735},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.12735},
}
```
