🤗 Hugging Face | 📑 Paper
- We investigate different design choices for extending the context window of existing VLMs to 128K tokens while maintaining comparable performance on short visual tasks.
- We conduct a comprehensive analysis of the decision-making process to validate the effectiveness of our recipes. Technically, we newly propose M-RoPE++ and hybrid-resolution training to enhance model performance during training and inference.
- On existing long VLM benchmarks, GIRAFFE achieves state-of-the-art performance among open-source long VLMs of similar scale and is competitive with commercial models.
Our model extends Qwen2-VL. For detailed information about the base model, please refer to their repository.
Install the required dependencies:
```bash
pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830 accelerate
pip install "qwen-vl-utils[decord]"
```

To enable the M-RoPE++ and hybrid-resolution features, you have two options:
Option 1: Replace the following files in your local installation (a copy sketch follows below):
- Replace `modeling_qwen2_vl.py` in your local transformers installation with our `models/modeling_qwen2_vl.py`
- Replace `vision_process.py` in your local qwen-vl-utils installation with our `models/vision_process.py`
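
A minimal copy sketch for Option 1. The in-package paths (`models/qwen2_vl/modeling_qwen2_vl.py` inside transformers and `vision_process.py` inside qwen-vl-utils) are assumptions based on the upstream package layouts; verify them against your installation before overwriting:

```python
import shutil
from pathlib import Path

import qwen_vl_utils
import transformers

# Overwrite the stock modeling file in the installed transformers package
# (assumed path; check your transformers version).
tf_target = Path(transformers.__file__).parent / "models" / "qwen2_vl" / "modeling_qwen2_vl.py"
shutil.copy("models/modeling_qwen2_vl.py", tf_target)

# Overwrite the vision preprocessing file in the installed qwen-vl-utils package
# (assumed path; check your qwen-vl-utils version).
utils_target = Path(qwen_vl_utils.__file__).parent / "vision_process.py"
shutil.copy("models/vision_process.py", utils_target)
```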
Option 2: Import our patch files before using the model:
For M-RoPE++:

```python
from mrope_plus_monkey_patch import enable_mrope_plus

# Enable M-RoPE++ before importing the model classes
enable_mrope_plus()

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
```

For hybrid-resolution:

```python
from hybrid_res_monkey_patch import enable_hybrid_resolution

# Enable hybrid-resolution before importing the vision utilities
enable_hybrid_resolution()

from qwen_vl_utils import process_vision_info
```
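
With either option in place, inference follows the standard Qwen2-VL API. A minimal sketch for a long video input, where the extended 128K context matters; the checkpoint id and video path below are placeholders (substitute the released GIRAFFE weights):

```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Placeholder checkpoint id; substitute the released GIRAFFE weights.
MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A long video input; the path is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/long_video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe this video in detail."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the generated answer is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```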
If you find our work useful, please cite:

```bibtex
@misc{li2024giraffedesignchoicesextending,
      title={GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models},
      author={Mukai Li and Lei Li and Shansan Gong and Qi Liu},
      year={2024},
      eprint={2412.12735},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.12735},
}
```
