LLM-Augmented Visual Representation Learning

Official PyTorch implementation for the paper "Multi-Modal Large Language Models are Effective Vision Learners", accepted at WACV 2025.

Abstract

Large language models (LLMs), pre-trained on vast amounts of text, have shown remarkable abilities in understanding general knowledge and commonsense. It is therefore desirable to leverage pre-trained LLMs to help solve computer vision tasks. Previous work on multi-modal LLMs has mainly focused on generation capability. In this work, we propose LLM-augmented visual representation learning (LMVR). Our approach first uses a vision encoder to extract features, which are then projected into the word embedding space of the LLM. The LLM then generates responses based on the visual representation and a text prompt. Finally, we aggregate sequence-level features from the hidden layers of the LLM to obtain image-level representations. We conduct extensive experiments on multiple datasets and have the following findings: (a) LMVR outperforms a traditional vision encoder on various downstream tasks and effectively learns the correspondence between words and image regions; (b) LMVR improves generalizability compared to using a vision encoder alone, as evidenced by its superior resistance to domain shift; (c) LMVR improves the robustness of models to corrupted and perturbed visual data. Our findings demonstrate that LLM-augmented representation learning is effective, as it learns object-level concepts and commonsense knowledge.
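The pipeline described above (vision features → projection into the LLM embedding space → LLM forward pass → aggregation of hidden states) can be sketched in a few lines of plain Python. This is an illustrative toy only: the function names, tensor shapes, and the choice of mean pooling as the aggregation are assumptions for exposition, not the repository's actual implementation.

```python
# Toy sketch of the LMVR representation pipeline (illustrative assumptions,
# not the official code). Vectors are plain Python lists of floats.

def project(features, W):
    """Linearly project vision-encoder features (tokens x d_vis) into the
    LLM word-embedding space using a (d_vis x d_llm) projection matrix W."""
    return [
        [sum(f[k] * W[k][j] for k in range(len(f))) for j in range(len(W[0]))]
        for f in features
    ]

def aggregate(hidden_states):
    """Mean-pool sequence-level hidden states (seq_len x d_llm) from the
    LLM into a single image-level representation vector."""
    n = len(hidden_states)
    d = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / n for j in range(d)]

# Two vision tokens with 2-dim features, projected by a 2x2 matrix,
# then (standing in for the LLM's hidden states) pooled to one vector.
vision_tokens = [[1.0, 0.0], [3.0, 4.0]]
W = [[1.0, 2.0], [3.0, 4.0]]
embedded = project(vision_tokens, W)
image_repr = aggregate(embedded)
```

In the real system the projected tokens are concatenated with the embedded text prompt, run through the frozen or fine-tuned LLM, and the aggregation is applied to the LLM's hidden layers rather than directly to the projected features.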

Requirements

Since our codebase is derived from LLaVA, we follow the same requirements as that project.

1. Install packages:

```shell
conda create -n lmvr python=3.10 -y
conda activate lmvr
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

2. Install additional packages for training:

```shell
pip install ninja
pip install flash-attn --no-build-isolation
```

3. Upgrade to the latest code base:

```shell
git pull
pip uninstall transformers
pip install -e .
```
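After installation, it can be useful to confirm that the key dependencies are importable before launching training. The snippet below is a small convenience check, not part of the repository; the package list is an assumption based on the install steps above.

```python
# Hedged sanity check (not part of the official repo): verify that the
# packages installed above can be found, without actually importing them.
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to whether it is importable."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

status = check_packages(["torch", "transformers", "flash_attn"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

`flash_attn` is only needed for training; a `MISSING` there is harmless if you only run feature extraction.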

Pre-Training

Please follow the instructions here.

Feature Extraction

```shell
sh run_extract.sh
```

Token-Level Visualization

See visualization.ipynb.

Citation

```bibtex
@InProceedings{Sun_2025_WACV,
    author    = {Sun, Li and Ahuja, Chaitanya and Chen, Peng and D'Zmura, Matt and Batmanghelich, Kayhan and Bontrager, Philip},
    title     = {Multi-Modal Large Language Models are Effective Vision Learners},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {8606-8615}
}
```

Reference

LLaVA: https://github.com/haotian-liu/LLaVA
