LLM-Augmented Visual Representation Learning

Official PyTorch implementation for the paper "Multi-Modal Large Language Models are Effective Vision Learners", accepted at WACV 2025.

Abstract

Large language models (LLMs), pre-trained on vast amounts of text, have shown remarkable abilities in understanding general knowledge and commonsense. It is therefore desirable to leverage pre-trained LLMs to help solve computer vision tasks. Previous work on multi-modal LLMs has mainly focused on generation capability. In this work, we propose LLM-augmented visual representation learning (LMVR). Our approach first uses a vision encoder to extract features, which are then projected into the word embedding space of the LLM. The LLM then generates responses based on the visual representation and a text prompt. Finally, we aggregate sequence-level features from the hidden layers of the LLM to obtain image-level representations. We conduct extensive experiments on multiple datasets and have the following findings: (a) LMVR outperforms a traditional vision encoder on various downstream tasks and effectively learns the correspondence between words and image regions; (b) LMVR improves generalizability compared to using a vision encoder alone, as evidenced by its superior resistance to domain shift; (c) LMVR improves the robustness of models to corrupted and perturbed visual data. Our findings demonstrate that LLM-augmented representation learning is effective, as it learns object-level concepts and commonsense knowledge.
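The pipeline described above (vision features → projection into the LLM embedding space → LLM forward pass → aggregation of hidden states) can be sketched in a few lines of plain Python. This is an illustrative toy only: the function names, tensor shapes, and the choice of mean pooling as the aggregation are assumptions for exposition, not the repository's actual implementation.

```python
# Toy sketch of the LMVR representation pipeline (illustrative assumptions,
# not the official code). Vectors are plain Python lists of floats.

def project(features, W):
    """Linearly project vision-encoder features (tokens x d_vis) into the
    LLM word-embedding space using a (d_vis x d_llm) projection matrix W."""
    return [
        [sum(f[k] * W[k][j] for k in range(len(f))) for j in range(len(W[0]))]
        for f in features
    ]

def aggregate(hidden_states):
    """Mean-pool sequence-level hidden states (seq_len x d_llm) from the
    LLM into a single image-level representation vector."""
    n = len(hidden_states)
    d = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / n for j in range(d)]

# Two vision tokens with 2-dim features, projected by a 2x2 matrix,
# then (standing in for the LLM's hidden states) pooled to one vector.
vision_tokens = [[1.0, 0.0], [3.0, 4.0]]
W = [[1.0, 2.0], [3.0, 4.0]]
embedded = project(vision_tokens, W)
image_repr = aggregate(embedded)
```

In the real system the projected tokens are concatenated with the embedded text prompt, run through the frozen or fine-tuned LLM, and the aggregation is applied to the LLM's hidden layers rather than directly to the projected features.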

Requirements

Since our codebase is derived from LLaVA, we follow the same requirements as that project.

1. Install packages:

```shell
conda create -n lmvr python=3.10 -y
conda activate lmvr
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

2. Install additional packages for training:

```shell
pip install ninja
pip install flash-attn --no-build-isolation
```

3. Upgrade to the latest code base:

```shell
git pull
pip uninstall transformers
pip install -e .
```
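After installation, it can be useful to confirm that the key dependencies are importable before launching training. The snippet below is a small convenience check, not part of the repository; the package list is an assumption based on the install steps above.

```python
# Hedged sanity check (not part of the official repo): verify that the
# packages installed above can be found, without actually importing them.
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to whether it is importable."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

status = check_packages(["torch", "transformers", "flash_attn"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

`flash_attn` is only needed for training; a `MISSING` there is harmless if you only run feature extraction.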

Pre-Training

Please follow the instructions here.

Feature Extraction

```shell
sh run_extract.sh
```

Token-Level Visualization

See visualization.ipynb.

Citation

```bibtex
@InProceedings{Sun_2025_WACV,
    author    = {Sun, Li and Ahuja, Chaitanya and Chen, Peng and D'Zmura, Matt and Batmanghelich, Kayhan and Bontrager, Philip},
    title     = {Multi-Modal Large Language Models are Effective Vision Learners},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {8606-8615}
}
```

Reference

LLaVA: https://github.com/haotian-liu/LLaVA
