VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick*, Li Jing*, Sayan Nag*, Jiachen Zhu, Hardik J Shah, Yann LeCun, Rama Chellappa
Transactions on Machine Learning Research (TMLR), 2023
arxiv | project page
TL;DR: We introduce VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new vision-language pre-training (VLP) paradigm that uses only image-caption data yet achieves fine-grained, region-level image understanding, eliminating the need for expensive bounding-box annotations.
- [October, 2023] We release the first version of the VoLTA codebase.
- [August, 2023] VoLTA is accepted by TMLR.
The contents of this repository are structured as follows:
VoLTA
├── Pre-training
├── Multimodal_Fine_Grained
│   ├── REC
│   ├── LVIS
│   └── COCO_det
└── Multimodal_Coarse_Grained
    ├── VQAv2
    ├── NLVR2
    ├── IRTR
    └── Captioning

conda create -n volta python=3.8.13
conda activate volta
conda install pip
pip install -r requirements.txt

This repository is created and maintained by Shraman. Questions and discussions are welcome via spraman3@jhu.edu.
The codebase for this work is built on the FIBER, GOT, and Barlow Twins repositories. We would like to thank the respective authors for their contributions, and the Meta AI team for discussions and feedback. Shraman Pramanick and Rama Chellappa were partially supported by ONR MURI Grant N00014-20-1-2787.
VoLTA is licensed under the MIT License.
@article{pramanick2023volta,
title={VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment},
author={Pramanick, Shraman and Jing, Li and Nag, Sayan and Zhu, Jiachen and Shah, Hardik and LeCun, Yann and Chellappa, Rama},
journal={Transactions on Machine Learning Research},
year={2023}
}
