VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick*, Li Jing*, Sayan Nag*, Jiachen Zhu, Hardik J Shah, Yann LeCun, Rama Chellappa
Transactions on Machine Learning Research (TMLR), 2023
arxiv | project page
TL;DR: We introduce VoLTA (Vision-Language Transformer with weakly-supervised local-feature Alignment), a new vision-language pre-training (VLP) paradigm that uses only image-caption data yet achieves fine-grained, region-level image understanding, eliminating the need for expensive bounding-box annotations.
- [October, 2023] We release the first version of the VoLTA codebase.
- [August, 2023] VoLTA is accepted by TMLR.
The contents of this repository are structured as follows:
VoLTA
├── Pre-training
├── Multimodal_Fine_Grained
│   ├── REC
│   ├── LVIS
│   └── COCO_det
└── Multimodal_Coarse_Grained
    ├── VQAv2
    ├── NLVR2
    ├── IRTR
    └── Captioning

conda create -n volta python=3.8.13
conda activate volta
conda install pip
pip install -r requirements.txt

This repository is created and maintained by Shraman. Questions and discussions are welcome via spraman3@jhu.edu.
The codebase for this work is built on the FIBER, GOT, and Barlow Twins repositories. We would like to thank the respective authors for their contributions, and the Meta AI team for discussions and feedback. Shraman Pramanick and Rama Chellappa were partially supported by ONR MURI Grant N00014-20-1-2787.
VoLTA is licensed under the MIT License.
@article{pramanick2023volta,
title={VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment},
author={Pramanick, Shraman and Jing, Li and Nag, Sayan and Zhu, Jiachen and Shah, Hardik and LeCun, Yann and Chellappa, Rama},
journal={Transactions on Machine Learning Research},
year={2023}
}
