### 🚀 The feature

1. Adding the ViT architecture from this paper: "[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)"
2. Adding the DeiT architecture from this paper: "[Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877)"

@fmassa @datumbox @mannatsingh @kazhang
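For reference, here is a minimal sketch of the core ViT computation (patchify into tokens, prepend a class token, add position embeddings, run a plain Transformer encoder, classify from the class token). This is only an illustration in plain PyTorch, not the proposed torchvision implementation; the class name `TinyViT` and all hyperparameters below are made up.

```python
import torch
from torch import nn


class TinyViT(nn.Module):
    """Illustrative ViT: patch embedding + Transformer encoder + linear head."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, mlp_dim=768, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # "Patchify" stem: one strided conv is equivalent to cutting the image into
        # non-overlapping patches and linearly projecting each of them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.patch_embed(x)              # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, dim) sequence of patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])            # classify from the class token


logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

DeiT keeps essentially this architecture and mainly changes the training recipe, adding a distillation token alongside the class token.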
### Motivation, pitch
Vision Transformer models should exist in the torchvision repo because they are strong, widely used models :)
I'm currently working on this project.
### Additional context
We can also consider adding some techniques from the following papers ^^
For example, adding a convolutional stem for ViT; see details in "[Early Convolutions Help Transformers See Better](https://arxiv.org/abs/2106.14881)".
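To make that concrete, here is a rough sketch of such a stem under my reading of that paper (not code from any of the repos below): the single stride-16 patchify convolution is replaced by a stack of stride-2 3x3 conv/BN/ReLU blocks ending in a 1x1 projection, so the output still has one token per 16x16 input region and the rest of the ViT is unchanged.

```python
import torch
from torch import nn


def conv_stem(dim=192):
    """Sketch of an 'early convolutions' stem: four 3x3 stride-2 convs
    (overall stride 16, like a 16x16 patchify) followed by a 1x1 projection."""
    widths = [3, dim // 8, dim // 4, dim // 2, dim]
    layers = []
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
                   nn.BatchNorm2d(c_out),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(dim, dim, kernel_size=1))  # final 1x1 projection to token dim
    return nn.Sequential(*layers)


# Same output resolution as a 16x16 patchify stem (H/16 x W/16), so it can be
# swapped in for the patch-embedding conv without touching the encoder.
tokens = conv_stem(192)(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 192, 14, 14])
```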
References:
https://github.com/google-research/vision_transformer
https://github.com/facebookresearch/deit
https://github.com/facebookresearch/ClassyVision/blob/main/classy_vision/models/vision_transformer.py
cc @datumbox