### 🚀 The feature

1. Adding the ViT architecture from this paper: "[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)"
2. Adding the DeiT architecture from this paper: "[Training data-efficient image transformers & distillation through attention](https://arxiv.org/abs/2012.12877)"

@fmassa @datumbox @mannatsingh @kazhang
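For reference, here is a minimal sketch of the core ViT computation (patchify into tokens, prepend a class token, add position embeddings, run a plain Transformer encoder, classify from the class token). This is only an illustration in plain PyTorch, not the proposed torchvision implementation; the class name `TinyViT` and all hyperparameters below are made up.

```python
import torch
from torch import nn


class TinyViT(nn.Module):
    """Illustrative ViT: patch embedding + Transformer encoder + linear head."""

    def __init__(self, image_size=224, patch_size=16, dim=192, depth=4,
                 heads=3, mlp_dim=768, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # "Patchify" stem: one strided conv is equivalent to cutting the image into
        # non-overlapping patches and linearly projecting each of them.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                    # x: (B, 3, H, W)
        x = self.patch_embed(x)              # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)     # (B, N, dim) sequence of patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])            # classify from the class token


logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 1000])
```

DeiT keeps essentially this architecture and mainly changes the training recipe, adding a distillation token alongside the class token.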
### Motivation, pitch
Vision Transformer models should exist in the torchvision repo because they are strong, widely used models :)
I'm currently working on this project.
### Additional context
We can also consider adding some techniques from the following papers ^^
For example, adding a convolutional stem for ViT; see details in "[Early Convolutions Help Transformers See Better](https://arxiv.org/abs/2106.14881)".
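To make that concrete, here is a rough sketch of such a stem under my reading of that paper (not code from any of the repos below): the single stride-16 patchify convolution is replaced by a stack of stride-2 3x3 conv/BN/ReLU blocks ending in a 1x1 projection, so the output still has one token per 16x16 input region and the rest of the ViT is unchanged.

```python
import torch
from torch import nn


def conv_stem(dim=192):
    """Sketch of an 'early convolutions' stem: four 3x3 stride-2 convs
    (overall stride 16, like a 16x16 patchify) followed by a 1x1 projection."""
    widths = [3, dim // 8, dim // 4, dim // 2, dim]
    layers = []
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
                   nn.BatchNorm2d(c_out),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(dim, dim, kernel_size=1))  # final 1x1 projection to token dim
    return nn.Sequential(*layers)


# Same output resolution as a 16x16 patchify stem (H/16 x W/16), so it can be
# swapped in for the patch-embedding conv without touching the encoder.
tokens = conv_stem(192)(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 192, 14, 14])
```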
References:
https://github.com/google-research/vision_transformer
https://github.com/facebookresearch/deit
https://github.com/facebookresearch/ClassyVision/blob/main/classy_vision/models/vision_transformer.py
cc @datumbox