🌟 New model addition
Model description
DeBERTa (Decoding-enhanced BERT with disentangled attention) is a new model architecture:
> In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pre-training and performance of downstream tasks.
The paper can be found here.
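To make the disentangled attention idea above concrete, here is a minimal single-head NumPy sketch of the three-term score (content-to-content, content-to-position, and position-to-content), not the authors' implementation: the function and projection names, the dense `(n, n, d)` relative-position tensor, and the `sqrt(3d)` scaling are illustrative assumptions based on the paper's description.

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def disentangled_attention(H, P, Wq, Wk, Wqr, Wkr):
    """Sketch of DeBERTa-style disentangled attention (illustrative only).

    H: (n, d) content vectors, one per token.
    P: (n, n, d) relative-position embeddings; P[i, j] embeds the
       relative distance from token i to token j.
    W*: (d, d) projection matrices (names are assumptions).
    """
    n, d = H.shape
    Qc, Kc = H @ Wq, H @ Wk            # content queries / keys
    Qr, Kr = P @ Wqr, P @ Wkr          # position queries / keys, (n, n, d)

    c2c = Qc @ Kc.T                                # content-to-content
    c2p = np.einsum('id,ijd->ij', Qc, Kr)          # content-to-position
    p2c = np.einsum('jid,jd->ij', Qr, Kc)          # position-to-content

    # The paper scales by sqrt(3d) since three score terms are summed.
    return softmax((c2c + c2p + p2c) / np.sqrt(3 * d))

rng = np.random.default_rng(0)
n, d = 4, 8
H = rng.standard_normal((n, d))
P = rng.standard_normal((n, n, d))
Wq, Wk, Wqr, Wkr = (rng.standard_normal((d, d)) for _ in range(4))
A = disentangled_attention(H, P, Wq, Wk, Wqr, Wkr)
```

The point of the split is that content and position contribute separate, learnable interaction terms rather than being summed into one input embedding as in BERT.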
Open source status
- the model implementation is available: GitHub
- the model weights are available: GitHub release
- who are the authors: @BigBird01