MegatronBertForMaskedLM #16638

@kaushalshetty

Description

@kaushalshetty

Environment info

  • transformers version: 4.12.5
  • Platform: Linux
  • Python version: 3.6
  • PyTorch version (GPU?): 1.10.0+cu102
  • Tensorflow version (GPU?): 2.6
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@LysandreJik @stas00

Information

Model I am using: MegatronBERT from https://huggingface.co/nvidia/megatron-bert-uncased-345m. The model loads correctly in the following way:

from transformers import BertTokenizer, MegatronBertModel
model = MegatronBertModel.from_pretrained("megatron_model_here")

but MegatronBertForMaskedLM throws a RuntimeError for a size mismatch.

To reproduce

from transformers import MegatronBertForMaskedLM
model = MegatronBertForMaskedLM.from_pretrained("megatron_model_here")

Error:

RuntimeError: Error(s) in loading state_dict for MegatronBertForMaskedLM:
	size mismatch for cls.predictions.bias: copying a param with shape torch.Size([30592]) from checkpoint, the shape in current model is torch.Size([29056])
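The two shapes in the error look like Megatron-style padded vocabulary sizes: Megatron-LM rounds the tokenizer vocabulary up to a multiple of 128 so the embedding can be sharded evenly. Under that assumption, 30592 in the checkpoint is the BERT-uncased vocabulary (30522) padded to a multiple of 128, while 29056 in the instantiated model is the BERT-cased vocabulary (28996) padded the same way, i.e. the config's `vocab_size` does not match the checkpoint's MLM head. A minimal sketch of the arithmetic (the helper name is illustrative, not part of transformers):

```python
def padded_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple (Megatron-LM-style padding)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

# BERT-uncased vocab pads to the checkpoint's reported shape:
print(padded_vocab_size(30522))  # 30592
# BERT-cased vocab pads to the shape the current model expects:
print(padded_vocab_size(28996))  # 29056
```

If this reading is right, checking the `vocab_size` in the model's config.json against the checkpoint's embedding shape would confirm the mismatch.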

Expected behavior

The MaskedLM model loads without a size-mismatch error, just as MegatronBertModel does.
