Description
Environment info
- `transformers` version: 4.13.0
- Platform:
- Python version: 3.8.5
- PyTorch version (GPU?): 1.10.0+cu102
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
Information
Model I am using (Bert, XLNet ...): T5
There is a potential bug in ModuleUtilsMixin.get_extended_attention_mask, and it actually affected me while training a T5 model from scratch. The function masks a tensor by setting masked positions to a large negative number (-1e4), since the mask is added to the raw attention scores before the softmax.
However, -1e4 is occasionally not small enough to nullify the scores at masked positions. In my case, some of the raw scores before the softmax dropped below -1e4 during training, so the model could not be trained correctly.
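Here is a minimal illustration of the failure mode (not the actual T5 code, just made-up numbers): if a real token's raw score drifts below -1e4 while a padding position that should be masked sits near 0, the additive -1e4 mask no longer dominates.

```python
import torch

# Hypothetical raw attention scores: [real token, padding token].
# The real token's score has drifted below -1e4 during training.
scores = torch.tensor([-2.0e4, 0.0])
# Additive mask as in get_extended_attention_mask: -1e4 on the padding position.
mask = torch.tensor([0.0, -1.0e4])

probs = torch.softmax(scores + mask, dim=-1)
print(probs)  # tensor([0., 1.]) -- nearly all attention flows to the "masked" position
```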
Here is the code I mentioned: link
I think -1e4 is used for fp16 compatibility, so how about choosing the masking value based on the dtype, e.g. along the lines of the sketch below?
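The following is only a rough sketch of that idea, assuming the usual [batch, 1, 1, seq_len] broadcasting; the function names here (`masking_value`, `build_extended_attention_mask`) are illustrative and not the actual transformers API.

```python
import torch

def masking_value(dtype: torch.dtype) -> float:
    # Keep -1e4 when running in fp16 (it stays well inside the fp16 range);
    # otherwise use the most negative value the dtype can represent, so the
    # additive mask always dominates the raw scores.
    if dtype == torch.float16:
        return -1e4
    return torch.finfo(dtype).min

def build_extended_attention_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Same broadcasting as get_extended_attention_mask: [batch, 1, 1, seq_len]
    extended = attention_mask[:, None, None, :].to(dtype=dtype)
    return (1.0 - extended) * masking_value(dtype)

# Usage
attn = torch.tensor([[1, 1, 0]])
print(build_extended_attention_mask(attn, torch.float32))
```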
To reproduce
Expected behavior
The function get_extended_attention_mask should use a smaller masking value, so that masked positions are reliably nullified regardless of the magnitude of the raw scores.