Description
Environment info
- `transformers` version: 4.13.0
- Platform:
- Python version: 3.8.5
- PyTorch version (GPU?): 1.10.0+cu102
- Tensorflow version (GPU?):
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help
Information
Model I am using (Bert, XLNet ...): T5
There is a potential bug in ModuleUtilsMixin.get_extended_attention_mask, and it actually affected me while training a T5 model from scratch. The function masks a tensor by setting masked positions to a large negative number (-1e4), since the mask is added to the raw attention scores before the softmax.
However, -1e4 is occasionally not small enough to nullify the scores at masked positions. In my case, some of the raw scores before the softmax dropped below -1e4 during training, so the model could not be trained correctly.
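Here is a minimal illustration of the failure mode (not the actual T5 code, just made-up numbers): if a real token's raw score drifts below -1e4 while a padding position that should be masked sits near 0, the additive -1e4 mask no longer dominates.

```python
import torch

# Hypothetical raw attention scores: [real token, padding token].
# The real token's score has drifted below -1e4 during training.
scores = torch.tensor([-2.0e4, 0.0])
# Additive mask as in get_extended_attention_mask: -1e4 on the padding position.
mask = torch.tensor([0.0, -1.0e4])

probs = torch.softmax(scores + mask, dim=-1)
print(probs)  # tensor([0., 1.]) -- nearly all attention flows to the "masked" position
```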
Here is the code I mentioned: link
I think -1e4 is used for fp16 compatibility, so how about choosing the masking value based on the dtype, e.g. along the lines of the sketch below?
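The following is only a rough sketch of that idea, assuming the usual [batch, 1, 1, seq_len] broadcasting; the function names here (`masking_value`, `build_extended_attention_mask`) are illustrative and not the actual transformers API.

```python
import torch

def masking_value(dtype: torch.dtype) -> float:
    # Keep -1e4 when running in fp16 (it stays well inside the fp16 range);
    # otherwise use the most negative value the dtype can represent, so the
    # additive mask always dominates the raw scores.
    if dtype == torch.float16:
        return -1e4
    return torch.finfo(dtype).min

def build_extended_attention_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    # Same broadcasting as get_extended_attention_mask: [batch, 1, 1, seq_len]
    extended = attention_mask[:, None, None, :].to(dtype=dtype)
    return (1.0 - extended) * masking_value(dtype)

# Usage
attn = torch.tensor([[1, 1, 0]])
print(build_extended_attention_mask(attn, torch.float32))
```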
To reproduce
Expected behavior
The function get_extended_attention_mask should use a smaller masking value, so that masked positions are reliably nullified regardless of the magnitude of the raw scores.