
A potential bug in ModuleUtilsMixin.get_extended_attention_mask #14859

@jk-jung

Description


Environment info

  • transformers version: 4.13.0
  • Platform:
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.10.0+cu102
  • Tensorflow version (GPU?):
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help

@LysandreJik

Information

Model I am using (Bert, XLNet ...): T5

There is a potential bug in ModuleUtilsMixin.get_extended_attention_mask, and it actually bit me while training a T5 model from scratch. The function masks positions by adding a large negative number (-1e4) to the raw attention scores before the softmax.
However, -1e4 is occasionally not small enough to nullify the scores at masked positions. In my case, some raw scores before the softmax dropped below -1e4 during training, so the model could not be trained correctly.

Here is the code I mentioned: link
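To illustrate the failure mode, here is a minimal sketch with hypothetical score values (not from my actual run):

```python
import torch

# Hypothetical raw attention scores for three key positions; the last one
# is supposed to be masked out.
scores = torch.tensor([-2.0e4, -3.0e4, 0.0])

# get_extended_attention_mask adds -1e4 at masked positions (0 elsewhere).
extended_mask = torch.tensor([0.0, 0.0, -1.0e4])

probs = torch.softmax(scores + extended_mask, dim=-1)
print(probs)  # tensor([0., 0., 1.]) -- the *masked* position gets all the weight
```

Because the legitimate scores are already below -1e4, the masked position ends up with the largest value after the mask is added and receives essentially all of the attention.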

I assume -1e4 was chosen for fp16 compatibility; if so, how about branching on the dtype, along the lines of the sketch below?
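Roughly what I have in mind (the helper name and constants are only illustrative):

```python
import torch

def mask_value(dtype: torch.dtype) -> float:
    # Hypothetical helper: choose the masking constant per dtype instead of
    # hard-coding -1e4 for every precision.
    if dtype == torch.float16:
        return -1e4   # fits comfortably in fp16's range (min ~ -65504)
    return -1e9       # far below any realistic raw score in fp32/fp64

# Sketch of the change inside get_extended_attention_mask:
# extended_attention_mask = (1.0 - extended_attention_mask) * mask_value(self.dtype)
```

Something like torch.finfo(dtype).min could also be used to pick the constant instead of fixed literals.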

To reproduce

Expected behavior

The function get_extended_attention_mask should use a sufficiently small (ideally dtype-dependent) value to mask the tensor, so that masked positions are reliably nullified.
