Clarifying attention mask #542

Description

@hadsed

I don't quite understand the attention mask in the way that it's implemented.

Here is the relevant line: https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L312

...
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# Apply the attention mask (precomputed for all layers in the BertModel forward() function)
attention_scores = attention_scores + attention_mask

# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
...

So it seems the proper way to use attention_mask is to set the positions you want to keep to 1's, and positions you want to mask out to 0's.

I'm curious why we don't simply multiply instead of adding and then normalizing. Is it for numerical stability reasons?
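
In case it helps frame the question, here's a small sketch of what I think happens upstream: BertModel.forward() converts the user-facing 1/0 mask into an additive mask (0.0 for kept positions, -10000.0 for masked ones) before it reaches the line above. The toy numbers below are just for illustration, not taken from the library:

import torch
import torch.nn as nn

# Toy scores for one query over 4 key positions; the last position should be masked out.
attention_scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])

# User-facing mask: 1 = keep, 0 = mask out.
attention_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])

# BertModel.forward() turns this into an additive mask before self-attention:
# 0.0 where we keep, -10000.0 where we mask.
extended_mask = (1.0 - attention_mask) * -10000.0

# Additive masking: the masked score becomes ~-10000, so softmax drives its probability to ~0.
probs_add = nn.Softmax(dim=-1)(attention_scores + extended_mask)

# Naive multiplicative masking of the raw scores: the masked score becomes 0,
# which softmax still treats as a legitimate score (exp(0) = 1).
probs_mul = nn.Softmax(dim=-1)(attention_scores * attention_mask)

print(probs_add)  # masked position gets ~0 probability
print(probs_mul)  # masked position still gets non-negligible probability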
