I don't quite understand the attention mask in the way that it's implemented.
Here are the relevant lines, from https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L312:
...
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
# Apply the attention mask is (precomputed for all layers in BertModel forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
...
So it seems the proper way to use attention_mask is to set the positions you want to keep to 1s and the positions you want to mask out to 0s.
I'm curious why we don't simply multiply by the mask instead of adding it and then normalizing. Is it for numerical stability reasons?
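For concreteness, here is a minimal sketch of the two approaches as I understand them (this assumes the mask is first converted to additive form in BertModel.forward() as (1.0 - mask) * -10000.0; please correct me if that assumption is wrong):

```python
import torch
import torch.nn as nn

# Toy scores for a single query attending over 4 key positions;
# the last position is padding and should be masked out.
attention_scores = torch.tensor([[2.0, 1.0, 0.5, 3.0]])
attention_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0]])  # 1 = keep, 0 = mask

# What BertModel.forward() appears to precompute: an additive mask that is
# 0 for kept positions and a large negative number for masked ones.
additive_mask = (1.0 - attention_mask) * -10000.0

# Current implementation: add the mask, then softmax.
probs_add = nn.Softmax(dim=-1)(attention_scores + additive_mask)

# The multiplicative alternative I was asking about: softmax, zero out, renormalize.
probs_mul = nn.Softmax(dim=-1)(attention_scores) * attention_mask
probs_mul = probs_mul / probs_mul.sum(dim=-1, keepdim=True)

print(probs_add)  # masked position gets essentially 0 probability
print(probs_mul)  # effectively the same distribution, but needs an extra normalization pass
```

Both give (essentially) the same probabilities in this toy case, so my question is really about why the additive form was chosen over the multiply-and-renormalize form.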