... SDPA causal mask generation may be wrong.
transformers/src/transformers/modeling_attn_mask_utils.py
Lines 421 to 433 in 76fa17c
```python
if torch.all(mask == 1):
    if is_tracing:
        pass
    elif tgt_len == 1:
        # For query_length == 1, causal attention and bi-directional attention are the same.
        return None
    elif key_value_length == tgt_len:
        return None
    else:
        # Unfortunately, for query_length > 1 and key_value_length != query_length, we can not generally ignore the attention mask, as SDPA causal mask generation
        # may be wrong. We will set is_causal=False in SDPA and rely on Transformers attention_mask instead, hence not setting it to None here.
        # Reference: https://github.com/pytorch/pytorch/issues/108108
        return AttentionMaskConverter._expand_mask(mask=mask, dtype=dtype, tgt_len=tgt_len)
```
Would it be safe to just return `None` in the `else:` case as well? For causal attention, we can just use `_prepare_4d_causal_attention_mask_for_sdpa`.
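For context, here is a minimal sketch (my own, not from the repository) of the alignment mismatch the code comment refers to: with `is_causal=True`, SDPA derives its mask as `torch.ones(L, S).tril(diagonal=0)`, i.e. anchored to the top-left, while decoding with a KV cache needs a bottom-right-anchored mask so that each new query token can see the whole cached past plus itself:

```python
import torch

q_len, kv_len = 2, 5  # e.g. 2 new query tokens attending over a 5-entry KV cache

# Mask SDPA derives internally from is_causal=True (top-left aligned):
sdpa_causal = torch.ones(q_len, kv_len, dtype=torch.bool).tril(diagonal=0)

# Mask that cached causal decoding actually needs (bottom-right aligned):
needed_causal = torch.ones(q_len, kv_len, dtype=torch.bool).tril(diagonal=kv_len - q_len)

print(sdpa_causal)
# tensor([[ True, False, False, False, False],
#         [ True,  True, False, False, False]])
print(needed_causal)
# tensor([[ True,  True,  True,  True, False],
#         [ True,  True,  True,  True,  True]])
```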
Related issues:
- pytorch/pytorch#108108
- Dao-AILab/flash-attention@9e5e8bc
- #28802
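The pytorch/pytorch#108108 link above tracks exactly this alignment behavior. As a sanity check (again my own sketch, with made-up tensor shapes), SDPA under `is_causal=True` does not reproduce an explicit bottom-right-aligned mask when `query_length != key_value_length`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 2, 8)  # (batch, heads, q_len=2, head_dim)
k = torch.randn(1, 1, 5, 8)  # kv_len=5, as if 3 tokens were already cached
v = torch.randn(1, 1, 5, 8)

# Bottom-right-aligned boolean mask (True = attend), what cached decoding needs:
mask = torch.ones(2, 5, dtype=torch.bool).tril(diagonal=5 - 2)

out_explicit = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
out_is_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# The outputs differ because is_causal=True anchors its mask to the top-left:
print(torch.allclose(out_explicit, out_is_causal))  # False
```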