Add an attention bias subclass for a lower right causal masking #114823
drisspg wants to merge 26 commits into gh/drisspg/12/base
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/114823
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 80b03db with merge base 597d3fb:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Results from benchmarking:

```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Type    | Speedup            | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype          | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.4320973389745877 |            |           |           |           |           |                |          |
| Max     | 2.577393980674173  | 128        | 32        | 512       | 4097      | 2048      | torch.bfloat16 | 64       |
| Min     | 0.942474845104863  | 1          | 16        | 256       | 416       | 2048      | torch.bfloat16 | 128      |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```

[ghstack-poisoned]
Results from benchmarking: I improved the meff_attention perf, hence the slightly decreased max speedup.

```Shell
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Type    | Speedup            | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype          | head_dim |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
| Average | 1.2388050062214226 |            |           |           |           |           |                |          |
| Max     | 1.831672915579016  | 128        | 32        | 1024      | 2048      | 2048      | torch.bfloat16 | 64       |
| Min     | 0.9430534166730135 | 1          | 16        | 256       | 416       | 2048      | torch.bfloat16 | 128      |
+---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+
```

[ghstack-poisoned]
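For context on how per-configuration speedups like the ones above might be produced, below is a rough sketch of a single-configuration measurement (batch 128, 32 heads, q_seq_len 512, k_seq_len 4097, head_dim 64). It is not the benchmark script behind these tables; it compares sdpa with a fully materialized causal mask against the fused `is_causal` path as a stand-in for the new bias subclass.

```Python
# Rough sketch only: not the benchmark script used for the table above.
import torch
import torch.nn.functional as F
from torch.utils import benchmark

device, dtype = "cuda", torch.bfloat16
b, h, q_len, kv_len, head_dim = 128, 32, 512, 4097, 64
q = torch.randn(b, h, q_len, head_dim, device=device, dtype=dtype)
k = torch.randn(b, h, kv_len, head_dim, device=device, dtype=dtype)
v = torch.randn(b, h, kv_len, head_dim, device=device, dtype=dtype)

# Baseline: a fully realized lower-right causal boolean mask passed via attn_mask.
mask = torch.tril(
    torch.ones(q_len, kv_len, device=device, dtype=torch.bool),
    diagonal=kv_len - q_len,
)

baseline = benchmark.Timer(
    stmt="F.scaled_dot_product_attention(q, k, v, attn_mask=mask)",
    globals={"F": F, "q": q, "k": k, "v": v, "mask": mask},
).blocked_autorange()

# Proxy for the new bias subclass: let the kernel do causal masking natively.
# Note: depending on the backend, is_causal may correspond to the upper_left
# variant, so this is only an approximation of the lower_right path.
fused = benchmark.Timer(
    stmt="F.scaled_dot_product_attention(q, k, v, is_causal=True)",
    globals={"F": F, "q": q, "k": k, "v": v},
).blocked_autorange()

print(f"speedup ~ {baseline.median / fused.median:.2f}x")
```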
…sking" # Summary This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements. The flag was introduced when the early on in the kernel development and at the time it was implicitly meant to "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed break down see: #108108. The kernels default behavior has since changed, largely due to the rise of autogressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking. The larger theme though is laid here: #110681. The thesis being that there is alot of innovation in SDPA revolving around the attention_bias being used. This is the first in hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window` which is used by the popular mistral model family. Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf. ```Shell +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype | head_dim | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Average | 1.2388050062214226 | | | | | | | | | Max | 1.831672915579016 | 128 | 32 | 1024 | 2048 | 2048 | torch.bfloat16 | 64 | | Min | 0.9430534166730135 | 1 | 16 | 256 | 416 | 2048 | torch.bfloat16 | 128 | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ ``` [ghstack-poisoned]
…sking" # Summary This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements. The flag was introduced when the early on in the kernel development and at the time it was implicitly meant to "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed break down see: #108108. The kernels default behavior has since changed, largely due to the rise of autogressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking. The larger theme though is laid here: #110681. The thesis being that there is alot of innovation in SDPA revolving around the attention_bias being used. This is the first in hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window` which is used by the popular mistral model family. Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf. ```Shell +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype | head_dim | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Average | 1.2388050062214226 | | | | | | | | | Max | 1.831672915579016 | 128 | 32 | 1024 | 2048 | 2048 | torch.bfloat16 | 64 | | Min | 0.9430534166730135 | 1 | 16 | 256 | 416 | 2048 | torch.bfloat16 | 128 | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ ``` [ghstack-poisoned]
…sking" # Summary This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements. The flag was introduced when the early on in the kernel development and at the time it was implicitly meant to "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed break down see: #108108. The kernels default behavior has since changed, largely due to the rise of autogressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking. The larger theme though is laid here: #110681. The thesis being that there is alot of innovation in SDPA revolving around the attention_bias being used. This is the first in hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window` which is used by the popular mistral model family. Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf. ```Shell +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype | head_dim | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Average | 1.2388050062214226 | | | | | | | | | Max | 1.831672915579016 | 128 | 32 | 1024 | 2048 | 2048 | torch.bfloat16 | 64 | | Min | 0.9430534166730135 | 1 | 16 | 256 | 416 | 2048 | torch.bfloat16 | 128 | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ ``` [ghstack-poisoned]
…sking" # Summary This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements. The flag was introduced when the early on in the kernel development and at the time it was implicitly meant to "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed break down see: #108108. The kernels default behavior has since changed, largely due to the rise of autogressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking. The larger theme though is laid here: #110681. The thesis being that there is alot of innovation in SDPA revolving around the attention_bias being used. This is the first in hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window` which is used by the popular mistral model family. Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf. ```Shell +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Type | Speedup | batch_size | num_heads | q_seq_len | k_seq_len | embed_dim | dtype | head_dim | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ | Average | 1.2388050062214226 | | | | | | | | | Max | 1.831672915579016 | 128 | 32 | 1024 | 2048 | 2048 | torch.bfloat16 | 64 | | Min | 0.9430534166730135 | 1 | 16 | 256 | 416 | 2048 | torch.bfloat16 | 128 | +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ ``` [ghstack-poisoned]
@pytorchbot merge
Merge failed
Reason: This PR needs a … label. If not, please add the … label. To add a label, you can comment to pytorchbot, for example … For more information, see …
Details for Dev Infra team: raised by workflow job
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Add an attention bias subclass for a lower right causal masking (pytorch#114823)
Pull Request resolved: pytorch#114823
Approved by: https://github.com/cpuhrsch
Summary
This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do causal masking without needing to actually create the "realized" attention bias and pass it into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain (the kernels only need to iterate over ~0.5x the sequence), and for very large sequence lengths it can provide very large memory improvements.

The flag was introduced early on in the kernel development, and at the time it implicitly meant "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed breakdown see: #108108. The kernels' default behavior has since changed, largely due to the rise of autoregressive text generation, and unfortunately following that change for `is_causal` would be a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking.

The larger theme is laid out in #110681: the thesis is that there is a lot of innovation in SDPA revolving around the attention_bias being used. This is the first of hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window`, which is used by the popular Mistral model family. A minimal sketch illustrating the upper_left vs lower_right distinction with realized masks is included at the end of this description.

Stack from ghstack (oldest at bottom):
Results from benchmarking: I improved the meff_attention perf, hence the slightly decreased max speedup.
cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki
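To make the upper_left vs lower_right distinction concrete, here is a minimal sketch that materializes both causal variants as boolean masks for a non-square attention bias and passes them to sdpa via `attn_mask`. It uses only plain torch ops; the subclass's actual name and factory functions are not named in this description, so nothing below should be read as the new API.

```Python
# Minimal sketch of the two causal variants, assuming nothing about the new
# subclass's name or constructors; only public torch APIs are used.
import torch
import torch.nn.functional as F

q_len, kv_len = 2, 5  # non-square: the distinction only matters here

# upper_left: query i attends to keys j <= i (mask anchored at the top-left corner).
upper_left = torch.tril(torch.ones(q_len, kv_len, dtype=torch.bool))
# lower_right: the diagonal is shifted so the last query attends to the last key,
# matching autoregressive decoding where the queries are the newest tokens.
lower_right = torch.tril(
    torch.ones(q_len, kv_len, dtype=torch.bool), diagonal=kv_len - q_len
)

print(torch.equal(upper_left, lower_right))  # False: the variants differ when non-square

q = torch.randn(1, 8, q_len, 64)
k = torch.randn(1, 8, kv_len, 64)
v = torch.randn(1, 8, kv_len, 64)

# Passing the realized boolean masks; the subclass in this PR aims to express the
# lower_right variant without materializing a tensor like this.
out_upper_left = F.scaled_dot_product_attention(q, k, v, attn_mask=upper_left)
out_lower_right = F.scaled_dot_product_attention(q, k, v, attn_mask=lower_right)
```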