Create fastpath backend context manager, similar to SDPA kernel backend manager #107163

mikekgfb wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/107163
Note: Links to docs will display an error until the doc builds have completed. ✅ No failures as of commit 47d7333 with merge base 1d95644. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
This pull request was exported from Phabricator. Differential Revision: D48325593
Hey @mikekgfb, I haven't done an in-depth review of this, but I have two broad questions.

I take it that the intent is for this context manager to be able to disable the fastpaths for nn.MHA and nn.Transformer. However, while the flags in the context manager for SDPA toggle between the three backends that implement SDPA (math, mem-efficient, and flash), the flags in this context manager appear to be a bit more nuanced.

1. Is the term ATFP widely used to refer to the combination of nn.MHA/nn.Transformer? I am curious about the rationale for combining these into one single context manager, as well as whether the naming will make this discoverable.
2. I wanted to understand the meanings of the kwargs to this context manager. Granted, this is probably intended as a tool for power users who have good context on this, but I think it is important that we clearly establish the rationale/use cases for each of the arguments to understand the design here:
   - `math`: mirrors the context manager for SDPA
   - `enable_nested_tensor`: gives the ability to override at runtime the `TransformerEncoder(enable_nested_tensor)` flag given at construction time
   - `enable_mha`: gives the ability to disable the MHA sparsity fast path at runtime (doesn't have a corresponding flag in the MHA constructor)
   - `enable_encoder`: disables the sparsity fast path for `TransformerEncoderLayer` at runtime (the naming seems like it might be confusing here)
What is the rationale for including both flags that override constructor arguments of certain modules and flags that directly disable the fast path for certain modules?
All flags disable the mode dynamically, starting and ending at the scope of the context manager, regardless of any other settings. In some ways this is similar to how the SDPA context manager works: it enables or disables specific kernels, but those kernels can also be unavailable on some hardware, etc. So the context manager is one of many decision criteria, but it's a surefire way to disable a path. (The context-manager check is always done in forward, so its operation is completely dynamic based on scoping, similar to the SDPA context manager. You can, for example, build a model with enable_nested_tensor=True that can run with nested tensors, but also turn that behavior off without rebuilding the model by using the context manager.)

Reasons why users might want to disable the fastpath:

1. It's broken (hopefully not).
2. Performance is not good.
3. Numerical equivalence is important (this was the original starting point for building this; see #106668).
4. Users don't want the fastpath for other reasons, e.g., the recent #106824.

We can keep adding additional conditions for each case, or just have a framework for user control.

The naming is an interesting question. "fp" stands for the (inference) "fastpath". We have captured all these features as Accelerated Transformer (previously known as Better Transformer), so that's where the "at" comes from, to make allowance for the fact that "fastpath" alone isn't super descriptive. I think the flags fit reasonably well together and give users control over what, end to end, consists of interlocking features. I can see other ways of dividing them, or I could also see the case for creating it as …

PS: it's very deliberately modeled after https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html to give users cognitive familiarity rather than introducing a new way of handling this.
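The dynamic-scoping behavior described above can be sketched in plain Python. This is a hypothetical mock, not the PR's actual implementation: the names (`fastpath`, `_flags`, `Encoder`) are illustrative, but it shows the key property that a construction-time flag (`enable_nested_tensor=True`) can be vetoed at runtime because the check happens in `forward`, scoped to the `with` block.

```python
# Hypothetical sketch of a dynamically scoped fastpath context manager.
# Names are illustrative; this is not the actual PyTorch implementation.
import threading
from contextlib import contextmanager

_flags = threading.local()

def _nested_tensor_enabled():
    # Default to enabled when no context manager is active.
    return getattr(_flags, "enable_nested_tensor", True)

@contextmanager
def fastpath(enable_nested_tensor=True):
    # Save, set, and restore the flag so the override is scoped to the
    # `with` block, even if an exception is raised inside it.
    prev = _nested_tensor_enabled()
    _flags.enable_nested_tensor = enable_nested_tensor
    try:
        yield
    finally:
        _flags.enable_nested_tensor = prev

class Encoder:
    def __init__(self, enable_nested_tensor=True):
        # Construction-time preference, as in TransformerEncoder.
        self.enable_nested_tensor = enable_nested_tensor

    def forward(self):
        # The check happens at call time, so the context manager can
        # veto the construction-time setting without rebuilding the model.
        if self.enable_nested_tensor and _nested_tensor_enabled():
            return "nested-tensor fastpath"
        return "regular path"

enc = Encoder(enable_nested_tensor=True)
print(enc.forward())                    # fastpath taken
with fastpath(enable_nested_tensor=False):
    print(enc.forward())                # fastpath vetoed inside the scope
print(enc.forward())                    # setting restored after the scope exits
```

The context manager acts as one more conjunct in the dispatch decision, which mirrors how the SDPA kernel context manager composes with hardware availability checks.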
@mikaylagawarecki I'd appreciate your suggestions for naming. I agree that ATFP isn't particularly obvious; I was concerned that "fastpath" alone was too generic (there are many operators that might have a fastpath, and it's not at all clear it refers to transformers). We could also give it the longer name transformer_fastpath_…
You asked why some flags override construction-time settings. When we first introduced those settings, we wanted to give users control. Since then it has turned out that this may not be enough: a model might be constructed once, but users may later want to run it without the fastpath.
Two examples:
A - A recent change disabled the fastpath for scripted models, because edge devices don't include the operator. Takeaway: users build the model with the fastpath (or receive one that was so built), and the current "fix" disables the fastpath for every TorchScript user, if I understand things correctly!
B - For the FSDP unit tests, the model is built with enable_nested_tensor, but accuracy checks are performed to verify bitwise accuracy. That is bound to fail, because the fastpath triggers different rounding even when the execution itself is mathematically equivalent. See the failure for #106668, which is resolved by the present PR.
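The bitwise-accuracy concern in example B can be illustrated without PyTorch at all: two mathematically equivalent computations that round differently will not compare bitwise-equal. A minimal standalone illustration (not taken from the PR or its tests):

```python
# Two mathematically equivalent sums of the same ten values: naive
# left-to-right summation vs. math.fsum, which tracks intermediate
# rounding error. The real-number results are identical, but the
# floating-point results differ in the last bit, so a bitwise-equality
# check fails even though neither result is "wrong".
import math

values = [0.1] * 10

naive = sum(values)        # accumulates a rounding error at each step
exact = math.fsum(values)  # correctly rounded sum

print(naive == exact)      # False: same math, different rounding
print(abs(naive - exact))  # tiny, but nonzero
```

This is the same failure mode as running one side of an accuracy check through a fastpath kernel and the other side through the math path: the divergence comes from rounding order, not from an actual bug.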
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Create fastpath backend context manager, similar to SDPA kernel backend manager (pytorch#107163)

Summary: Create fastpath backend context manager, similar to SDPA kernel backend manager

ghstack-source-id: 208858046
exported-using-ghexport

Test Plan: sandcastle, github

Reviewed By: osalpekar

Differential Revision: D48325593
Summary:

- (… desired startup/default polarity of a flag)
- (similar to the SDPA kernel backend manager, to give users instant familiarity with the mechanism)
- (… would otherwise break this FSDP test when the executions are performed using different kernels; the divergence shows up in the test not through an error, but through the use of different kernels with different FP rounding characteristics, etc.)

Test Plan: sandcastle, github

Differential Revision: D48325593
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @kiukchung @d4l3k @LucasLLC