
[pytorch] Disable fast path in MultiheadAttention in Export#106824

Closed
guangy10 wants to merge 1 commit into pytorch:main from guangy10:export-D48169806

Conversation

@guangy10
Contributor

@guangy10 guangy10 commented Aug 8, 2023

Summary:
We are seeing the aten._native_multi_head_attention op (not in the core ATen op set) left in the exported graph, which causes problems downstream at runtime.

Two proposed solutions:

  1. Disable the fast path while tracing, leveraging the non-optimized path to get the decomposition; that way, the offending op won't show up in the exported graph
  2. Add a decomp rule for aten._native_multi_head_attention

After discussing with kimishpatel and bdhirsh, #1 is preferred; it was verified to immediately unblock the critical model enablement work for PP.

Test Plan: CI

Differential Revision: D48169806
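A minimal sketch of how solution 1 behaves, assuming a hypothetical tracing flag and path names (this is illustrative control flow, not the actual MultiheadAttention implementation):

```python
# Illustrative control flow for solution 1: gate the fused fast path on
# whether we are tracing. All names here are hypothetical stand-ins.

def multi_head_attention_forward(query, is_tracing):
    if is_tracing:
        # Slow path: composed of core ATen ops, so the exported graph
        # contains only decomposable operators.
        return ("decomposed", query)
    # Fast path: a single fused op (aten._native_multi_head_attention)
    # that is not in the core ATen op set.
    return ("fused", query)
```

Under gating like this, export (which always traces) never sees the fused op, while eager execution keeps the optimized path.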

@pytorch-bot

pytorch-bot bot commented Aug 8, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106824

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 23 New Failures, 3 Unrelated Failures

As of commit 098b67e:

NEW FAILURES - The following jobs have failed:

UNSTABLE - The following jobs failed, but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D48169806

@guangy10 guangy10 added the keep-going (Don't stop on first failure, keep running tests until the end), module: export, and release notes: export labels Aug 8, 2023
@guangy10 guangy10 changed the title [pytorch] Disable fast path for export [pytorch] Disable fast path in MultiheadAttention in Export Aug 8, 2023
@guangy10 guangy10 requested review from bdhirsh and kimishpatel August 8, 2023 23:10
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D48169806

@guangy10
Contributor Author

guangy10 commented Aug 9, 2023

Fixed failures in test_transformers and test_jit_legacy

…106824)

Summary:
Pull Request resolved: pytorch#106824

We are seeing the `aten._native_multi_head_attention` op (not in the core ATen op set) left in the exported graph, which causes problems downstream at runtime.

Two proposed solutions:
 1. Disable the fast path while tracing, leveraging the non-optimized path to get the decomposition; that way, the offending op won't show up in the exported graph
 2. Add a decomp rule for `aten._native_multi_head_attention`

After discussing with kimishpatel and bdhirsh, #1 is preferred; it was verified to immediately unblock the critical model enablement work for PP.

Test Plan: CI

Reviewed By: kimishpatel

Differential Revision: D48169806

fbshipit-source-id: e82be1ab24659a976554775c7c362a00c827416e
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D48169806

@guangy10
Contributor Author

guangy10 commented Aug 9, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Aug 9, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@guangy10
Contributor Author

guangy10 commented Aug 9, 2023

Verified that these tests are broken on base/trunk.

FAILED [0.0076s] test/test_autograd.py::TestAutogradFallback::test_base_does_not_require_grad_mode_nothing - RuntimeError: new_fn INTERNAL ASSERT FAILED at "/home/guangyang/pytorch/torch/csrc/autograd/variable.cpp":176, please report a bug to PyTorch.
FAILED [0.0010s] test/test_autograd.py::TestAutogradFallback::test_base_does_not_require_grad_mode_warn - RuntimeError: This is not allowed since there's already a kernel registered from python overriding foo's behavior for CPU dispatch key and _test_autograd_fallback namespace.
FAILED [0.0009s] test/test_autograd.py::TestAutogradFallback::test_composite_registered_to_cpu_mode_nothing - RuntimeError: This is not allowed since there's already a kernel registered from python overriding foo's behavior for CPU dispatch key and _test_autograd_fallback namespace.
FAILED [0.0030s] test/test_autograd.py::TestAutogradFallback::test_composite_registered_to_cpu_mode_warn - RuntimeError: Tried to register an operator (_test_autograd_fallback::foo(Tensor self) -> Tensor) with the same name and overload name multiple times. Each overload's schema should on...

@guangy10
Contributor Author

guangy10 commented Aug 9, 2023

Verified those tests are broken on base/trunk.

FAILED [0.0586s] test/dynamo/test_dynamic_shapes.py::DynamicShapesReproTests::test_dynamic_shapes_float_guard_dynamic_shapes - Failed: Unexpected success
FAILED [0.6717s] test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_aot_sequence_nr - AssertionError: 'SeqN[542 chars]aten.ones_like.default|\n12|aten.expand.defaul[368 chars]t|\n' != 'SeqN[542 chars]aten.expand.default|\n12|aten.div.Scalar|\n11|[340 chars]t|\n'
FAILED [0.4966s] test/dynamo/test_dynamic_shapes.py::DynamicShapesAotAutogradFallbackTests::test_aot_sequence_nr_dynamic_shapes - AssertionError: 'SeqN[542 chars]aten.ones_like.default|\n12|aten.expand.defaul[368 chars]t|\n' != 'SeqN[542 chars]aten.expand.default|\n12|aten.div.Scalar|\n11|[340 chars]t|\n'

Those tests are actually flaky. If you run them one by one, you may see them pass sometimes. In any case, the failures are not relevant to this PR.

@guangy10
Contributor Author

guangy10 commented Aug 9, 2023

@pytorchbot merge -ic

@pytorch-bot

pytorch-bot bot commented Aug 9, 2023

The -ic flag is deprecated; please use -i instead for the same effect.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 19 checks: pull / linux-jammy-py3.9-clang12-asan / test (default, 2, 6, linux.4xlarge), pull / linux-jammy-py3.9-clang12-asan / test (default, 3, 6, linux.4xlarge), pull / linux-jammy-py3.9-clang12-asan / test (default, 5, 6, linux.4xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 2, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 3, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (crossref, 2, 2, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (default, 2, 3, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (crossref, 1, 2, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (crossref, 2, 2, linux.2xlarge), pull / linux-focal-py3.8-gcc7 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu), trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 3, 3, macos-m1-12)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@guangy10
Contributor Author

guangy10 commented Aug 9, 2023

The distributed/test_distributed_spawn.py failures don't seem to be relevant.

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 22 checks: pull / linux-jammy-py3.9-clang12-asan / test (default, 2, 6, linux.4xlarge), pull / linux-jammy-py3.9-clang12-asan / test (default, 3, 6, linux.4xlarge), pull / linux-jammy-py3.9-clang12-asan / test (default, 5, 6, linux.4xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 2, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 3, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (crossref, 2, 2, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (default, 2, 3, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (crossref, 1, 2, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (crossref, 2, 2, linux.2xlarge), pull / linux-focal-py3.8-gcc7 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu), pull / linux-bionic-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu), trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 3, 3, macos-m1-12), trunk / linux-focal-rocm5.6-py3.8 / test (default, 2, 3, linux.rocm.gpu, unstable), trunk / linux-focal-rocm5.6-py3.8 / test (default, 3, 3, linux.rocm.gpu, unstable)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / win-vs2019-cpu-py3 / test (default, 3, 3, windows.4xlarge.nonephemeral), trunk / linux-bionic-cuda12.1-py3.10-gcc9 / test (nogpu_AVX512, 1, 1, linux.2xlarge)

Details for Dev Infra team Raised by workflow job

@guangy10
Contributor Author

guangy10 commented Aug 9, 2023

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 25 checks: pull / linux-jammy-py3.9-clang12-asan / test (default, 2, 6, linux.4xlarge), pull / linux-jammy-py3.9-clang12-asan / test (default, 3, 6, linux.4xlarge), pull / linux-jammy-py3.9-clang12-asan / test (default, 5, 6, linux.4xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 2, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (default, 3, 3, linux.2xlarge), pull / linux-bionic-py3.11-clang9 / test (crossref, 2, 2, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (default, 2, 3, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (crossref, 1, 2, linux.2xlarge), pull / linux-bionic-py3.8-clang9 / test (crossref, 2, 2, linux.2xlarge), pull / linux-focal-py3.8-gcc7 / test (default, 1, 3, linux.2xlarge), pull / linux-bionic-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu), pull / linux-bionic-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu), pull / linux-bionic-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu), trunk / macos-12-py3-arm64 / test (default, 1, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 2, 3, macos-m1-12), trunk / macos-12-py3-arm64 / test (default, 3, 3, macos-m1-12), trunk / win-vs2019-cpu-py3 / test (default, 3, 3, windows.4xlarge.nonephemeral), trunk / linux-focal-rocm5.6-py3.8 / test (default, 2, 3, linux.rocm.gpu, unstable), trunk / linux-focal-rocm5.6-py3.8 / test (default, 3, 3, linux.rocm.gpu, unstable), trunk / linux-bionic-cuda12.1-py3.10-gcc9 / test (nogpu_AVX512, 1, 1, linux.2xlarge), trunk / linux-bionic-cuda12.1-py3.10-gcc9 / test (nogpu_NO_AVX2, 1, 1, linux.2xlarge)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@guangy10 guangy10 deleted the export-D48169806 branch August 10, 2023 16:34
@mikekgfb
Contributor

The Inference Fastpath context manager in #107014 might offer a more general solution for controlling which kernels are used (similar to the sdp_kernel context manager).

return False


def _is_make_fx_tracing():
Collaborator


Why is this code in torch.nn? That doesn't look like the right place for it, and there is a 100% chance this will get broken if proxy mode is modified, since no one will expect to update this file.
Could you please move this utility into fx?

Collaborator


FYI @bdhirsh

Contributor Author


Yeah, it is a hack suggested by @bdhirsh to temporarily unblock ExecuTorch during model tracing. #107014 seems to plan to handle this in a nicer way.

Collaborator


That other PR is not removing this utility though?

Contributor


Agree with @albanD that we should move this into fx. @albanD, any suggestions on where it can go?

Collaborator


I will defer to @bdhirsh for the final answer, but next to the make_fx API would be my first suggestion.

@mikekgfb
Contributor

mikekgfb commented Aug 16, 2023 via email

@kimishpatel
Contributor

@mikekgfb the PR you pointed to requires a context manager to trigger a certain path; in that sense it may be a user-controlled context manager. The case here is different: when exporting the model, the user is not making the choice. In fact, it is not the user's decision at all. Which path to take, fast or slow decomposed, is based on the tracing mode.

Cyril-Anto pushed a commit to Cyril-Anto/pytorch that referenced this pull request Aug 17, 2023
…106824)

Summary:
We are seeing the `aten._native_multi_head_attention` op (not in the core ATen op set) left in the exported graph, which causes problems downstream at runtime.

Two proposed solutions:
 1. Disable the fast path while tracing, leveraging the non-optimized path to get the decomposition; that way, the offending op won't show up in the exported graph
 2. Add a decomp rule for `aten._native_multi_head_attention`

After discussing with kimishpatel and bdhirsh, #1 is preferred; it was verified to immediately unblock the critical model enablement work for PP.

Test Plan: CI

Differential Revision: D48169806

Pull Request resolved: pytorch#106824
Approved by: https://github.com/kimishpatel
anijain2305 added a commit that referenced this pull request Oct 9, 2025
…MultiHeadAttention for strict export"


In #106824, export decided to take the slow path for the MultiHeadAttention module (see the PR description for why). But that PR eventually caused a divergence between Dynamo and export.

Today, strict export does not inline into builtin modules (like MultiHeadAttention), and therefore make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn module, and at that point the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes the fast path, resulting in a different op being called.

This divergence is undesirable. There are two ways to fix it:

1) Make export take the fast path - As explained in #106824, this might be difficult, so we go with (2).
2) Make compile take the slow path as well - This is easy to implement. The con is that PyTorch eager and compile will use different operators, which can cause numerics issues, etc.

Since (2) is easy to do, we will follow this path. We are tracking the issue in #164062.


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela

[ghstack-poisoned]
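The divergence described in the commit message above, and fix (2), can be sketched with hypothetical flags (all names illustrative):

```python
# Hypothetical model of the path selection. Strict export traces the
# un-inlined module, so the tracing check fires; compile inlines through
# the module, the check is False, and the fused fast path fires instead.
# Fix (2) forces both to the slow path.

def attention_op(inlined_by_compile, is_make_fx_tracing, force_slow=False):
    if force_slow or (is_make_fx_tracing and not inlined_by_compile):
        return "slow_decomposed_path"
    return "fast_fused_path"

# Before the fix, export and compile diverge:
export_path = attention_op(inlined_by_compile=False, is_make_fx_tracing=True)
compile_path = attention_op(inlined_by_compile=True, is_make_fx_tracing=False)
```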
anijain2305 added a commit that referenced this pull request Oct 9, 2025
…on for strict export"


In #106824, export decided to take the slow path for the MultiHeadAttention module (see the PR description for why). But that PR eventually caused a divergence between Dynamo and export.

Today, strict export does not inline into builtin modules (like MultiHeadAttention), and therefore make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn module, and at that point the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes the fast path, resulting in a different op being called.

This divergence is undesirable. There are two ways to fix it:

1) Make export take the fast path - As explained in #106824, this might be difficult, so we go with (2).
2) Make compile take the slow path as well - This is easy to implement. The con is that PyTorch eager and compile will use different operators, which can cause numerics issues, etc.

Since (2) is easy to do, we will follow this path. We are tracking the issue in #164062.


cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Oct 9, 2025
…ct export (#164721)

In #106824, export decided to take the slow path for the MultiHeadAttention module (see the PR description for why). But that PR eventually caused a divergence between Dynamo and export.

Today, strict export does not inline into builtin modules (like MultiHeadAttention), and therefore make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn module, and at that point the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes the fast path, resulting in a different op being called.

This divergence is undesirable. There are two ways to fix it:

1) Make export take the fast path - As explained in #106824, this might be difficult, so we go with (2).
2) Make compile take the slow path as well - This is easy to implement. The con is that PyTorch eager and compile will use different operators, which can cause numerics issues, etc.

Since (2) is easy to do, we will follow this path. We are tracking the issue in #164062.

Pull Request resolved: #164721
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025
…ct export (pytorch#164721)

In pytorch#106824, export decided to take the slow path for the MultiHeadAttention module (see the PR description for why). But that PR eventually caused a divergence between Dynamo and export.

Today, strict export does not inline into builtin modules (like MultiHeadAttention), and therefore make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn module, and at that point the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes the fast path, resulting in a different op being called.

This divergence is undesirable. There are two ways to fix it:

1) Make export take the fast path - As explained in pytorch#106824, this might be difficult, so we go with (2).
2) Make compile take the slow path as well - This is easy to implement. The con is that PyTorch eager and compile will use different operators, which can cause numerics issues, etc.

Since (2) is easy to do, we will follow this path. We are tracking the issue in pytorch#164062.

Pull Request resolved: pytorch#164721
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan

Labels

ciflow/trunk (Trigger trunk jobs on your pull request), fb-exported, keep-going (Don't stop on first failure, keep running tests until the end), Merged, release notes: export

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants