
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state #177095

Closed
mlazos wants to merge 6 commits into gh/mlazos/144/base from gh/mlazos/144/head

Conversation

@mlazos
Contributor

@mlazos mlazos commented Mar 10, 2026

Fixes #172088

Stack from ghstack (oldest at bottom):

Previously, torch_function_mode_stack_state_mgr only cleared the C-level
mode stack during trace_frame (via preserve_global_state). This meant
compilation infrastructure running outside tracing — guard building, global
state cleanup — would trigger real __torch_function__ dispatch, mutating
mode state (e.g. incrementing a counter) and causing the compile-time guard
verification to fail with "Guard failed on the same frame it was created".

This change moves the mode stack save/clear/restore up to compile_inner so
modes are off the C stack for the entire compilation pipeline. For guard
building, modes are temporarily restored so guard expressions can reference
them, but DisableTorchFunction prevents dispatch during construction.
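As a toy illustration of the failure mode (plain Python, not real torch internals; `CounterMode`, `build_guard`, and `dispatch` are hypothetical stand-ins), a guard built while the mode is still live on the stack fails as soon as incidental dispatch mutates the mode's state:

```python
# Toy model of the bug: a mode with mutable state is still "on the stack"
# while compilation infrastructure runs, so incidental dispatch mutates it
# between guard construction and guard verification.

class CounterMode:
    """Stand-in for a TorchFunctionMode that mutates state on every dispatch."""
    def __init__(self):
        self.count = 0

    def dispatch(self):  # analogous to __torch_function__ firing
        self.count += 1

mode_stack = [CounterMode()]  # stand-in for the C-level mode stack

def build_guard(stack):
    # The guard snapshots the mode's state at compile time ...
    expected = stack[0].count
    return lambda: stack[0].count == expected

guard = build_guard(mode_stack)

# ... but guard building / global-state cleanup still triggers real
# dispatch, because the mode was never taken off the stack:
mode_stack[0].dispatch()

guard()  # False: "Guard failed on the same frame it was created"
```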

Co-authored-by: Claude <noreply@anthropic.com>

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo

@pytorch-bot

pytorch-bot bot commented Mar 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177095

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 8 Unrelated Failures

As of commit 281adc2 with merge base 1fd1814:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added ciflow/inductor ciflow/torchtitan Run TorchTitan integration tests module: dynamo labels Mar 10, 2026
@pytorch-bot

pytorch-bot bot commented Mar 10, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mlazos added a commit that referenced this pull request Mar 10, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state

Previously, `torch_function_mode_stack_state_mgr` only cleared the C-level
mode stack during `trace_frame` (via `preserve_global_state`). This meant
compilation infrastructure running outside tracing — guard building, global
state cleanup — would trigger real `__torch_function__` dispatch, mutating
mode state (e.g. incrementing a counter) and causing the compile-time guard
verification to fail with "Guard failed on the same frame it was created".

This change moves the mode stack save/clear/restore up to `compile_inner` so
modes are off the C stack for the entire compilation pipeline. For guard
building, modes are temporarily restored so guard expressions can reference
them, but `DisableTorchFunction` prevents dispatch during construction.

Co-authored-by: Claude <noreply@anthropic.com>
ghstack-source-id: 1fbed6e
Pull-Request: #177095

@mlazos mlazos added release notes: dynamo ciflow/trunk Trigger trunk jobs on your pull request labels Mar 10, 2026
mlazos added a commit that referenced this pull request Mar 10, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state

ghstack-source-id: 1fbed6e
Pull-Request: #177095
mlazos added a commit that referenced this pull request Mar 10, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state

ghstack-source-id: a1db80c
Pull-Request: #177095
@mlazos mlazos requested review from Lucaskabela and williamwen42 and removed request for Lucaskabela March 11, 2026 00:26
Contributor

@Lucaskabela Lucaskabela left a comment


@claude summarize what gaps might exist with this implementation?

@Lucaskabela
Contributor

@claude are the test failures related to this PR?

@claude

claude bot commented Mar 11, 2026

Claude finished @Lucaskabela's task in 4m 48s


Test Failure Analysis

  • Read the PR diff to understand the changes
  • Analyze failing test 1: test_unbacked_dtensor_op_db_kthvalue_cpu_float32
  • Analyze failing test 2: test_shape_id_runtime_assertion_on_mismatch
  • Determine if failures are related to this PR

Neither test failure is related to this PR. Here's the evidence:

1. test_unbacked_dtensor_op_db_kthvalue_cpu_float32 (distributed, linux-jammy-py3.10-gcc11)

  • Error: RuntimeError when making fake tensor call — fails inside get_fake_value when trying to trace torch.kthvalue on a Shard-placed DTensor with unbacked dims.
  • This is a fake tensor / unbacked symint issue in the DTensor op propagation path (_propagate_op_sharding_dispatch_slow_path), completely unrelated to torch function mode stack management.
  • The PR does not modify test/distributed/tensor/test_dtensor_ops.py or any DTensor/fake tensor code.

2. test_shape_id_runtime_assertion_on_mismatch (crossref, linux-jammy-py3.14-clang15)

  • Error: The test expects AssertionError but gets RuntimeError: Runtime assertion failed for expression Eq(u1, u0) — the assertion fires at runtime (via aten._assert_scalar) rather than at compile time as the test expects.
  • The stack trace shows the error routes through common_utils.py:1672 __torch_function__ — this is the crossref test mode (PYTORCH_TEST_WITH_CROSSREF=1), which installs a TorchFunctionMode from the test infrastructure itself.
  • This only fails in the crossref shard specifically (the non-crossref default shards pass), indicating it's a crossref-specific interaction, not a regression from mode stack changes.
  • The PR does not modify test/test_dynamic_shapes.py.

Both failures are pre-existing issues on the base branch (no changes to either test file in this PR: git diff ef1529f..55a1d29 shows no modifications to these files).


Also responding to the inline review comments separately.

@Lucaskabela
Contributor

@claude why is lowering the context call here safe? Don't we risk missing some compilation with context we need?

@claude

claude bot commented Mar 11, 2026

Claude finished @Lucaskabela's task in 2m 0s


  • Read the PR diff and understand the changes
  • Analyze the call chain: _compile → compile_inner → _compile_inner → compile_frame → transform → trace_frame
  • Understand what torch_function_mode_stack_state_mgr does (save/clear/restore)
  • Respond to the question

Why lowering the context manager to compile_inner is safe

The key thing torch_function_mode_stack_state_mgr does is:

  1. On enter: snapshot the current C-level TorchFunctionMode stack into self.stack, then clear the C stack
  2. On exit: restore the C stack from the saved snapshot

The modes aren't lost — they're saved in self.stack for the duration of compilation. The question is whether anything in the compilation pipeline needs real __torch_function__ dispatch to be active.

It doesn't. The entire compilation pipeline (tracing, guard building, code generation) operates on representations of mode state, not on live dispatch:

  1. Tracing (trace_frame): The transform function at convert_frame.py:1368-1369 reads the saved mode list from torch_function_mode_stack_state_mgr.stack (previously it called torch.overrides._get_current_function_mode_stack() which did the same thing, just from the C stack before it was cleared by preserve_global_state). This Python list of mode objects is passed into InstructionTranslator where Dynamo wraps them as TorchFunctionModeVariables. Tracing works on these variables symbolically — it doesn't need the modes to be on the C dispatch stack.

  2. Guard building (convert_frame.py:1664-1671): Guard expressions do need to reference mode objects (e.g., checking type(mode_stack[0]) or reading mode attributes). The PR handles this via temp_restore_stack() + DisableTorchFunction() — modes are briefly put back on the C stack so guard expressions can evaluate against them, but DisableTorchFunction prevents any actual __torch_function__ dispatch from firing during construction.

  3. Everything else (restart handling, code validation, cleanup): These don't interact with torch function dispatch at all.

The previous placement in preserve_global_state (which decorates trace_frame) was actually too narrow — it only covered tracing, leaving guard building and post-trace work exposed to live dispatch. That's exactly the bug this PR fixes: modes with mutable state (like the CounterMode test) would get their __torch_function__ called during guard construction, mutating state and causing the "guard failed on the same frame it was created" error.

So moving the clear/restore scope to compile_inner is strictly more correct — it covers the full compilation pipeline while still providing the mode list to everything that needs it.
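The scoping described above can be sketched as a plain-Python toy analogue (all names here are simplified stand-ins, not torch internals: `stack_state_mgr` mimics `torch_function_mode_stack_state_mgr`, `temp_restore_stack` mimics the PR's guard-building restore, and the `dispatch_enabled` flag mimics `DisableTorchFunction`):

```python
from contextlib import contextmanager

mode_stack = []          # stand-in for the C-level TorchFunctionMode stack
dispatch_enabled = True  # stand-in for DisableTorchFunction being inactive

@contextmanager
def stack_state_mgr():
    # Save and clear for the ENTIRE compilation pipeline (compile_inner scope).
    saved = mode_stack[:]
    mode_stack.clear()
    try:
        yield saved
    finally:
        mode_stack[:] = saved  # restore after compilation

@contextmanager
def temp_restore_stack(saved):
    # Guard building: modes go back on the stack so guard expressions can
    # reference them, but dispatch is disabled so no state mutates.
    global dispatch_enabled
    mode_stack[:] = saved
    dispatch_enabled = False
    try:
        yield
    finally:
        dispatch_enabled = True
        mode_stack.clear()

class CounterMode:
    def __init__(self):
        self.count = 0

    def dispatch(self):  # only fires when on the stack with dispatch enabled
        if dispatch_enabled and self in mode_stack:
            self.count += 1

mode = CounterMode()
mode_stack.append(mode)

with stack_state_mgr() as saved:
    mode.dispatch()                  # tracing/cleanup: off the stack, no-op
    with temp_restore_stack(saved):  # guard building
        mode.dispatch()              # on the stack, but dispatch disabled

# State never mutated during compilation, and the stack is restored after.
assert mode.count == 0 and mode_stack == [mode]
```

The design point the sketch captures: the outer scope guarantees no dispatch anywhere in the pipeline, while the inner scope makes mode objects reachable for guard expressions without re-enabling dispatch.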



Contributor

@Lucaskabela Lucaskabela left a comment


LGTM - claude comments make sense, code is tested, and focused

@Lucaskabela
Contributor

@pytorchmergebot merge -i


@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

mlazos added a commit that referenced this pull request Mar 11, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state

ghstack-source-id: c04d8df
Pull-Request: #177095
@mlazos
Contributor Author

mlazos commented Mar 11, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

mlazos added a commit that referenced this pull request Mar 11, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state

ghstack-source-id: 3aa4f10
Pull-Request: #177095
@mlazos
Contributor Author

mlazos commented Mar 11, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 2, 5, linux.g6.4xlarge.experimental.nvidia.gpu)

Details for Dev Infra team: raised by workflow job

mlazos added a commit that referenced this pull request Mar 12, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state

ghstack-source-id: 3aa4f10
Pull-Request: #177095
mlazos added a commit that referenced this pull request Mar 12, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state

ghstack-source-id: b14b28d
Pull-Request: #177095
@mlazos
Contributor Author

mlazos commented Mar 12, 2026

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.


EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state (pytorch#177095)

Fixes pytorch#172088

Pull Request resolved: pytorch#177095
Approved by: https://github.com/Lucaskabela
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Revert "[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state (pytorch#177095)"

This reverts commit a65094f.

Reverted pytorch#177095 on behalf of https://github.com/pytorch-auto-revert: reverted automatically by pytorch's autorevert; to avoid this behaviour add the tag autorevert: disable.
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
[dynamo] Fix torch.compile crash with TorchFunctionMode that has mutable state (pytorch#177095)

Fixes pytorch#172088

Pull Request resolved: pytorch#177095
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42

Labels

ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: dynamo, release notes: dynamo, Reverted
