Fix: Ensure internal ApplyTemplate uses modern autograd API for torch.func.grad + compile by dumko2001 · Pull Request #169786 · pytorch/pytorch

dumko2001 · 2025-12-07T10:05:20Z

Addresses #169783.

This fixes a RuntimeError when using torch.compile(torch.func.grad(...)) on a function with nested calls involving a custom torch.autograd.Function (e.g., f(f(x))).

Root Cause:
The crash was caused by an internal helper class, ApplyTemplate, defined dynamically in torch/_functorch/autograd_function.py. This class was using the legacy autograd.Function API (def forward(ctx, *args):) and lacked the required setup_context method for AOTAutograd/functorch compatibility.

The Fix:
Refactored ApplyTemplate to adhere to the modern autograd.Function API:

Changed forward signature to def forward(*args):.
Introduced def setup_context(ctx, inputs, output): and moved context-setting logic (like ctx.mark_non_differentiable) into it.
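
For readers unfamiliar with the two styles, here is a minimal sketch of the modern API shape described above. It is not the internal ApplyTemplate code; `Square` and `loss` are hypothetical names used only for illustration. `forward` takes no `ctx`, and all context work happens in `setup_context`.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical user-level Function written in the modern style."""

    @staticmethod
    def forward(x):
        # No ctx argument here: forward only computes the output.
        return x * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        # Context-setting logic (saving tensors, marking outputs
        # non-differentiable, etc.) lives here instead of in forward.
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out


def loss(x):
    return Square.apply(x).sum()


x = torch.tensor([1.0, 2.0, 3.0])
g = torch.func.grad(loss)(x)  # gradient of sum(x**2) with respect to x
```

The same split applies to calls such as `ctx.mark_non_differentiable`, which the PR description says were moved into `setup_context`.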

Testing:
A new regression test, test_compile_grad_nested_autograd_function, has been added to test/functorch/test_aotdispatch.py to ensure the issue does not regress and the compiled result is correct against the eager output.
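
The test body is not shown in this thread, so the following is only a sketch of what such a check might look like, assuming a simple squaring Function. `Square` and `nested` are hypothetical names, and the `try`/`except` is there because builds without the fix are reported to raise during compilation.

```python
import torch


class Square(torch.autograd.Function):
    """Hypothetical custom op; not the Function used in the real test."""

    @staticmethod
    def forward(x):
        return x * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out


def nested(x):
    # f(f(x)): the custom Function applied to its own output.
    return Square.apply(Square.apply(x)).sum()


x = torch.tensor([1.0, 2.0])
eager = torch.func.grad(nested)(x)  # analytically 4 * x**3

try:
    compiled_fn = torch.compile(torch.func.grad(nested), backend="aot_eager")
    compiled = compiled_fn(x)
    status = "match" if torch.allclose(compiled, eager) else "mismatch"
except Exception as exc:  # affected builds are expected to land here
    status = f"compile failed: {type(exc).__name__}"

print(status)
```

Comparing against the eager gradient rather than a hard-coded value keeps the check valid if the Function under test changes.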

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @kadeng @chauhang @amjames @Lucaskabela @jataylo @mlazos

pytorch-bot · 2025-12-07T10:05:25Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/169786

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 2d217c7 with merge base 143c71a:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / cuda12.8-py3.10-gcc11-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh) (trunk failure)
MISSING REGRESSION TEST
pull / cuda13.0-py3.10-gcc11-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh) (trunk failure)
MISSING REGRESSION TEST

This comment was automatically generated by Dr. CI and updates every 15 minutes.

dumko2001 · 2025-12-07T10:07:34Z

@pytorchbot label "bug" "release notes: dynamo"

pytorch-bot · 2025-12-07T10:07:42Z

Didn't find following labels among repository labels: bug

dumko2001 · 2025-12-07T10:08:22Z

@pytorchbot label "topic: not user facing"

soulitzer · 2025-12-12T22:05:56Z

        self.skipTest("Skipping because it fails in strict cache mode")


+class TestCompileNestedAutograd(TestCase):


Maybe put this in TestCompileTransforms in `test/functorch/test_eager_transforms.py` instead?

zou3519 · 2025-12-16

Maybe put this into https://github.com/pytorch/pytorch/blob/main/test/dynamo/test_autograd_function.py please and use backend=aot_eager

dumko2001 · 2025-12-18

@zou3519 Done! I've moved the test to test/dynamo/test_autograd_function.py and updated the torch.compile call to use backend="aot_eager" as requested.

soulitzer

Thanks, had a small comment on test location
Any idea why this only happens when the function is recursively called?

linux-foundation-easycla · 2025-12-18T16:17:33Z

The committers listed above are authorized under a signed CLA.

✅ login: dumko2001 / name: dumko2001 (2d217c7)

dumko2001 · 2025-12-18T16:18:26Z

> Thanks, had a small comment on test location
> Any idea why this only happens when the function is recursively called?

@soulitzer Regarding the recursion: The recursion causes the internal ApplyTemplate (which wraps the user's autograd.Function) to be traced by func.grad in a nested context. The issue was that the internal implementation used the legacy forward(ctx, ...) signature. AOTAutograd enforces the modern autograd.Function API (static forward without ctx + setup_context) to correctly handle tensor saving and context management during these complex graph captures. The legacy signature prevented setup_context from being properly invoked, leading to the crash.
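
The user-facing side of this requirement can be seen in eager mode. The sketch below assumes a PyTorch 2.x build; `LegacySquare` is a hypothetical Function written with the legacy `forward(ctx, ...)` signature. It works under plain autograd, but `torch.func.grad` is expected to reject it because `setup_context` is not overridden. This illustrates the API constraint only; it does not reproduce the internal ApplyTemplate failure.

```python
import torch


class LegacySquare(torch.autograd.Function):
    """Hypothetical Function using the legacy forward(ctx, ...) style."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out


x = torch.tensor([1.0, 2.0])

# Plain autograd accepts the legacy signature.
xr = x.clone().requires_grad_(True)
LegacySquare.apply(xr).sum().backward()

# Under a functorch transform, a Function without setup_context is rejected.
try:
    torch.func.grad(lambda t: LegacySquare.apply(t).sum())(x)
    raised = False
except RuntimeError as exc:
    raised = True
    message = str(exc)
```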

zou3519

lgtm if tests pass

zou3519 · 2025-12-20T01:47:03Z

still waiting for tests to pass

dumko2001 · 2025-12-21T09:27:08Z

@zou3519 @soulitzer

I have analyzed the pr_time_benchmarks failures. The logic tests (including the new regression test) are passing, but compile_time_instruction_count has regressed on basic_modules_ListOfLinears_inductor.

Root Cause Analysis
This regression is a deterministic side-effect of the fix, not an optimization failure.

Legacy Implementation: ApplyTemplate used the legacy forward(ctx, *args) signature. This resulted in a single Python function call during compilation.
New Implementation: To fix the RuntimeError and satisfy AOTAutograd/functorch requirements, I refactored ApplyTemplate to the modern torch.autograd.Function API.
- This splits execution into two distinct calls: forward(*args) and setup_context(ctx, inputs, output).
- The Autograd engine now incurs additional overhead to dispatch setup_context, unpack inputs, and manage the context object explicitly.
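
As a rough illustration of the "two distinct calls" point, the sketch below counts Python-level invocations for a modern-style Function in eager mode. `CountedSquare` and `calls` are hypothetical names. This does not measure `compile_time_instruction_count`; it only shows that one `apply` dispatches both `forward` and `setup_context`.

```python
import torch

calls = {"forward": 0, "setup_context": 0}


class CountedSquare(torch.autograd.Function):
    """Hypothetical Function instrumented with call counters."""

    @staticmethod
    def forward(x):
        calls["forward"] += 1
        return x * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        calls["setup_context"] += 1
        (x,) = inputs
        ctx.save_for_backward(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out


x = torch.tensor([3.0], requires_grad=True)
CountedSquare.apply(x).sum().backward()
print(calls)
```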

Conclusion
The instruction count increase corresponds to the mandatory overhead of the modern Autograd API (specifically the setup_context mechanism). This structural change is the only way to support nested torch.func.grad transformations correctly.

Since this overhead is intrinsic to the fix and the logic is verified, could you please update the benchmarks or merge with this known baseline shift?

github-actions · 2026-02-19T09:47:43Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

Fix torch.compile failure with torch.func.grad nested call (Issue pyt…

7c6f2c6

…orch#169783) Refactors ApplyTemplate in autograd_function.py to implement setup_context, required for functorch/compile compatibility. Adds regression test TestCompileNestedAutograd.

dumko2001 requested review from Chillee and ezyang as code owners December 7, 2025 10:05

pytorch-bot Bot added the release notes: dynamo label Dec 7, 2025

pytorch-bot Bot added the topic: not user facing label Dec 7, 2025

dumko2001 mentioned this pull request Dec 7, 2025

torch.compile fails for torch.func.grad with nested function:f(f(x)) #169783

Closed

pytorchbot added the open source label Dec 7, 2025

soulitzer reviewed Dec 12, 2025

View reviewed changes

soulitzer requested a review from zou3519 December 12, 2025 22:07

soulitzer added the triaged label Dec 12, 2025

pytorch-bot Bot added the module: dynamo label Dec 18, 2025

Relocate regression test to test/dynamo/test_autograd_function.py

935a455

dumko2001 force-pushed the fix/issue-169783 branch from 493dfaa to 935a455 Compare December 18, 2025 16:19

dumko2001 requested a review from soulitzer December 18, 2025 16:20

zou3519 reviewed Dec 18, 2025

View reviewed changes

Merge remote-tracking branch 'origin/main' into fix/issue-169783

2d217c7

dumko2001 requested a review from zou3519 December 19, 2025 01:24

github-actions Bot added the Stale label Feb 19, 2026

github-actions Bot closed this Mar 21, 2026
