workarounds for all2all autograd issues that Ruisi ran into #1604

bdhirsh wants to merge 2 commits into gh/bdhirsh/4/base
Conversation
[ghstack-poisoned]
The idea behind this PR is that there appears to be an issue in the way that dynamo is inlining the `_A2A.apply` used in EP:

(1) @ruisizhang123 / @anijain2305 noticed that the MoE code wasn't getting a proper backward graph generated for it when run with compile (notes here: https://docs.google.com/document/d/1l1PfSrhJRXRLa1usBsEh4YSidgJiDU_eVNJ01p7PetM/edit?tab=t.0), running on tip-of-main with this [PR](https://github.com/pytorch/torchtitan/pull/1529/files) and this command:

```
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --model.name deepseekv3_simple_fsdp --profiling.enable_profiling --parallelism.expert_parallel_degree 2 --training.compile
```

(2) The old tlparse showed that there are no `aten._grouped_mm` calls in the backward graph, indicating that we are not properly differentiating the MoE code ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpxMfgwf/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)). If you look at the dynamo graph in that tlparse, though, you'll see nodes for `_c10d_functional.all_to_all_single` and `_c10d_functional.wait_tensor`. This is a problem: the dynamo graph should be capturing the entire autograd.Function as a HOP, since those two ops are not differentiable.

(3) I didn't root-cause why dynamo is "ignoring" the autograd.Function. I have some suspicion it is related to the fact that the autograd.Function in question accepts a `ProcessGroup`. Instead, I just (a) wrote a slightly different version of the autograd.Function that doesn't accept a ProcessGroup, and (b) allow-in-graph'd it for good measure, to guarantee that dynamo won't incorrectly inline it.

As a followup, we should either debug why dynamo is not constructing a HOP for the autograd.Function, or kill the `_A2A.apply` code entirely as per this [comment](#1467 (comment)).

Here's a good tlparse with my changes, where you can see proper `aten._grouped_mm` nodes in the backward graph: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/9b03e768-73a1-4154-b7af-dcd79cfc754d/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

[ghstack-poisoned]
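The workaround in (3) can be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: the class name `_A2ANoPG` and helper `a2a` are made up, and the collective calls (`all_to_all_single` + `wait_tensor` in the real version) are stubbed with identity ops so the sketch runs without a process group. The key structural points are that the autograd.Function takes only tensor arguments (no `ProcessGroup`), and that it is registered with `torch._dynamo.allow_in_graph` so dynamo captures it as a single HOP instead of inlining its forward.

```python
import torch
import torch._dynamo


class _A2ANoPG(torch.autograd.Function):
    """Hypothetical all-to-all autograd.Function that avoids taking a
    ProcessGroup argument (the real version would resolve the group
    internally, e.g. from a module-level name)."""

    @staticmethod
    def forward(ctx, x, out_splits, in_splits):
        # Real version: torch.ops._c10d_functional.all_to_all_single(...)
        # followed by wait_tensor; stubbed as a clone here so this runs
        # in a single process.
        ctx.save_for_backward(out_splits, in_splits)
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Real version: all_to_all_single with the splits swapped.
        # Splits don't require grad, so return None for them.
        return grad_out.clone(), None, None


# Guarantee dynamo treats the whole Function as an opaque HOP rather
# than inlining the (non-differentiable) collective ops into the graph.
torch._dynamo.allow_in_graph(_A2ANoPG)


def a2a(x, out_splits, in_splits):
    return _A2ANoPG.apply(x, out_splits, in_splits)
```

Under this pattern, the backward graph produced by compile should contain the differentiated expert computation (e.g. the `aten._grouped_mm` nodes mentioned above) rather than silently dropping it.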
@bdhirsh I have a fix for the root cause here: pytorch/pytorch#161036
…1036) Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (`requires_grad`) to determine whether to HOP or not: https://github.com/pytorch/pytorch/blob/5ee464db5c4293ac09521f9069fa7d2106680a7f/torch/_dynamo/variables/misc.py#L671

Stared at this with @anijain2305 yesterday: `Tensor.__setitem__` can update tensor metadata, and we can just run the fake prop and extract the output metadata from the updated FakeTensor.

FIXES #160901. It should also be the root cause behind the issue in pytorch/torchtitan#1604 @bdhirsh @ruisizhang123

Pull Request resolved: #161036
Approved by: https://github.com/anijain2305
ghstack dependencies: #160805
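The metadata mutation the fix accounts for can be seen in a minimal eager-mode snippet (my own illustration, not from the PR): `Tensor.__setitem__` is an in-place op, so writing a grad-requiring value into a tensor that did not require grad flips its `requires_grad`. Tracing logic that cached the destination's FakeTensor metadata from before the setitem would therefore draw the wrong conclusion about whether a HOP is needed.

```python
import torch

a = torch.zeros(3)                      # does not require grad initially
b = torch.ones(3, requires_grad=True)

assert not a.requires_grad
a[0] = b[0]                             # in-place write of a grad-requiring value
assert a.requires_grad                  # __setitem__ updated a's autograd metadata
```

Re-running fake-tensor propagation after the setitem, as the fix does, picks up the updated `requires_grad` instead of the stale cached value.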
@bdhirsh @xmfan is this PR closeable? Looks like @xmfan landed pytorch/pytorch#161036 already.