workarounds for all2all autograd issues that Ruisi ran into #1604

bdhirsh wants to merge 2 commits into gh/bdhirsh/4/base
Conversation
[ghstack-poisoned]
The idea behind this PR is that there appears to be an issue in the way that dynamo is inlining the `_A2A.apply` used in EP:

(1) @ruisizhang123 / @anijain2305 noticed that the MoE code wasn't getting a proper backward graph generated for it when run with compile (notes here: https://docs.google.com/document/d/1l1PfSrhJRXRLa1usBsEh4YSidgJiDU_eVNJ01p7PetM/edit?tab=t.0), running on tip-of-main with this [PR](https://github.com/pytorch/torchtitan/pull/1529/files) and this command:

```
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --model.name deepseekv3_simple_fsdp --profiling.enable_profiling --parallelism.expert_parallel_degree 2 --training.compile
```

(2) The old tlparse showed that there are no `aten._grouped_mm` calls in the backward graph, indicating that we are not properly differentiating the MoE code ([tlparse](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpxMfgwf/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)). If you look at the dynamo graph in that tlparse, though, you'll see nodes for `_c10d_functional.all_to_all_single` and `_c10d_functional.wait_tensor`. This is a problem: the dynamo graph should be capturing the entire autograd.Function as a HOP, since those two ops are not differentiable.

(3) I didn't root-cause why dynamo is "ignoring" the autograd.Function. I have some suspicion it is related to the fact that the autograd.Function in question accepts a `ProcessGroup`. Instead, I just (a) wrote a slightly different version of the autograd.Function that doesn't accept a ProcessGroup, and (b) allow-in-graph'd it for good measure, to guarantee that dynamo won't incorrectly inline it.

As a followup, we should either debug why dynamo is not constructing a HOP for the autograd.Function, or kill the `_A2A.apply` code entirely as per this [comment](#1467 (comment)).

Here's a good tlparse with my changes, where you can see proper `aten._grouped_mm` nodes in the backward graph: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/9b03e768-73a1-4154-b7af-dcd79cfc754d/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

[ghstack-poisoned]
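The workaround in (3) can be sketched roughly as below. This is a hypothetical illustration, not the PR's actual code: the class name `_A2ANoPG` and helper `a2a` are made up, and the collective calls (`all_to_all_single` + `wait_tensor` in the real version) are stubbed with identity ops so the sketch runs without a process group. The key structural points are that the autograd.Function takes only tensor arguments (no `ProcessGroup`), and that it is registered with `torch._dynamo.allow_in_graph` so dynamo captures it as a single HOP instead of inlining its forward.

```python
import torch
import torch._dynamo


class _A2ANoPG(torch.autograd.Function):
    """Hypothetical all-to-all autograd.Function that avoids taking a
    ProcessGroup argument (the real version would resolve the group
    internally, e.g. from a module-level name)."""

    @staticmethod
    def forward(ctx, x, out_splits, in_splits):
        # Real version: torch.ops._c10d_functional.all_to_all_single(...)
        # followed by wait_tensor; stubbed as a clone here so this runs
        # in a single process.
        ctx.save_for_backward(out_splits, in_splits)
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Real version: all_to_all_single with the splits swapped.
        # Splits don't require grad, so return None for them.
        return grad_out.clone(), None, None


# Guarantee dynamo treats the whole Function as an opaque HOP rather
# than inlining the (non-differentiable) collective ops into the graph.
torch._dynamo.allow_in_graph(_A2ANoPG)


def a2a(x, out_splits, in_splits):
    return _A2ANoPG.apply(x, out_splits, in_splits)
```

Under this pattern, the backward graph produced by compile should contain the differentiated expert computation (e.g. the `aten._grouped_mm` nodes mentioned above) rather than silently dropping it.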
@bdhirsh I have a fix for the root cause here: pytorch/pytorch#161036
…1036) Fixes silent incorrectness for autograd function tracing, where we rely on FakeTensor metadata (`requires_grad`) to determine whether to HOP or not: https://github.com/pytorch/pytorch/blob/5ee464db5c4293ac09521f9069fa7d2106680a7f/torch/_dynamo/variables/misc.py#L671

Stared at this with @anijain2305 yesterday: `Tensor.__setitem__` can update tensor metadata, and we can just run the fake prop and extract the output metadata from the updated FakeTensor.

FIXES #160901. It should also be the root cause behind the issue in pytorch/torchtitan#1604 @bdhirsh @ruisizhang123

Pull Request resolved: #161036
Approved by: https://github.com/anijain2305
ghstack dependencies: #160805
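The metadata mutation the fix accounts for can be seen in a minimal eager-mode snippet (my own illustration, not from the PR): `Tensor.__setitem__` is an in-place op, so writing a grad-requiring value into a tensor that did not require grad flips its `requires_grad`. Tracing logic that cached the destination's FakeTensor metadata from before the setitem would therefore draw the wrong conclusion about whether a HOP is needed.

```python
import torch

a = torch.zeros(3)                      # does not require grad initially
b = torch.ones(3, requires_grad=True)

assert not a.requires_grad
a[0] = b[0]                             # in-place write of a grad-requiring value
assert a.requires_grad                  # __setitem__ updated a's autograd metadata
```

Re-running fake-tensor propagation after the setitem, as the fix does, picks up the updated `requires_grad` instead of the stale cached value.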
@bdhirsh @xmfan is this PR closeable? Looks like @xmfan landed pytorch/pytorch#161036 already.