[Bugfix] Fix by copying sym shapes as needed #178210
Conversation
NOTE: this should fix the issue mentioned in pytorch/torchtitan#2638 and allow us to re-enable compile in torch RL.
Force-pushed from a59f785 to fd4c7ec
There was a question from @pianpwk on the issue / a previous iteration of this PR along the lines of "Why does this cause an error?" The answer is that we concretely generate an incorrect graph, because we have two symbols representing sequence length (one at a global level, one at a local level) and we cannot assert that the two are equal. We validate in the second test case added that narrow does not affect correctness here, and is a no-op when the sizes are already the same - it is mostly for symbolic propagation.
```python
if local_tensor.size(self.dim) is not logical_dim_size:
    # torch._check ensures correctness
    torch._check(local_tensor.size(self.dim) >= logical_dim_size)
    local_tensor = local_tensor.narrow(self.dim, 0, logical_dim_size)
```
Just confirming: this operation should never actually change the real shape of `local_tensor`, right? It is strictly here to add information to the symbolic shapes?
That is correct - I added an explicit assert stating this.
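A minimal standalone sketch of this point with plain tensors (illustrative sizes, outside DTensor): narrowing a tensor to its own full length along a dim leaves the real shape untouched and returns a view, so the call only carries information for symbolic shape propagation.

```python
import torch

# Plain-tensor illustration (not the DTensor internals): narrow with start=0
# and length equal to the current size leaves the shape unchanged.
x = torch.randn(2, 8)
y = x.narrow(1, 0, x.size(1))

assert y.shape == x.shape            # real shape is unchanged
assert y.data_ptr() == x.data_ptr()  # same storage: a view, not a copy
```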
## Summary

We turned compile for the generator off in #2638 due to a conflict with DTensor and symbolic propagation. We fix this in pytorch/pytorch#178210, so re-enable this config (once it lands in nightly).

## Test

```bash
python torchtitan/experiments/rl/simple_grpo_sum_digits.py --module rl --config rl_grpo_qwen3_0_6b --hf_assets_path=torchtitan/experiments/rl/example_checkpoint/Qwen3-0.6B
```
Reverts #2710. Need to wait another day for nightly to pick up pytorch/pytorch#178210.

Co-authored-by: Jiani Wang <jianiwangw@gmail.com>
Fixes #175690

## Summary

We discovered an error in torchtitan related to symbolic sequence length - namely, when doing piecewise cudagraphs with batch size 1, seqlen 1, and sharding 2, we found that in `to_local()` we incorrectly created an extra symbol that we could not simplify (with a size like `2 * (s27 // 2)`). To fix this error we could narrow in user code, but the more proper fix is to do it here in `to_local()` itself, where we can correctly fix the symbolic shape.

## Test plan

Added 2 unit tests

Authored with claude

Pull Request resolved: #178210
Approved by: https://github.com/wconstab
Co-authored-by: Xia-Weiwen <12522207+Xia-Weiwen@users.noreply.github.com>
* Work around upstream PyTorch view op assertion in propagation rules

  PyTorch PR pytorch/pytorch#177973 fixed a latent bug where the comparison `shard.dim == in_dim` was always False (comparing an int to an InputDim), making the double-shard submesh_size validation in `propagate_shape_and_sharding` dead code. The fix activated that validation, which now raises an AssertionError when a sharded dimension is not divisible by the mesh size (e.g. unflattening nheads=48 on a mesh dim of size 32). This is overly strict for our use case - we call `propagate_shape_and_sharding` with `strict_view=False` to enumerate all possible sharding strategies, and incompatible ones should be silently skipped rather than raising. We now catch the AssertionError and skip the iteration, since the replicated variant is already covered by another strategy. Authored with Claude.

* Fix compatibility with latest PyTorch nightly by replacing removed pointwise_strategy

  PyTorch removed `pointwise_strategy` from `torch.distributed.tensor._ops._pointwise_ops` in pytorch/pytorch#177208 as part of a dead code cleanup. Our custom `native_layer_norm` and `native_layer_norm_backward` rules imported this function, causing an ImportError on PyTorch nightlies after 2025-03-23. This replaces the import with a local `_pointwise_strategy` that recomposes the same behavior from PyTorch's new single-dim strategy primitives (`_common_pointwise_single_dim_strategy`, `_fill_single_dim_strategy_placeholders`, `expand_to_full_mesh_op_strategy`). Authored with Claude.

* Bugfix

  The issue was that `_get_unique_placements` from PyTorch's single-dim strategy path asserts that each `OpStrategy` has exactly 1 strategy - but in autoparallel, input `OpStrategy` objects have multiple strategies (all possible placements). Replaced it with an inline loop that collects unique placements from all strategies across all inputs (see the sketch after this list).

* One more fix

  The fix: `_common_pointwise_single_dim_strategy` produces `num_outputs` output entries per strategy row (e.g. 3 for `native_layer_norm`'s out/mean/rstd). Since all pointwise outputs share the same placement, we collapse to a single output entry before passing to `expand_to_full_mesh_op_strategy` with `input_index=1`. This way `strategy.output_specs` is a single `DTensorSpec`, matching the old `pointwise_strategy` behavior - the callers then construct the per-output specs (out, mean, rstd) themselves.

* Fix stale tensor_meta when reusing sharding placements across mesh configurations in example_dcp

  When reconstructing sharding placements from the serialized map for a different mesh, we were only updating `.mesh` on each `DTensorSpec` but leaving `tensor_meta` unchanged. Since the batch size scales with the mesh (`bs = 8 * mesh.shape[0]`), the tensor shapes differ between the two phases (256 vs 16 on dim 0), so the specs carried stale shapes from the original configuration. This was previously harmless - PyTorch's `_maybe_unpad_tensor` only used `logical_dim_size` (derived from `tensor_meta.shape`) to compute the unpad amount, and with even sharding that was a no-op. PyTorch PR pytorch/pytorch#178210 added a `torch._check(orig_size >= logical_dim_size)` assertion that now catches the mismatch. Fix: pull the correct `tensor_meta` from the new AutoParallel instance's computed strategies, which reflect the actual shapes for the current mesh and batch size. Authored with Claude.
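The "inline loop" described in the bugfix above might look roughly like the following sketch. The attribute names (`strategies`, `output_spec`, `placements`) follow DTensor's `OpStrategy` conventions, but the helper itself is illustrative, not autoparallel's actual code:

```python
# Hedged sketch: collect unique placements across *all* strategies of *all*
# input OpStrategy objects, instead of asserting that each input OpStrategy
# holds exactly one strategy. Attribute names are assumptions based on
# DTensor conventions; this is not autoparallel's real implementation.
def collect_unique_placements(input_strategies):
    unique = []
    for op_strategy in input_strategies:      # one OpStrategy per input
        for spec in op_strategy.strategies:   # all candidate placements
            placements = spec.output_spec.placements
            if placements not in unique:
                unique.append(placements)
    return unique
```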
## Summary

This PR reapplies #2710

## Test plan

PREREQ: ensure pytorch/pytorch#178210 is in your torch version

```bash
python torchtitan/experiments/rl/simple_grpo_sum_digits.py --module rl --config rl_grpo_qwen3_0_6b --hf_assets_path=torchtitan/experiments/rl/example_checkpoint/Qwen3-0.6B
```

With both compiles on, we expect a 4x speedup over eager (timed from ~400s e2e to ~100s for 10 steps)

```bash
torchrun --nproc_per_node=2 \
    torchtitan/experiments/rl/tests/test_bitwise_identity.py
```

For numerics, this results in:

```
Trainer computed 30 token log-probs
vLLM log-probs[:5]:    [-4.129570484161377, -1.795021891593933, -0.71578049659729, -0.2110116183757782, -0.9374725222587585]
Trainer log-probs[:5]: [-4.129570484161377, -1.795021891593933, -0.71578049659729, -0.2110116183757782, -0.9374725222587585]
============================================================
LOGPROB COMPARISON RESULTS
============================================================
Bitwise identical : True
Tokens checked    : 30
Tokens different  : 0
Max delta         : 0.000000e+00
Avg delta         : 0.000000e+00
Diff mean         : 0.000000e+00
Diff max          : 0.000000e+00
============================================================
PASS: vLLM and trainer log-probs are bitwise identical.
/home/lucaskabela/.conda/envs/pytorch_build/lib/python3.10
```
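As an aside, "bitwise identical" is a stronger claim than float `==` (which treats `0.0` and `-0.0` as equal and NaN as unequal to itself). A minimal, hypothetical way to check it - not necessarily what `test_bitwise_identity.py` does - is to compare the raw IEEE-754 bit patterns:

```python
import struct

def bitwise_equal(a: float, b: float) -> bool:
    # Compare the raw IEEE-754 double bit patterns, so 0.0 vs -0.0 and
    # differing NaN payloads are reported as different.
    return struct.pack("<d", a) == struct.pack("<d", b)

assert bitwise_equal(-4.129570484161377, -4.129570484161377)
assert not bitwise_equal(0.0, -0.0)  # float == would call these equal
```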
Fixes #175690

## Summary

We discovered an error in torchtitan related to symbolic sequence length - namely, when doing piecewise cudagraphs with batch size 1, seqlen 1, and sharding 2, we found that in `to_local()` we incorrectly created an extra symbol that we could not simplify (with a size like `2 * (s27 // 2)`; see the short sketch at the end of this description for why). To fix this error we could narrow in user code, but the more proper fix is to do it here in `to_local()` itself, where we can correctly fix the symbolic shape.

## Test plan
Added 2 unit tests
Authored with claude
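For intuition, a tiny standalone example (with a hypothetical concrete value, not taken from the tests) of why `2 * (s27 // 2)` cannot be simplified back to `s27` unless `s27` is known to be even:

```python
# Floor division loses information unless divisibility is known, so the
# symbolic size 2 * (s27 // 2) is not equal to s27 in general.
s27 = 5                     # hypothetical odd sequence length
assert 2 * (s27 // 2) == 4  # not 5: equality holds only for even s27
```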
cc @ezyang @penguinwu @bobrenjc93 @aditvenk @laithsakka