[DTensor][Export] Support exporting a model with DTensor params/inputs #163609
SherlockNoMad wants to merge 15 commits into main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163609
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 949ea83 with merge base 8701f18. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
The interactions with subsystems I'm not familiar with are a question mark to me, but the parts I do understand look fine.
```python
def strict_export_and_aot_export_joint_with_descriptors(model, inputs):
    # install_free_tensors is required for dynamo to work
    with torch._dynamo.config.patch(install_free_tensors=True):
```
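The `install_free_tensors` flag is flipped via a scoped config patch so it never leaks outside the export call. As a rough illustration of that pattern, here is a minimal pure-Python sketch (the `Config` class below is a hypothetical stand-in, not the actual `torch._dynamo.config` implementation):

```python
from contextlib import contextmanager

# Hypothetical stand-in for a dynamo-style config object; the real
# torch._dynamo.config.patch() provides a similar scoped override.
class Config:
    install_free_tensors = False

    @classmethod
    @contextmanager
    def patch(cls, **overrides):
        # Save current values, apply the overrides, and restore on exit
        # (even if the body raises), so the flag never leaks globally.
        saved = {k: getattr(cls, k) for k in overrides}
        for k, v in overrides.items():
            setattr(cls, k, v)
        try:
            yield
        finally:
            for k, v in saved.items():
                setattr(cls, k, v)

with Config.patch(install_free_tensors=True):
    inside = Config.install_free_tensors   # flag is on inside the scope
after = Config.install_free_tensors        # restored to the default after
```

The try/finally is what makes this safe to use inside test helpers: the global default is restored even when the traced function raises.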
I think you also need to turn on inline_inbuilt_nn_modules.
it's passing strict export without this patch
We want to turn this flag on very soon (#163921), and it could fix various silent bugs. So I think we should always have it on if downstream users end up using this API, because it might take some reverts/time to properly land the mentioned PR.
As title. One issue was that our fake mode detection didn't understand DTensor. RFC because:
- I'm a DTensor noob, so I don't know if this is the right way to use DTensor.
- I don't like making torch/_guards.py aware of DTensor; looking for suggestions on alternative ways to structure the code.
```python
pytree.register_constant(DTensorSpec)

# TODO: Having DTensorSpec in pytree causes issue with tensor_parallel_transformation
# Need to understand the interaction here
```
Something weird is going on here.
Also, I don't need torch.utils._pytree.register_constant(DTensorSpec) to make _dynamo_graph_capture_for_export pass.
It's just needed for strict export.
Alright, I reverted pytree.register_constant(DTensorSpec).
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This is an e2e prototype to run llama3-simplefsdp using an export-y aot_autograd workflow. Setup: shard_dp = 2, tp = 4.

MVP
- [Done] Start with a SimpleFSDP model, enable TP + FSDP
- [Done] Apply [aot_export_joint_with_descriptors](pytorch/pytorch#163609) on the parallelized module with DTensor inputs to get the joint graph
- [Done] Apply min_cut_partitioner to get the forward and backward graph modules
- [Done, but needs verification] Apply prefetch/bucketing graph passes on fw_gm and bw_gm to reorder/group the communication collectives
- [Done] Run the joint graph with `aot_compile_joint_with_descriptors`
- [Done] Regional Inductor for FlexAttention; needs to run on top of pytorch/pytorch#165202 and pytorch/pytorch#164776

Next Steps
- Enable CUDAGraph
- Enable SimpleFSDP + EP
- Showcase user annotation on MoE for dispatch, compute, and combine regions
- Enable PP with a custom Runner

Issues
- pytorch/pytorch#164559
- pytorch/pytorch#164543
- What's the input order for the aot_export_joint graph? Using model.parameters()'s order as the input order seems wrong.

Repro steps:

NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

Run with FlexAttention:

NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn

Sample output: P1975157784: rank0_autograd_function_0fea2786.py P1975158481: rank1_autograd_function_28587623.py

---------

Co-authored-by: Simon Fan <xmfan@meta.com>
register_constant(DTensorSpec) in the export test helper was permanently modifying global pytree state, causing subsequent compiled DTensor tests to fail. With DTensorSpec registered as a pytree constant, dynamo no longer decomposes glu into simpler ops that have sharding strategies, so aten.glu.default gets passed through to DTensor dispatch, which can't handle it. The fix wraps the registration in try/finally to deregister after use. Introduced in PR #163609. Authored with Claude. Pull Request resolved: #176128. Approved by: https://github.com/SherlockNoMad
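The shape of that fix — scoped mutation of a global registry with guaranteed cleanup — can be sketched in plain Python. The names below (`REGISTRY`, `register_constant`, `deregister_constant`, the placeholder `DTensorSpec`) are illustrative stand-ins for torch.utils._pytree's actual registration API, not the real implementation:

```python
# Global registry standing in for pytree's constant-type registrations.
REGISTRY = {}

def register_constant(cls):
    REGISTRY[cls] = True

def deregister_constant(cls):
    REGISTRY.pop(cls, None)

class DTensorSpec:  # placeholder for the real DTensorSpec class
    pass

def run_export_test():
    # Register only for the duration of this test helper.
    register_constant(DTensorSpec)
    try:
        # ... export logic that relies on DTensorSpec being a pytree constant ...
        return DTensorSpec in REGISTRY
    finally:
        # Restore global state so later tests see the default decomposition
        # behavior, even if the body above raised.
        deregister_constant(DTensorSpec)

during = run_export_test()          # registered while the test body runs
after = DTensorSpec in REGISTRY     # cleaned up afterwards
```

Without the finally clause, an exception in the test body would leave the registration behind, which is exactly the cross-test pollution the fix addresses.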
I experimented with 3 paths to get the joint graph for a DTensorized module and inputs,
and added tests to guard them.
Path 1 doesn't work, as the backward graph region is missing from the joint graph.
I am leaning towards making path 2 the recommended path.
If path 2 doesn't work going forward, we can fall back to path 3.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @ezyang @msaroufim @dcci