[Compiler Toolkit] JointGraph-based Training Prototype for llama3 #1794

Merged

SherlockNoMad merged 19 commits into main on Oct 28, 2025
Conversation
tianyu-l requested changes on Oct 9, 2025

Contributor

tianyu-l left a comment:

Is this for exploration purposes? If so, I'd suggest we work in a branch or fork.
SherlockNoMad commented on Oct 10, 2025

torchtitan/experiments/joint_graph_runner/llama3/parallelize.py (outdated; resolved)
Contributor

cc @bobrenjc93 @aorenste, this might be a good way to look at compile on one rank, perhaps?
Contributor, Author

Yes, I have examples of how the graphs differ across ranks. P1975157784: rank0_autograd_function_0fea2786.py
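A hedged sketch of one way to produce such per-rank dumps. The `dump_graph` helper and the file naming are illustrative stand-ins, not the PR's actual mechanism; only `GraphModule.print_readable` and `torch.distributed.get_rank` are real APIs here.

```python
import torch.distributed as dist
from torch.fx import GraphModule

def dump_graph(gm: GraphModule, tag: str = "autograd_function") -> None:
    """Write the readable source of a traced graph to a per-rank file.

    Illustrative helper (not from the PR): under TP/FSDP each rank traces
    its own shard of the computation, so the graphs can genuinely differ.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    # print_readable(print_output=False) returns the source instead of printing it.
    src = gm.print_readable(print_output=False)
    with open(f"rank{rank}_{tag}.py", "w") as f:
        f.write(src)
```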
fegin reviewed on Oct 17, 2025
Force-pushed 84beccf to 8f91a91
SherlockNoMad commented on Oct 27, 2025

Force-pushed 8f91a91 to 38a37fd

Force-pushed 38a37fd to 1af97eb
xmfan approved these changes on Oct 27, 2025

tianyu-l approved these changes on Oct 27, 2025

yiming0416 reviewed on Oct 28, 2025
**SimpleFSDP + TP + EP**

```diff
-NGPU=4 CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.deepseek_v3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=2 --parallelism.expert_parallel_degree=2 --activation_checkpoint.mode none
+NGPU=4 CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.deepseek_v3 --compile.enable --compile.backend "aot_eager" --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=2 --parallelism.expert_parallel_degree=2 --activation_checkpoint.mode none
```
Contributor

@SherlockNoMad I think we should remove --compile.enable as well as --compile.backend here. Since we are applying the customized CompiledModule, we don't really want to torch.compile() the model.
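To illustrate the reviewer's point, a hypothetical sketch: `CompiledModule` below is a stand-in for torchtitan's compiler_toolkit wrapper, whose real API may differ. The wrapper is meant to own its own pre-exported forward/backward graphs, so layering torch.compile on top via --compile.enable would compile the model a second time.

```python
import torch
import torch.nn as nn

class CompiledModule(nn.Module):
    """Hypothetical stand-in for the compiler_toolkit wrapper.

    The real class dispatches to pre-exported, partitioned fw/bw
    GraphModules; forwarding to the inner module keeps this runnable.
    """
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, *args, **kwargs):
        return self.inner(*args, **kwargs)

model = CompiledModule(nn.Linear(8, 8))
# Redundant second compilation that --compile.enable would add on top:
# model = torch.compile(model)
out = model(torch.randn(2, 8))
```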
yiming0416 reviewed on Oct 28, 2025
jquesnelle pushed a commit to NousResearch/torchtitan that referenced this pull request on Nov 10, 2025: [Compiler Toolkit] JointGraph-based Training Prototype for llama3 (pytorch#1794). The commit message mirrors the PR description below (co-authored by Simon Fan <xmfan@meta.com>).
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request on Feb 13, 2026
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request on Feb 25, 2026
Description

This is an e2e prototype to run llama3-simplefsdp using an export-style aot_autograd workflow.
Setup: shard_dp = 2, tp = 4.
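For reference, the shard_dp = 2, tp = 4 setup corresponds to a 2x4 DeviceMesh. A minimal illustration follows; torchtitan builds its mesh internally, and the dimension names here are assumptions.

```python
# Requires an initialized 8-rank process group; illustrative only.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_shard", "tp"))
dp_mesh, tp_mesh = mesh["dp_shard"], mesh["tp"]
```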
MVP

- [Done] Start with a SimpleFSDP model, enable TP + FSDP
- [Done] Apply [aot_export_joint_with_descriptors](pytorch/pytorch#163609) on the parallelized module with DTensor inputs to get the joint graph
- [Done] Apply min_cut_partitioner to get the forward and backward graph modules (see the sketch after these lists)
- [Done, needs verification] Apply prefetch/bucketing graph passes on fw_gm and bw_gm to reorder and group the communication collectives
- [Done] Run the joint graph with `aot_compile_joint_with_descriptors`
- [Done] Regional Inductor for FlexAttention; needs to run on top of pytorch/pytorch#165202 and pytorch/pytorch#164776

Next Steps

- Enable CUDAGraph
- Enable SimpleFSDP + EP
- Showcase user annotation on MoE for the dispatch, compute, and combine regions
- Enable PP with a custom Runner

Issues

- pytorch/pytorch#164559
- pytorch/pytorch#164543
- What's the input order for the aot_export_joint graph? Using the order of model.parameters() as the input order seems wrong.
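A minimal, single-rank sketch of the export-style joint-graph flow above. The PR itself uses aot_export_joint_with_descriptors / aot_compile_joint_with_descriptors from pytorch/pytorch#163609 on a DTensor-parallelized module; this sketch substitutes the older aot_export_module plus the min-cut partitioner to show the same pipeline shape, and module paths and signatures may drift across PyTorch versions.

```python
import torch
import torch.nn as nn
from torch._functorch.aot_autograd import aot_export_module
from torch._functorch.partitioners import min_cut_rematerialization_partitioner

class WithLoss(nn.Module):
    """Joint export needs a scalar loss output; output_loss_index points at it."""
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        return self.inner(x).sum()

model = WithLoss(nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1)))
x = torch.randn(4, 16)

# 1) Export one joint forward+backward graph; params/buffers become graph inputs.
joint_gm, sig = aot_export_module(model, (x,), trace_joint=True, output_loss_index=0)
joint_gm.print_readable()

# 2) Partition the joint graph into forward and backward GraphModules; the
#    min-cut partitioner decides what to save vs. recompute for backward.
#    Assembling the (primals, tangents) joint inputs by hand is what the
#    descriptor-based APIs in the PR automate, so it is elided here:
# fw_gm, bw_gm = min_cut_rematerialization_partitioner(
#     joint_gm, joint_inputs, num_fwd_outputs=1)
```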
Repro steps:

```shell
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4
```

Run with FlexAttention:

```shell
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn
```
Sample output:
P1975157784: rank0_autograd_function_0fea2786.py
P1975158481: rank1_autograd_function_28587623.py