[Compiler Toolkit] JointGraph-based Training Prototype for llama3 #1794

Merged

SherlockNoMad merged 19 commits into main on Oct 28, 2025
Conversation
tianyu-l requested changes on Oct 9, 2025

Contributor

tianyu-l left a comment:

Is this for exploration purposes? If so, I'd suggest we work in a branch or fork.
SherlockNoMad commented on Oct 10, 2025

torchtitan/experiments/joint_graph_runner/llama3/parallelize.py (outdated; resolved)
Contributor

cc @bobrenjc93 @aorenste, this might be a good way to look at compile on one rank, perhaps?
Contributor, Author

Yes, I have examples of how the graphs differ across ranks. P1975157784: rank0_autograd_function_0fea2786.py
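A hedged sketch of one way to produce such per-rank dumps. The `dump_graph` helper and the file naming are illustrative stand-ins, not the PR's actual mechanism; only `GraphModule.print_readable` and `torch.distributed.get_rank` are real APIs here.

```python
import torch.distributed as dist
from torch.fx import GraphModule

def dump_graph(gm: GraphModule, tag: str = "autograd_function") -> None:
    """Write the readable source of a traced graph to a per-rank file.

    Illustrative helper (not from the PR): under TP/FSDP each rank traces
    its own shard of the computation, so the graphs can genuinely differ.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0
    # print_readable(print_output=False) returns the source instead of printing it.
    src = gm.print_readable(print_output=False)
    with open(f"rank{rank}_{tag}.py", "w") as f:
        f.write(src)
```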
fegin reviewed on Oct 17, 2025
Force-pushed 84beccf to 8f91a91
SherlockNoMad commented on Oct 27, 2025

Force-pushed 8f91a91 to 38a37fd

Force-pushed 38a37fd to 1af97eb
xmfan approved these changes on Oct 27, 2025

tianyu-l approved these changes on Oct 27, 2025

yiming0416 reviewed on Oct 28, 2025
**SimpleFSDP + TP + EP**

```diff
-NGPU=4 CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.deepseek_v3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=2 --parallelism.expert_parallel_degree=2 --activation_checkpoint.mode none
+NGPU=4 CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.deepseek_v3 --compile.enable --compile.backend "aot_eager" --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=2 --parallelism.expert_parallel_degree=2 --activation_checkpoint.mode none
```
Contributor

@SherlockNoMad I think we should remove --compile.enable as well as --compile.backend here. Since we are applying the customized CompiledModule, we don't really want to torch.compile() the model.
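To illustrate the reviewer's point, a hypothetical sketch: `CompiledModule` below is a stand-in for torchtitan's compiler_toolkit wrapper, whose real API may differ. The wrapper is meant to own its own pre-exported forward/backward graphs, so layering torch.compile on top via --compile.enable would compile the model a second time.

```python
import torch
import torch.nn as nn

class CompiledModule(nn.Module):
    """Hypothetical stand-in for the compiler_toolkit wrapper.

    The real class dispatches to pre-exported, partitioned fw/bw
    GraphModules; forwarding to the inner module keeps this runnable.
    """
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, *args, **kwargs):
        return self.inner(*args, **kwargs)

model = CompiledModule(nn.Linear(8, 8))
# Redundant second compilation that --compile.enable would add on top:
# model = torch.compile(model)
out = model(torch.randn(2, 8))
```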
yiming0416 reviewed on Oct 28, 2025
jquesnelle pushed a commit to NousResearch/torchtitan that referenced this pull request on Nov 10, 2025: [Compiler Toolkit] JointGraph-based Training Prototype for llama3 (pytorch#1794). The commit message mirrors the PR description below (co-authored by Simon Fan <xmfan@meta.com>).
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request on Feb 13, 2026
xrsrke pushed a commit to NousResearch/torchtitan that referenced this pull request on Feb 25, 2026
Description

This is an e2e prototype to run llama3-simplefsdp using an export-style aot_autograd workflow.
Setup: shard_dp = 2, tp = 4.
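For reference, the shard_dp = 2, tp = 4 setup corresponds to a 2x4 DeviceMesh. A minimal illustration follows; torchtitan builds its mesh internally, and the dimension names here are assumptions.

```python
# Requires an initialized 8-rank process group; illustrative only.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp_shard", "tp"))
dp_mesh, tp_mesh = mesh["dp_shard"], mesh["tp"]
```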
MVP

- [Done] Start with a SimpleFSDP model, enable TP + FSDP
- [Done] Apply [aot_export_joint_with_descriptors](pytorch/pytorch#163609) on the parallelized module with DTensor inputs to get the joint graph
- [Done] Apply min_cut_partitioner to get the forward and backward graph modules (see the sketch after these lists)
- [Done, needs verification] Apply prefetch/bucketing graph passes on fw_gm and bw_gm to reorder and group the communication collectives
- [Done] Run the joint graph with `aot_compile_joint_with_descriptors`
- [Done] Regional Inductor for FlexAttention; needs to run on top of pytorch/pytorch#165202 and pytorch/pytorch#164776

Next Steps

- Enable CUDAGraph
- Enable SimpleFSDP + EP
- Showcase user annotation on MoE for the dispatch, compute, and combine regions
- Enable PP with a custom Runner

Issues

- pytorch/pytorch#164559
- pytorch/pytorch#164543
- What's the input order for the aot_export_joint graph? Using the order of model.parameters() as the input order seems wrong.
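A minimal, single-rank sketch of the export-style joint-graph flow above. The PR itself uses aot_export_joint_with_descriptors / aot_compile_joint_with_descriptors from pytorch/pytorch#163609 on a DTensor-parallelized module; this sketch substitutes the older aot_export_module plus the min-cut partitioner to show the same pipeline shape, and module paths and signatures may drift across PyTorch versions.

```python
import torch
import torch.nn as nn
from torch._functorch.aot_autograd import aot_export_module
from torch._functorch.partitioners import min_cut_rematerialization_partitioner

class WithLoss(nn.Module):
    """Joint export needs a scalar loss output; output_loss_index points at it."""
    def __init__(self, inner: nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        return self.inner(x).sum()

model = WithLoss(nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1)))
x = torch.randn(4, 16)

# 1) Export one joint forward+backward graph; params/buffers become graph inputs.
joint_gm, sig = aot_export_module(model, (x,), trace_joint=True, output_loss_index=0)
joint_gm.print_readable()

# 2) Partition the joint graph into forward and backward GraphModules; the
#    min-cut partitioner decides what to save vs. recompute for backward.
#    Assembling the (primals, tangents) joint inputs by hand is what the
#    descriptor-based APIs in the PR automate, so it is elided here:
# fw_gm, bw_gm = min_cut_rematerialization_partitioner(
#     joint_gm, joint_inputs, num_fwd_outputs=1)
```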
Repro steps:

```shell
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4
```

Run with FlexAttention:

```shell
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn
```
Sample output:
P1975157784: rank0_autograd_function_0fea2786.py
P1975158481: rank1_autograd_function_28587623.py