[annotate] Annotation should be mapped across submod #165202
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/165202

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 676774b with merge base 37d57ac:

FLAKY - The following job failed but was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 6ef57a6 to fc75cff (Compare)
SherlockNoMad left a comment:

LGTM with minor comments.

`get_custom_metadata` doesn't seem useful to users; let's make it private.
Force-pushed from a0c8704 to 431e6b8 (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: Command. Details for Dev Infra team: raised by workflow job.
Force-pushed from c82aa30 to e1e6e87 (Compare)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The match for backward nodes might be in a different submod, so we should check all submods for potential matches. In flex attention, this can happen if `mask_mod` has operations (such as index) that increase the seq_nr of the forward graph nodes; the backward flex_attention nodes then cannot find a match in their own subgraph.

```
python test/functorch/test_aot_joint_with_descriptors.py -k preserve_annotate
```

Also tested on the torchtitan joint_graph_runner branch. The flex_attention backward nodes are annotated now.

```
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" LOG_RANK=0 TRAIN_FILE="torchtitan.train" TORCHFT_LIGHTHOUSE="http://localhost:29510" PYTORCH_ALLOC_CONF="expandable_segments:True" torchrun --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint="localhost:0" --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/llama3/train_configs/debug_model.toml --model.name joint_graph_runner.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn
```

Pull Request resolved: pytorch#165202
Approved by: https://github.com/SherlockNoMad
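The fix described above can be sketched in plain Python. This is a simplified model, not the real AOTAutograd implementation: graph nodes are plain dicts, and the `seq_nr`/annotation names follow the PR description while everything else (the function name, the dict layout) is invented for illustration.

```python
# Hypothetical sketch: propagate forward-node annotations to backward
# nodes by seq_nr, searching across ALL submods instead of only the
# backward node's own submod (the bug this PR fixes).

def propagate_annotations(submods):
    """submods maps submod name -> list of node dicts."""
    # Index every annotated forward node by seq_nr, across all submods.
    fwd_by_seq_nr = {}
    for nodes in submods.values():
        for node in nodes:
            if node["is_forward"] and "annotation" in node:
                fwd_by_seq_nr[node["seq_nr"]] = node["annotation"]

    # Annotate backward nodes from the global index, so a match in a
    # different submod is still found.
    for nodes in submods.values():
        for node in nodes:
            if not node["is_forward"]:
                ann = fwd_by_seq_nr.get(node["seq_nr"])
                if ann is not None:
                    node["annotation"] = ann


submods = {
    "submod_0": [
        {"is_forward": True, "seq_nr": 7, "annotation": "flex_attention"},
    ],
    "submod_1": [
        # Backward node whose forward match lives in a *different* submod.
        {"is_forward": False, "seq_nr": 7},
    ],
}
propagate_annotations(submods)
print(submods["submod_1"][0]["annotation"])  # → flex_attention
```

A per-submod search would leave the node in `submod_1` unannotated; the global index is what makes the cross-submod match work.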
This is an e2e prototype to run llama3-simplefsdp using an export-style aot_autograd workflow. Setup: shard_dp = 2, tp = 4.

MVP
- [Done] Start with a SimpleFSDP model, enable TP + FSDP
- [Done] Apply [aot_export_joint_with_descriptors](pytorch/pytorch#163609) on the parallelized module with DTensor inputs to get the joint graph
- [Done] Apply min_cut_partitioner to get the forward and backward graph modules
- [Done, but needs verification] Apply prefetch/bucketing graph passes on fw_gm and bw_gm to reorder/group the communication collectives
- [Done] Run the joint graph with `aot_compile_joint_with_descriptors`
- [Done] Regional Inductor for FlexAttention; needs to run on top of pytorch/pytorch#165202 and pytorch/pytorch#164776

Next Steps
- Enable CudaGraph
- Enable SimpleFSDP + EP
- Showcase user annotation on MoE for the dispatch, compute, and combine regions
- Enable PP with a custom Runner

Issues
- pytorch/pytorch#164559
- pytorch/pytorch#164543
- What is the input order for the aot_export_joint graph? Using model.parameters()'s order as the input order seems wrong.

Repro steps:
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

Run with FlexAttention:
NGPU=8 CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --model.flavor=debugmodel_flex_attn

Sample output:
P1975157784: rank0_autograd_function_0fea2786.py
P1975158481: rank1_autograd_function_28587623.py

Co-authored-by: Simon Fan <xmfan@meta.com>
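The MVP steps (export joint graph, partition, bucket collectives) can be sketched as a toy end-to-end pipeline. Note the stage names here only mirror the real torch APIs (aot_export_joint_with_descriptors, min_cut_partitioner, the prefetch/bucketing passes); the signatures and the list-of-dicts "graph" representation are invented stand-ins, not the actual implementation.

```python
# Toy model of the pipeline: a "joint graph" is a list of node dicts.

def export_joint(model_ops):
    # Stand-in for aot_export_joint_with_descriptors: produce one joint
    # graph containing each op plus its backward counterpart.
    joint = []
    for op in model_ops:
        joint.append({"op": op, "fwd": True})
        joint.append({"op": f"{op}_backward", "fwd": False})
    return joint

def partition(joint):
    # Stand-in for min_cut_partitioner: split the joint graph into
    # forward and backward graph modules.
    fw_gm = [n for n in joint if n["fwd"]]
    bw_gm = [n for n in joint if not n["fwd"]]
    return fw_gm, bw_gm

def bucket_collectives(gm):
    # Stand-in for the prefetch/bucketing passes: hoist communication
    # collectives (all_gather here) ahead of compute so they can be
    # grouped and overlapped.
    comm = [n for n in gm if "all_gather" in n["op"]]
    compute = [n for n in gm if "all_gather" not in n["op"]]
    return comm + compute

joint = export_joint(["matmul", "all_gather_w", "flex_attention"])
fw_gm, bw_gm = partition(joint)
fw_gm = bucket_collectives(fw_gm)
print([n["op"] for n in fw_gm])
# → ['all_gather_w', 'matmul', 'flex_attention']
```

The point of the sketch is the data flow: a single joint graph is produced first, partitioned second, and only then rewritten by communication passes, which is why the passes run on fw_gm and bw_gm rather than on the joint graph.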
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @Lucaskabela