
[Helion + torch.compile] Add ExternalTritonTemplateKernel for external template prologue/epilogue fusion#176571

Closed
yf225 wants to merge 2 commits into main from helion_inductor_fusion_pr1

Conversation


@yf225 yf225 commented Mar 5, 2026

Add support for external template backends (e.g. Helion) to fuse prologue and epilogue pointwise ops into their kernels via Inductor's existing fusion infrastructure.
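As background, prologue/epilogue fusion means pointwise ops on a template's inputs and outputs execute inside the template's own loop instead of as separate kernels. A toy Python sketch of the idea (all names here are illustrative, not Inductor APIs):

```python
import math

def matmul_template(a, b, prologue=None, epilogue=None):
    """Toy 'template kernel': a matmul whose input loads and output stores
    can have pointwise ops fused in, mimicking how Inductor splices
    prologue/epilogue subgraphs into a Triton template's load/store sites."""
    prologue = prologue or (lambda x: x)
    epilogue = epilogue or (lambda x: x)
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # prologue fused into the load of `a`
                acc += prologue(a[i][p]) * b[p][j]
            # epilogue fused into the store of the output
            out[i][j] = epilogue(acc)
    return out

a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
# fused: exp prologue on `a`, relu epilogue on the result -- one pass
fused = matmul_template(a, b, prologue=math.exp, epilogue=lambda x: max(x, 0.0))
# unfused reference: three separate passes over memory
pre = [[math.exp(x) for x in row] for row in a]
ref = [[max(sum(pre[i][p] * b[p][j] for p in range(2)), 0.0)
        for j in range(2)] for i in range(2)]
```

The fused version touches each input/output element once, which is the memory-traffic win the fusion infrastructure is after.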

Template Codegen Lifecycle
==========================

_codegen_single_template (simd.py)
├── Build prologue groups from prologue_nodes
├── Remove prologue-fused inputs from kernel.args.input_buffers
├── kernel.codegen_template_body(...)  →  dispatches by kernel type:
│   │
│   ├── [Standard] TritonTemplateKernel
│   │   ├── with self:
│   │   │   ├── render()  →  PartialRender with hook placeholders
│   │   │   ├── Codegen ALL epilogues into each store subgraph
│   │   │   └── codegen_prologues_in_subgraphs
│   │   └── Finalize hooks (<DEF_KERNEL>, <ARGDEFS>, <LOAD_INPUT_*>, <STORE_OUTPUT_*>)
│   │   └── return src_code
│   │
│   └── [External] ExternalTritonTemplateKernel
│       ├── Build prologue list from prologue groups
│       ├── _find_eligible_epilogues  →  epilogues reading exactly 1 template output
│       ├── Compute _unfused_epilogues  →  everything else (non-MultiOutput)
│       ├── Build prologue_sources dict: buf_name → source_bufs
│       ├── with self:
│       │   ├── _setup_epilogue_hook  →  one per epilogue-fusable output
│       │   ├── _setup_prologue_hook  →  one per named input with a prologue
│       │   ├── Codegen each eligible epilogue into its store subgraph
│       │   └── codegen_prologues_in_subgraphs
│       ├── _build_fusion_spec  →  TemplateFusionSpec
│       ├── tb.fuse(spec)  →  TemplateFusionOutput (backend splices into Triton AST)
│       ├── Finalize hook placeholders (_STORE_OUTPUT_*, _LOAD_INPUT_*)
│       └── return src_code
│
├── kernel.get_unfused_epilogues()
│   ├── [Standard]  →  []
│   └── [External]  →  self._unfused_epilogues
│
├── mark_run on template_node, fused epilogues, prologues
├── define_kernel(src_code)
└── return kernel

call_kernel
├── [Standard] Emit kernel call + workspace deallocation
└── [External] Emit kernel call + multi-output unpacking + codegen_node for each unfused epilogue
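The eligibility split in the [External] path above can be sketched as follows (a simplified model; the node and class names are hypothetical stand-ins, not the actual Inductor scheduler types):

```python
from dataclasses import dataclass

@dataclass
class EpilogueNode:
    """Toy stand-in for a scheduler node that reads template outputs."""
    name: str
    reads: frozenset            # names of template output buffers this node reads
    is_multi_output: bool = False  # stand-in for ir.MultiOutput unpacking nodes

def partition_epilogues(epilogues, template_outputs):
    """Mimics the _find_eligible_epilogues idea: an epilogue is fusable only
    if it reads exactly one template output; everything else (except the
    MultiOutput unpacking nodes themselves) stays unfused and is codegen'd
    as a separate kernel after the template call."""
    eligible, unfused = [], []
    for node in epilogues:
        read_outputs = node.reads & template_outputs
        if len(read_outputs) == 1:
            eligible.append(node)
        elif not node.is_multi_output:
            unfused.append(node)
    return eligible, unfused

outs = frozenset({"buf0", "buf1"})
eps = [
    EpilogueNode("relu", frozenset({"buf0"})),         # reads 1 output -> fusable
    EpilogueNode("add", frozenset({"buf0", "buf1"})),  # reads 2 outputs -> unfused
    EpilogueNode("getitem", frozenset({"buf0", "buf1"}),
                 is_multi_output=True),                # MultiOutput -> excluded from both
]
eligible, unfused = partition_epilogues(eps, outs)
```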

Helion-side changes are in pytorch/helion#1520.
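The spec/fuse handshake between Inductor and the external backend (the `_build_fusion_spec` and `tb.fuse(spec)` steps above) might look roughly like this; the dataclass fields and the textual splicing are illustrative guesses at the shape of the contract, not the actual definitions (a real backend splices into its Triton AST, not strings):

```python
from dataclasses import dataclass

@dataclass
class TemplateFusionSpec:
    """What Inductor hands the external backend: epilogue source per
    output buffer and prologue source per input buffer (hypothetical)."""
    epilogue_sources: dict  # output buf name -> code to splice at the store
    prologue_sources: dict  # input buf name -> code to splice at the load

@dataclass
class TemplateFusionOutput:
    """What the backend returns after splicing into its kernel source."""
    src_code: str

class ToyBackend:
    """A trivial 'backend' that splices by placeholder substitution."""
    def __init__(self, template_src: str):
        self.template_src = template_src

    def fuse(self, spec: TemplateFusionSpec) -> TemplateFusionOutput:
        src = self.template_src
        for buf, code in spec.epilogue_sources.items():
            src = src.replace(f"<STORE_OUTPUT_{buf}>", code)
        for buf, code in spec.prologue_sources.items():
            src = src.replace(f"<LOAD_INPUT_{buf}>", code)
        return TemplateFusionOutput(src_code=src)

tb = ToyBackend("acc = <LOAD_INPUT_a> * b\n<STORE_OUTPUT_buf0>")
out = tb.fuse(TemplateFusionSpec(
    epilogue_sources={"buf0": "tl.store(out_ptr, tl.maximum(acc, 0.0))"},
    prologue_sources={"a": "tl.exp(tl.load(a_ptr))"},
))
```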

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo


pytorch-bot bot commented Mar 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176571

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 9 Unrelated Failures

As of commit 6ccaafd with merge base 6ad9c43:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot bot commented Mar 5, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@yf225 yf225 changed the title to [Helion + torch.compile] Add ExternalTritonTemplateKernel for external template prologue/epilogue fusion Mar 5, 2026
@yf225 yf225 added the topic: not user facing topic category label Mar 5, 2026
@yf225 yf225 force-pushed the helion_inductor_fusion_pr1 branch 6 times, most recently from 971cd8d to 2229e9d Compare March 6, 2026 06:52
@@ -463,7 +463,7 @@ def simplify_indexing(index: sympy.Expr):
self.rsplit_size = 0
self.saved_partial_accumulate: list[PartialAccumulate] = []

- def codegen_template_override(
+ def codegen_template_body(
@yf225 yf225 Mar 6, 2026


The overall codegen lifecycle is laid out in the PR description above.

@@ -463,7 +463,7 @@ def simplify_indexing(index: sympy.Expr):
self.rsplit_size = 0
self.saved_partial_accumulate: list[PartialAccumulate] = []

- def codegen_template_override(
@yf225 yf225 Mar 6, 2026


This is intentionally removed in favor of codegen_template_body() as the new extension point.
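The extension-point shape this comment describes can be sketched as below; everything except the codegen_template_body / get_unfused_epilogues method names is invented for illustration:

```python
class TritonTemplateKernelSketch:
    """Toy base class: codegen_template_body is the single dispatch point
    that _codegen_single_template calls, replacing the old
    codegen_template_override hook (bodies are illustrative only)."""

    def codegen_template_body(self, prologue_groups) -> str:
        # [Standard] path: render, fuse all epilogues, finalize hooks
        return "standard template src"

    def get_unfused_epilogues(self):
        # standard templates fuse every epilogue they accept
        return []

class ExternalTritonTemplateKernelSketch(TritonTemplateKernelSketch):
    def __init__(self):
        self._unfused_epilogues = []

    def codegen_template_body(self, prologue_groups) -> str:
        # [External] path: partition epilogues, build a fusion spec,
        # and let the backend splice into its own Triton AST
        self._unfused_epilogues = ["tanh_epilogue"]  # illustrative leftover
        return "external template src (post tb.fuse)"

    def get_unfused_epilogues(self):
        # leftover epilogues get codegen'd after call_kernel
        return self._unfused_epilogues
```

The caller in simd.py only touches these two methods, which is why a single overridable body method can subsume the old override hook.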

@yf225 yf225 force-pushed the helion_inductor_fusion_pr1 branch 12 times, most recently from 7b775e9 to 6529c27 Compare March 6, 2026 19:48
Contributor Author


This is intentionally removed as it's no longer needed; instead, we directly add the ir.MultiOutput handling logic in can_fuse_multi_outputs_template().
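A rough sketch of what such a check might do (simplified; the real function signature and node types in the scheduler differ):

```python
class Node:
    """Minimal stand-in for a scheduler node."""
    def __init__(self, name, is_multi_output=False, parent=None):
        self.name = name
        self.is_multi_output = is_multi_output  # stand-in for ir.MultiOutput
        self.parent = parent  # the multi-output buffer this getitem unpacks, if any

def can_fuse_multi_outputs_template(template_node, consumer):
    """Allow fusion across the MultiOutput indirection: a consumer that is
    a getitem of a multi-output template is treated as fusable with the
    template itself (illustrative logic only)."""
    if consumer.is_multi_output and consumer.parent is template_node:
        return True
    return False

tmpl = Node("helion_template")
getitem0 = Node("getitem0", is_multi_output=True, parent=tmpl)
other = Node("pointwise")
```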

@yf225 yf225 marked this pull request as ready for review March 6, 2026 21:27
@yf225 yf225 force-pushed the helion_inductor_fusion_pr1 branch 2 times, most recently from 545a723 to 8040ee5 Compare March 6, 2026 22:18
@yf225 yf225 marked this pull request as draft March 7, 2026 02:03
@yf225 yf225 force-pushed the helion_inductor_fusion_pr1 branch 3 times, most recently from b93b7c4 to 0b9593c Compare March 7, 2026 04:40
@yf225 yf225 marked this pull request as ready for review March 7, 2026 05:29
@yf225 yf225 force-pushed the helion_inductor_fusion_pr1 branch from 0b9593c to f4dba7f Compare March 7, 2026 07:27

yf225 commented Mar 8, 2026

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased helion_inductor_fusion_pr1 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout helion_inductor_fusion_pr1 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the helion_inductor_fusion_pr1 branch from f4dba7f to 8217bdf Compare March 8, 2026 04:00
@yf225 yf225 marked this pull request as draft March 9, 2026 01:42
@yf225 yf225 force-pushed the helion_inductor_fusion_pr1 branch from 8217bdf to 6ccaafd Compare March 9, 2026 21:54
@yf225

yf225 commented Mar 10, 2026

Closing this PR in favor of the PR stack at #177062.

@yf225 yf225 closed this Mar 10, 2026