[Helion + torch.compile] Add prologue/epilogue fusion to ExternalTritonTemplateKernel by yf225 · Pull Request #177492 · pytorch/pytorch

yf225 · 2026-03-15T20:31:02Z

Stack from ghstack (oldest at bottom):

[Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion #177065
-> [Helion + torch.compile] Add prologue/epilogue fusion to ExternalTritonTemplateKernel #177492

Add fusion methods to ExternalTritonTemplateKernel:

- _compute_fusion_metadata(): determines eligible epilogues/prologues, builds epilogue specs, and computes prologue sources before render()
- _setup_fusion_hooks(): sets up epilogue/prologue render hooks and marks prologue buffers during render() (inside V.kernel context)
- _find_eligible_epilogues(): computes fusion eligibility and registers extra inputs needed by fused epilogues
- _setup_epilogue_hook(): creates independent subgraph with contiguous index state for epilogue codegen
- _setup_prologue_hook(): creates subgraph with capture store handler for prologue codegen, with variable renaming to avoid collisions
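For readers unfamiliar with the terminology, here is a minimal pure-Python sketch of what prologue and epilogue fusion mean for a template kernel. It is not the Inductor implementation: `matmul_template`, `load_b`, and `store_out` are hypothetical names, and the sigmoid/relu/bias choice simply mirrors the example used by the follow-up unit test. The point is only that a pointwise op on an input can run at the template's load site, and a pointwise op on the output can run at its store site, without materializing intermediate buffers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matmul_template(a, b, load_b=lambda v: v, store_out=lambda v, row, col: v):
    """Toy 'template kernel' computing A @ B, with hooks at the load and store sites.

    load_b and store_out are hypothetical stand-ins for fused prologue/epilogue code.
    """
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * load_b(b[p][j])  # prologue runs where B is loaded
            out[i][j] = store_out(acc, i, j)  # epilogue runs where the output is stored
    return out


def unfused(a, b, bias):
    # Three separate passes: sigmoid(b), then the template, then relu(...) * bias.
    b_act = [[sigmoid(v) for v in row] for row in b]
    c = matmul_template(a, b_act)
    return [[max(v, 0.0) * bias[j] for j, v in enumerate(row)] for row in c]


def fused(a, b, bias):
    # One pass: bias is an extra input read only by the epilogue.
    return matmul_template(
        a,
        b,
        load_b=sigmoid,
        store_out=lambda v, i, j: max(v, 0.0) * bias[j],
    )


a = [[1.0, -2.0], [0.5, 3.0]]
b = [[0.0, 1.0], [-1.0, 2.0]]
bias = [2.0, -1.0]
result_fused = fused(a, b, bias)
result_unfused = unfused(a, b, bias)
```

Both paths compute the same values; the fused one just avoids the intermediate `b_act` and `c` buffers, which is the saving that fusion is after.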

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Differential Revision: D96779294


pytorch-bot · 2026-03-15T20:31:06Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177492

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 11 Pending, 1 Unrelated Failure

As of commit 392db5e with merge base d1f78bd:

BROKEN TRUNK - The following job failed, but the failure was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

torchtitan-test / torchtitan-x-pytorch-test / test (torchtitan_features_integration, 1, 1, linux.g5.48xlarge.nvidia.gpu) (gh) (trunk failure)
RuntimeError: 1 test steps failed: ['scripts/ci/pytorch_ci_test_runner.sh feature_tests']

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot · 2026-03-15T20:31:09Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


yf225 · 2026-03-16T18:54:39Z

@yf225 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

yf225 · 2026-03-16T20:37:20Z

@pytorchbot merge -f "unrelated failures"

pytorchmergebot · 2026-03-16T20:50:36Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…el fusion (#177065)

Add test_external_template_prologue_epilogue_fusion that exercises:

- Prologue fusion: sigmoid(b) fused into template as <LOAD_INPUT_B>
- Epilogue fusion: relu(...) * bias fused into template as <STORE_OUTPUT_0>
- Extra inputs: bias is read by the epilogue but is not among the template's original inputs, exercising kernel._extra_inputs

Uses a _MockExternalTemplateBuffer that subclasses TemplateBuffer and creates an ExternalTritonTemplateKernel. The _render() method calls kernel._setup_fusion_hooks() to set up all fusion hooks in one call, then reads kernel._prologue_source_buffers and kernel._extra_store_targets to build the template source with the appropriate placeholders.

Pull Request resolved: #177065
Approved by: https://github.com/jansel
ghstack dependencies: #177492
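The placeholder mechanism described in this commit message can be pictured roughly as follows. This is only a sketch under assumptions: the `TEMPLATE` string, the `render` helper, and the generated snippets are hypothetical and do not reproduce the mock buffer's actual `_render()` or Inductor's render-hook machinery. It shows the general idea of a template source carrying `<LOAD_INPUT_B>` and `<STORE_OUTPUT_0>` markers that are later replaced by generated prologue and epilogue code.

```python
# Hypothetical template source; the real test builds its own template text.
TEMPLATE = """\
b_val = <LOAD_INPUT_B>
acc = a_val * b_val
<STORE_OUTPUT_0>
"""


def render(template, hooks):
    """Replace each placeholder with the code string its hook produces."""
    for placeholder, hook in hooks.items():
        template = template.replace(placeholder, hook())
    return template


# Each hook returns generated code as a string (illustrative Triton-like text only;
# nothing here is compiled or executed).
hooks = {
    # Prologue: sigmoid applied where input b is loaded.
    "<LOAD_INPUT_B>": lambda: "tl.sigmoid(tl.load(b_ptr + offs))",
    # Epilogue: relu(acc) * bias applied where the output is stored; bias is an extra input.
    "<STORE_OUTPUT_0>": lambda: (
        "tl.store(out_ptr + offs, tl.maximum(acc, 0.0) * tl.load(bias_ptr + offs))"
    ),
}

rendered = render(TEMPLATE, hooks)
print(rendered)
```

Deferring the substitution to hooks means the template author only has to mark load and store sites; what gets fused there is decided later by the scheduler.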

…tput templates (#177597)

TemplateBuffer subclasses with MultiOutputLayout (e.g. Helion kernels) don't have a single dtype. Add an explicit error in TemplateBuffer.dtype for this case, and guard the scheduler's low-precision heuristic with is_multi_outputs_template() so it skips the check rather than crashing.

Pull Request resolved: #177597
Approved by: https://github.com/shunting314
ghstack dependencies: #177492, #177065
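The guard pattern from this commit message could look something like the sketch below. All classes here are toy stand-ins, not the real `torch._inductor` types: `ToyTemplateBuffer`, `FixedLayoutStub`, `MultiOutputLayoutStub`, and `low_precision_check` are hypothetical, and whether `is_multi_outputs_template()` is a method on the buffer or lives elsewhere is an assumption. The sketch only shows the shape of the fix: fail loudly when asked for a single dtype, and have the heuristic skip multi-output templates before touching `dtype`.

```python
class MultiOutputLayoutStub:
    """Toy stand-in for a layout that describes several outputs."""


class FixedLayoutStub:
    """Toy stand-in for a single-output layout with one dtype."""

    def __init__(self, dtype):
        self.dtype = dtype


class ToyTemplateBuffer:
    def __init__(self, layout):
        self.layout = layout

    def is_multi_outputs_template(self):
        return isinstance(self.layout, MultiOutputLayoutStub)

    @property
    def dtype(self):
        # Explicit error instead of an obscure failure further down the line.
        if self.is_multi_outputs_template():
            raise NotImplementedError("multi-output template has no single dtype")
        return self.layout.dtype


def low_precision_check(buf):
    """Hypothetical heuristic: only inspect dtype for single-output templates."""
    if buf.is_multi_outputs_template():
        return False  # skip the check rather than crash
    return buf.dtype in ("float16", "bfloat16")


single = ToyTemplateBuffer(FixedLayoutStub("float16"))
multi = ToyTemplateBuffer(MultiOutputLayoutStub())
```

With this arrangement, callers that genuinely need a single dtype still get a clear error, while the heuristic stays safe for both kinds of buffer.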


yf225 mentioned this pull request Mar 15, 2026

[Helion + torch.compile] Extend TemplateBuffer and scheduler for external backends #177491

Closed

pytorch-bot bot added the ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), and module: inductor labels Mar 15, 2026

yf225 mentioned this pull request Mar 15, 2026

[Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion #177065

Closed

yf225 added the topic: not user facing (topic category) label Mar 16, 2026

yf225 requested review from eellison, jansel, oulgen and shunting314 March 16, 2026 04:43

jansel approved these changes Mar 16, 2026

View reviewed changes

yf225 added the ciflow/trunk (Trigger trunk jobs on your pull request) label Mar 16, 2026

pytorchmergebot added the merging label Mar 16, 2026

pytorchmergebot closed this in 358b1b3 Mar 16, 2026

pytorchmergebot added Merged and removed merging labels Mar 16, 2026

This was referenced Mar 16, 2026

[Helion + torch.compile] Fix prologue fusion dtype check for multi-output templates #177597

Closed

[Helion + torch.compile] Fix prologue fusion dtype check for multi-output templates #177598

Open

yf225 wants to merge 3 commits into gh/yf225/143/base from gh/yf225/143/head
