[Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion by yf225 · Pull Request #177065 · pytorch/pytorch

yf225 · 2026-03-10T20:08:48Z

Stack from ghstack (oldest at bottom):

-> [Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion #177065
[Helion + torch.compile] Add prologue/epilogue fusion to ExternalTritonTemplateKernel #177492

Add test_external_template_prologue_epilogue_fusion that exercises:

- Prologue fusion: sigmoid(b) fused into template as <LOAD_INPUT_B>
- Epilogue fusion: relu(...) * bias fused into template as <STORE_OUTPUT_0>
- Extra inputs: bias is read by the epilogue but is not among the template's original inputs, exercising kernel._extra_inputs
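
To make the three cases above concrete, here is a minimal pure-Python sketch of what prologue and epilogue fusion mean semantically. It is not the test from this PR: `template_body` is a hypothetical stand-in for whatever the mock template computes, and the function names are invented for illustration. The point is only that the fused loop produces the same values as three separate passes, and that `bias` enters as an extra input the template itself never reads.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def template_body(a_val: float, b_val: float) -> float:
    # Hypothetical stand-in for the template's core computation (assumption).
    return a_val - b_val


def unfused(a, b, bias):
    # Three separate passes, each materializing an intermediate list.
    b_act = [sigmoid(x) for x in b]                          # prologue pass
    out = [template_body(x, y) for x, y in zip(a, b_act)]    # template pass
    return [relu(x) * w for x, w in zip(out, bias)]          # epilogue pass


def fused(a, b, bias):
    # One pass: the prologue runs where b is loaded, the epilogue where the
    # output is stored. bias is an extra input used only by the epilogue.
    result = []
    for i in range(len(a)):
        b_val = sigmoid(b[i])                 # plays the role of <LOAD_INPUT_B>
        acc = template_body(a[i], b_val)
        result.append(relu(acc) * bias[i])    # plays the role of <STORE_OUTPUT_0>
    return result
```

Calling `fused(a, b, bias)` and `unfused(a, b, bias)` on the same lists should give identical results, since both apply the same operations in the same order per element.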

Uses a _MockExternalTemplateBuffer that subclasses TemplateBuffer and
creates an ExternalTritonTemplateKernel. The _render() method calls
kernel._setup_fusion_hooks() to set up all fusion hooks in one call,
then reads kernel._prologue_source_buffers and kernel._extra_store_targets
to build the template source with the appropriate placeholders.
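
The render step described above can be pictured with a small string-templating sketch. This is an illustration only, not Inductor's implementation: `build_template_source`, `fill_placeholders`, and the pseudo-kernel lines are all hypothetical, and the arguments merely play the role that `kernel._prologue_source_buffers` and `kernel._extra_store_targets` play in the description. The idea shown is that a placeholder is emitted only where a fusion was set up, and a plain load or store is emitted otherwise.

```python
def build_template_source(prologue_inputs: set, fused_store_outputs: set) -> str:
    """Build a toy template body, emitting placeholders only where fusion applies.

    Both arguments are hypothetical stand-ins for the fusion state the kernel
    exposes after its hooks have been set up.
    """
    load_b = "<LOAD_INPUT_B>" if "b" in prologue_inputs else "b_val = load(b_ptr)"
    store = "<STORE_OUTPUT_0>" if 0 in fused_store_outputs else "store(out_ptr, acc)"
    return "\n".join(
        [
            "a_val = load(a_ptr)",
            load_b,
            "acc = compute(a_val, b_val)",
            store,
        ]
    )


def fill_placeholders(source: str, hooks: dict) -> str:
    """Replace each placeholder with the fused code registered for it."""
    for placeholder, code in hooks.items():
        source = source.replace(placeholder, code)
    return source


# Usage: fuse sigmoid into the load of b and relu(...) * bias into the store.
src = build_template_source({"b"}, {0})
final = fill_placeholders(
    src,
    {
        "<LOAD_INPUT_B>": "b_val = sigmoid(load(b_ptr))",
        "<STORE_OUTPUT_0>": "store(out_ptr, relu(acc) * load(bias_ptr))",
    },
)
```

With no fusion requested, `build_template_source(set(), set())` returns a body with plain loads and stores and no placeholders.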

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Differential Revision: D96849203

Add ExternalTritonTemplateKernel class that subclasses TritonTemplateKernel for external template backends (e.g. Helion). Key methods: _compute_fusion_metadata(), _setup_fusion_hooks(), _find_eligible_epilogues(), _setup_epilogue_hook(), _setup_prologue_hook(), call_kernel(), emit_kernel_override().

Extend TemplateBuffer base class with fields needed by external backends: epilogue_fusable_outputs, _multi_output_children, _named_inputs, and add realize_template_input() and build_multi_outputs() class methods. Add MultiOutputLayout handling to extract_read_writes(). Set epilogue_fusable_outputs in TritonTemplateBuffer.

Scheduler changes:
- Generalize prologue fusion: check allowed_prologue_inps instead of isinstance(TritonTemplateBuffer)
- Add multi-output template epilogue guard requiring ComputedBuffer
- Replace can_fuse_multi_output_epilogue delegation with inline MultiOutput parent check

[ghstack-poisoned]
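
The first scheduler change in this commit message (checking `allowed_prologue_inps` rather than the concrete buffer type) can be sketched as follows. The classes here are simplified, hypothetical stand-ins and the two `can_fuse_prologue_*` functions are invented for illustration; the real scheduler logic has more conditions. The sketch only shows why a capability check lets an external backend's buffer participate where a type check would exclude it.

```python
class TemplateBuffer:
    # Hypothetical minimal stand-in for the Inductor IR class of the same name.
    def __init__(self, allowed_prologue_inps=()):
        self.allowed_prologue_inps = set(allowed_prologue_inps)


class TritonTemplateBuffer(TemplateBuffer):
    pass


class ExternalTemplateBuffer(TemplateBuffer):
    # Hypothetical buffer type owned by an external backend (e.g. Helion).
    pass


def can_fuse_prologue_by_type(template, input_name: str) -> bool:
    # Type-based check: only TritonTemplateBuffer is ever eligible.
    return (
        isinstance(template, TritonTemplateBuffer)
        and input_name in template.allowed_prologue_inps
    )


def can_fuse_prologue_by_capability(template, input_name: str) -> bool:
    # Capability-based check: any template that lists the input is eligible.
    allowed = getattr(template, "allowed_prologue_inps", None)
    return bool(allowed) and input_name in allowed


external = ExternalTemplateBuffer(allowed_prologue_inps={"b"})
triton = TritonTemplateBuffer(allowed_prologue_inps={"b"})
```

Under these assumptions, `external` is rejected by the type-based check but accepted by the capability-based one, while `triton` is accepted by both.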

pytorch-bot · 2026-03-10T20:08:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177065

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 4 Pending, 3 Unrelated Failures

As of commit 6c88a20 with merge base d1f78bd:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

inductor / inductor-test / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (disabled by #176109)
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_linalg_multi_dot_cuda_float32

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

torchtitan-test / torchtitan-x-pytorch-test / test (torchtitan_features_integration, 1, 1, linux.g5.48xlarge.nvidia.gpu) (gh) (trunk failure)
RuntimeError: 1 test steps failed: ['scripts/ci/pytorch_ci_test_runner.sh feature_tests']

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot · 2026-03-10T20:08:56Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

yf225 · 2026-03-16T21:14:15Z

@pytorchbot merge -f "unrelated failures"

pytorchmergebot · 2026-03-16T21:16:07Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

yf225 · 2026-03-17T00:47:49Z

@yf225 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

…tput templates (#177597)

TemplateBuffer subclasses with MultiOutputLayout (e.g. Helion kernels) don't have a single dtype. Add an explicit error in TemplateBuffer.dtype for this case, and guard the scheduler's low-precision heuristic with is_multi_outputs_template() so it skips the check rather than crashing.

Pull Request resolved: #177597
Approved by: https://github.com/shunting314
ghstack dependencies: #177492, #177065
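
The follow-up fix described in this commit can be illustrated with a small sketch. Everything below is a simplified assumption: the class bodies, the `NotImplementedError` type, the string dtypes, and the free function `low_precision_check_applies` are invented for illustration, and `is_multi_outputs_template` is written as a plain function here although the real one may live elsewhere. It shows the two halves of the fix: an explicit error when a multi-output template is asked for a single dtype, and a guard so the heuristic never asks.

```python
class FixedLayout:
    # Hypothetical single-output layout carrying one dtype.
    def __init__(self, dtype: str):
        self.dtype = dtype


class MultiOutputLayout:
    # Hypothetical layout for templates with several outputs; no single dtype.
    pass


class TemplateBuffer:
    def __init__(self, layout):
        self.layout = layout

    @property
    def dtype(self) -> str:
        if isinstance(self.layout, MultiOutputLayout):
            # Explicit error instead of an obscure attribute failure.
            raise NotImplementedError(
                "multi-output template has no single dtype; query its outputs"
            )
        return self.layout.dtype


def is_multi_outputs_template(buf: TemplateBuffer) -> bool:
    return isinstance(buf.layout, MultiOutputLayout)


LOW_PRECISION = {"float16", "bfloat16"}


def low_precision_check_applies(buf: TemplateBuffer) -> bool:
    # Guard first, so the dtype property is never touched for multi-output
    # templates and the heuristic skips them rather than crashing.
    if is_multi_outputs_template(buf):
        return False
    return buf.dtype in LOW_PRECISION
```

Ordering matters in `low_precision_check_applies`: swapping the guard and the dtype lookup would reintroduce the crash for multi-output templates.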

pytorch-bot bot added ciflow/inductor, ciflow/torchtitan, module: inductor labels Mar 10, 2026

yf225 changed the title ~~Add ExternalTritonTemplateKernel and scheduler fusion updates~~ [Helion + torch.compile] Add ExternalTritonTemplateKernel and scheduler fusion updates Mar 10, 2026

yf225 added the topic: not user facing topic category label Mar 10, 2026

This was referenced Mar 15, 2026

[Helion + torch.compile] Extend TemplateBuffer and scheduler for external backends #177491

Closed

[Helion + torch.compile] Add prologue/epilogue fusion to ExternalTritonTemplateKernel #177492

Closed

yf225 changed the title ~~[Helion + torch.compile] Add ExternalTritonTemplateKernel and scheduler fusion updates~~ [Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion Mar 16, 2026

yf225 requested review from eellison, jansel, oulgen and shunting314 March 16, 2026 04:47

jansel approved these changes Mar 16, 2026

View reviewed changes

pytorchmergebot added the merging label Mar 16, 2026

pytorchmergebot closed this in 388f1fa Mar 16, 2026

pytorchmergebot added Merged and removed merging labels Mar 16, 2026

This was referenced Mar 16, 2026

[Helion + torch.compile] Fix prologue fusion dtype check for multi-output templates #177597

Closed

[Helion + torch.compile] Fix prologue fusion dtype check for multi-output templates #177598

Open

yf225 wants to merge 34 commits into gh/yf225/137/base from gh/yf225/137/head


3 participants


pytorch-bot bot commented Mar 10, 2026 • edited

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177065

⏳ 4 Pending, 3 Unrelated Failures


pytorch-bot bot commented Mar 10, 2026

This PR needs a `release notes:` label


yf225 commented Mar 16, 2026


pytorchmergebot commented Mar 16, 2026

Merge started


yf225 commented Mar 17, 2026
