[Helion + torch.compile] Refactor template codegen pipeline for extensibility by yf225 · Pull Request #177064 · pytorch/pytorch

yf225 · 2026-03-10T20:08:44Z

Stack from ghstack (oldest at bottom):

[Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion #177065
-> [Helion + torch.compile] Refactor template codegen pipeline for extensibility #177064

Restructure the template code generation pipeline so that external
template backends (e.g. Helion) can participate in epilogue/prologue
fusion without duplicating the Triton-specific codegen logic.

Key changes:

- Move epilogue/prologue codegen out of _codegen_single_template (in
  simd.py) and into TritonTemplateKernel.codegen_template_body(), so
  each kernel subclass owns its own source generation.
  _codegen_single_template now only handles the shared orchestration:
  prologue-fused input cleanup, benchmark wrapping, mark_run, and
  define_kernel.
- Rename SIMDKernel.codegen_template_override → codegen_template_body
  and add a get_unfused_epilogues() hook, giving subclasses two clear
  extension points.
- Add PartialRender._replace_placeholder() for indent-aware hook
  substitution, replacing ad-hoc indent arithmetic scattered across
  load_input() / store_output() / CuteDSL unpack_buffers.
- Extract helpers (_setup_contiguous_index_state, _make_codegen_hook,
  _make_independent_subgraph, _compute_fusion_metadata,
  codegen_prologues_in_subgraphs) from the monolithic load_input /
  store_output methods so they can be reused or overridden by external
  kernel classes.
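
The two extension points described above can be illustrated with a minimal sketch. Everything here is hypothetical: the class names, the method signatures, and the string-based "codegen" are assumptions made for illustration and do not reproduce the actual SIMDKernel or TritonTemplateKernel code. The only parts taken from the PR description are the hook names, the base class raising NotImplementedError, and the idea that shared orchestration stays outside the kernel.

```python
from __future__ import annotations


class SketchSIMDKernel:
    """Hypothetical stand-in for the base kernel class (not the real SIMDKernel)."""

    def codegen_template_body(self, epilogues: list[str]) -> str:
        # Base class only declares the hook; subclasses own source generation.
        raise NotImplementedError(
            f"{type(self).__name__} must implement codegen_template_body"
        )

    def get_unfused_epilogues(self, epilogues: list[str]) -> list[str]:
        # Default assumption: every epilogue is fused into the template body.
        return []


class SketchTritonKernel(SketchSIMDKernel):
    """Fuses every epilogue into the generated body."""

    def codegen_template_body(self, epilogues: list[str]) -> str:
        lines = ["acc = template_main_loop()"]
        lines += [f"acc = {name}(acc)" for name in epilogues]
        lines.append("store(acc)")
        return "\n".join(lines)


class SketchExternalKernel(SketchSIMDKernel):
    """External backend that can fuse only some epilogues (assumed set)."""

    fusable = frozenset({"relu"})

    def codegen_template_body(self, epilogues: list[str]) -> str:
        fused = [name for name in epilogues if name in self.fusable]
        lines = ["acc = external_main_loop()"]
        lines += [f"acc = {name}(acc)" for name in fused]
        lines.append("store(acc)")
        return "\n".join(lines)

    def get_unfused_epilogues(self, epilogues: list[str]) -> list[str]:
        # Anything the backend cannot fuse is handed back for separate codegen.
        return [name for name in epilogues if name not in self.fusable]


def codegen_single_template(kernel: SketchSIMDKernel, epilogues: list[str]):
    """Shared orchestration: the kernel renders its body, leftovers are returned."""
    body = kernel.codegen_template_body(epilogues)
    leftovers = kernel.get_unfused_epilogues(epilogues)
    return body, leftovers
```

In this sketch the orchestration function never inspects the kernel type, which is the property the refactor appears to be after: an external backend only overrides the two hooks.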

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Differential Revision: D96526007

Restructure the template code generation pipeline to separate concerns and enable external template backends:

- Rename SIMDKernel.codegen_template_override → codegen_template_body (now raises NotImplementedError; actual impl in TritonTemplateKernel)
- Add get_unfused_epilogues() for epilogues that need separate codegen
- Move epilogue/prologue codegen from _codegen_single_template into TritonTemplateKernel.codegen_template_body()
- Simplify _codegen_single_template to dispatch to kernel, handle benchmark wrapping, mark_run, and define_kernel
- Add PartialRender._replace_placeholder() for indent-aware hook substitution (replaces manual indent handling)
- Extract _setup_contiguous_index_state(), _make_independent_subgraph(), _make_codegen_hook() helpers from load_input/store_output
- Add SubgraphInfo.root_var_renames for prologue variable renaming
- Add _compute_fusion_metadata() and codegen_prologues_in_subgraphs()

pytorch-bot · 2026-03-10T20:08:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177064

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 72eb29f with merge base edf1a92:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
test/inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_pca_lowrank_cuda_float32

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot · 2026-03-10T20:08:52Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

shunting314 · 2026-03-11T17:29:50Z

torch/_inductor/codegen/simd.py

        self.saved_partial_accumulate: list[PartialAccumulate] = []

-    def codegen_template_override(
+    def codegen_template_body(


Does Helion already use codegen_template_override for integration?

yf225 · 2026-03-11

With this new design, I am also planning to change Helion's integration to use codegen_template_body (to be done in Helion PR pytorch/helion#1520). All Helion + torch.compile integration tests are currently disabled, so removing the codegen_template_override extension point should be safe.

shunting314 · 2026-03-11T17:34:01Z

torch/_inductor/select_algorithm.py

+        idx = self._code.find(hook_key)
+        if idx < 0:
+            return self._code.replace(hook_key, result)


I'm a bit confused. Why do the replacement if the key is not found? Don't you actually want to return immediately?

yf225 · 2026-03-12

Yes, good catch. I updated this helper function to fix the issue and also made the logic clearer.
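
The updated helper is not shown in this thread. As a rough illustration of the point raised in the review, here is a minimal sketch of indent-aware placeholder substitution that returns early when the key is absent. The function name, signature, and indentation policy are assumptions for illustration only and are not the actual PartialRender._replace_placeholder() implementation.

```python
def replace_placeholder(code: str, hook_key: str, result: str) -> str:
    """Replace the first occurrence of hook_key in code with result.

    Hypothetical sketch: if the placeholder sits alone on an indented line,
    continuation lines of a multi-line result receive the same indentation.
    """
    idx = code.find(hook_key)
    if idx < 0:
        # Key not present: nothing to substitute, so return the code unchanged.
        return code

    # Text between the start of the placeholder's line and the placeholder.
    line_start = code.rfind("\n", 0, idx) + 1
    prefix = code[line_start:idx]
    tail = code[idx + len(hook_key):]

    if prefix.strip():
        # Placeholder shares its line with other code: substitute verbatim.
        return code[:idx] + result + tail

    # Placeholder is preceded only by whitespace: reuse it as the indent for
    # every line after the first (the first line already follows the prefix).
    lines = result.split("\n")
    indented = [lines[0]] + [prefix + ln if ln else ln for ln in lines[1:]]
    return code[:idx] + "\n".join(indented) + tail
```

With the early return, a missing key leaves the code untouched instead of falling through to a str.replace call that could never match.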

pytorchmergebot · 2026-03-12T06:26:49Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2026-03-12T07:14:25Z

Merge failed

Reason: 4 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx950.1), trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx950.1), trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 4, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu), trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu)

Details for Dev Infra team

Raised by workflow job

yf225 · 2026-03-13T21:11:01Z

@yf225 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

yf225 · 2026-03-14T16:17:59Z

@pytorchbot merge -f "unrelated failures"

pytorchmergebot · 2026-03-14T16:19:48Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…sibility (pytorch#177064)

Restructure the template code generation pipeline so that external template backends (e.g. Helion) can participate in epilogue/prologue fusion without duplicating the Triton-specific codegen logic.

Key changes:

- Move epilogue/prologue codegen out of _codegen_single_template (in simd.py) and into TritonTemplateKernel.codegen_template_body(), so each kernel subclass owns its own source generation. _codegen_single_template now only handles the shared orchestration: prologue-fused input cleanup, benchmark wrapping, mark_run, and define_kernel.
- Rename SIMDKernel.codegen_template_override → codegen_template_body and add get_unfused_epilogues() hook, giving subclasses two clear extension points.
- Add PartialRender._replace_placeholder() for indent-aware hook substitution, replacing ad-hoc indent arithmetic scattered across load_input() / store_output() / CuteDSL unpack_buffers.
- Extract helpers (_setup_contiguous_index_state, _make_codegen_hook, _make_independent_subgraph, _compute_fusion_metadata, codegen_prologues_in_subgraphs) from the monolithic load_input / store_output methods so they can be reused or overridden by external kernel classes.

Differential Revision: [D96526007](https://our.internmc.facebook.com/intern/diff/D96526007)
Pull Request resolved: pytorch#177064
Approved by: https://github.com/jansel

pytorch-bot bot added the ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), and module: inductor labels Mar 10, 2026

This was referenced Mar 10, 2026

[Helion + torch.compile] Fix MultiOutput write deps to eliminate fusion workarounds #177062

Closed

[Helion + torch.compile] Refactor TemplateBuffer as extensible base class #177063

Closed

yf225 mentioned this pull request Mar 10, 2026

[Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion #177065

Closed

yf225 changed the title ~~Refactor template codegen pipeline for extensibility~~ [Helion + torch.compile] Refactor template codegen pipeline for extensibility Mar 10, 2026

yf225 added the topic: not user facing topic category label Mar 10, 2026

yf225 force-pushed the gh/yf225/136/head branch from 27c83b4 to 909d1ba Compare March 11, 2026 06:00

yf225 added 2 commits March 10, 2026 23:01

yf225 requested review from eellison, jansel, oulgen and shunting314 March 11, 2026 07:14

shunting314 reviewed Mar 11, 2026

yf225 added 3 commits March 11, 2026 14:51

pytorchmergebot removed the merging label Mar 12, 2026

yf225 added 2 commits March 12, 2026 11:54

yf225 mentioned this pull request Mar 12, 2026

[Helion + torch.compile] Fix MultiOutput write deps and extend fusion score matching #177302

Closed

yf225 mentioned this pull request Mar 12, 2026

[Helion + torch.compile] Normalize MultiOutput write deps for consistent fusion matching #177315

Closed

yf225 added 2 commits March 12, 2026 14:00

yf225 mentioned this pull request Mar 13, 2026

[Re-land] [Helion + torch.compile] Refactor TemplateBuffer as extensible base class #177367

Closed

yf225 added 3 commits March 12, 2026 21:44

pytorchmergebot added the merging label Mar 14, 2026

pytorchmergebot closed this in c4d3dcd Mar 14, 2026

pytorchmergebot added Merged and removed merging labels Mar 14, 2026

yf225 wants to merge 27 commits into gh/yf225/136/base from gh/yf225/136/head

4 participants
