[Helion + torch.compile] Refactor template codegen pipeline for extensibility #177064
yf225 wants to merge 27 commits into gh/yf225/136/base
Restructure the template code generation pipeline to separate concerns and enable external template backends:
- Rename SIMDKernel.codegen_template_override → codegen_template_body (now raises NotImplementedError; actual impl in TritonTemplateKernel)
- Add get_unfused_epilogues() for epilogues that need separate codegen
- Move epilogue/prologue codegen from _codegen_single_template into TritonTemplateKernel.codegen_template_body()
- Simplify _codegen_single_template to dispatch to kernel, handle benchmark wrapping, mark_run, and define_kernel
- Add PartialRender._replace_placeholder() for indent-aware hook substitution (replaces manual indent handling)
- Extract _setup_contiguous_index_state(), _make_independent_subgraph(), _make_codegen_hook() helpers from load_input/store_output
- Add SubgraphInfo.root_var_renames for prologue variable renaming
- Add _compute_fusion_metadata() and codegen_prologues_in_subgraphs()
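The first two bullets describe a familiar extension-point shape: the base class declares the method and raises NotImplementedError, and each backend subclass supplies its own source generation. The following is a minimal sketch of that shape, not the actual Inductor code. The class names `BaseTemplateKernel` and `ToyBackendKernel`, the string-based epilogues, the method signatures, and the empty-list default for `get_unfused_epilogues()` are all assumptions made for illustration.

```python
class BaseTemplateKernel:
    """Hypothetical stand-in for a SIMDKernel-like base class."""

    def codegen_template_body(self, epilogues: list[str]) -> str:
        # The base class only declares the extension point.
        raise NotImplementedError(
            f"{type(self).__name__} must implement codegen_template_body"
        )

    def get_unfused_epilogues(self, epilogues: list[str]) -> list[str]:
        # Assumed default: every epilogue is fused into the template body.
        return []


class ToyBackendKernel(BaseTemplateKernel):
    """Hypothetical backend that cannot fuse epilogues named 'reduce_*'."""

    def get_unfused_epilogues(self, epilogues: list[str]) -> list[str]:
        return [e for e in epilogues if e.startswith("reduce_")]

    def codegen_template_body(self, epilogues: list[str]) -> str:
        unfused = set(self.get_unfused_epilogues(epilogues))
        fused = [e for e in epilogues if e not in unfused]
        # Emit a toy kernel body with the fused epilogues applied in order.
        lines = ["def kernel():", "    acc = compute()"]
        lines += [f"    acc = {e}(acc)" for e in fused]
        lines.append("    store(acc)")
        return "\n".join(lines)
```

In this sketch the caller would generate separate code for whatever `get_unfused_epilogues()` returns; how the real scheduler treats those nodes is not shown in this description.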
Dr. CI: ✅ You can merge normally (1 unrelated failure). As of commit 72eb29f with merge base edf1a92, one job failed, likely due to flakiness present on trunk.
27c83b4 to 909d1ba
        self.saved_partial_accumulate: list[PartialAccumulate] = []

-    def codegen_template_override(
+    def codegen_template_body(
Does Helion already use codegen_template_override for integration?
With this new design, I am also planning to change Helion's integration to use codegen_template_body (to be done in Helion PR pytorch/helion#1520). All Helion + torch.compile integration tests are currently disabled, so removing the codegen_template_override extension point should be safe.
torch/_inductor/select_algorithm.py
Outdated
        idx = self._code.find(hook_key)
        if idx < 0:
            return self._code.replace(hook_key, result)
I'm a bit confused. Why do the replacement if the key is not found? Don't you want to return immediately?
Yes, good catch. I updated this helper function to fix the issue and also made the logic clearer.
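To illustrate the point raised in this thread, here is a minimal sketch of an indent-aware placeholder substitution that returns early when the key is absent. It is not the actual `PartialRender._replace_placeholder()` implementation; the free function `replace_placeholder`, its signature, and the rule of reusing the placeholder line's leading whitespace for continuation lines are assumptions for illustration.

```python
def replace_placeholder(code: str, hook_key: str, result: str) -> str:
    """Replace the first `hook_key` in `code`, indenting `result` to match."""
    idx = code.find(hook_key)
    if idx < 0:
        # Key not present: nothing to substitute, return the code unchanged.
        return code

    # Text between the start of the placeholder's line and the placeholder.
    line_start = code.rfind("\n", 0, idx) + 1
    prefix = code[line_start:idx]
    tail = code[idx + len(hook_key):]
    if prefix.strip():
        # Placeholder sits mid-line after other code: substitute verbatim.
        return code[:idx] + result + tail

    # Placeholder is the first token on its line: the first result line reuses
    # the existing indent, and later non-empty lines get the same indent.
    lines = result.split("\n")
    indented = [lines[0]] + [prefix + ln if ln else ln for ln in lines[1:]]
    return code[:idx] + "\n".join(indented) + tail


template = "def k():\n    <HOOK>\n    return x"
print(replace_placeholder(template, "<HOOK>", "a = 1\nb = 2"))
```

With the early return, a missing key leaves the rendered code untouched instead of falling through to a replace call that can never match.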
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours).
Merge failed. Reason: 4 jobs have failed, first few of them are:
- trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx950.1)
- trunk / linux-jammy-rocm-py3.10 / test (default, 5, 6, linux.rocm.gpu.gfx950.1)
- trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 4, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu)
- trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 1, 5, lf.linux.g6.4xlarge.experimental.nvidia.gpu)
Restructure the template code generation pipeline so that external template backends (e.g. Helion) can participate in epilogue/prologue fusion without duplicating the Triton-specific codegen logic.

Key changes:
- Move epilogue/prologue codegen out of _codegen_single_template (in simd.py) and into TritonTemplateKernel.codegen_template_body(), so each kernel subclass owns its own source generation. _codegen_single_template now only handles the shared orchestration: prologue-fused input cleanup, benchmark wrapping, mark_run, and define_kernel.
- Rename SIMDKernel.codegen_template_override → codegen_template_body and add get_unfused_epilogues() hook, giving subclasses two clear extension points.
- Add PartialRender._replace_placeholder() for indent-aware hook substitution, replacing ad-hoc indent arithmetic scattered across load_input() / store_output() / CuteDSL unpack_buffers.
- Extract helpers (_setup_contiguous_index_state, _make_codegen_hook, _make_independent_subgraph, _compute_fusion_metadata, codegen_prologues_in_subgraphs) from the monolithic load_input / store_output methods so they can be reused or overridden by external kernel classes.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo
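The split between backend-owned source generation and shared orchestration can be pictured roughly as follows. This is a toy sketch under stated assumptions, not the code in simd.py: `ToyNode`, `ToyKernel`, `ToyScheduler`, the `benchmark_only` flag, and the kernel naming scheme are all hypothetical, and the prologue-fused input cleanup step is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical scheduler node that records whether it was codegen'd."""
    name: str
    ran: bool = False

    def mark_run(self) -> None:
        self.ran = True


@dataclass
class ToyKernel:
    """Hypothetical kernel; a real backend would generate real source."""
    body: str = "acc = a @ b"

    def codegen_template_body(self, epilogues, prologues) -> str:
        # The kernel owns source generation, including fused nodes.
        lines = [f"# prologue: {p.name}" for p in prologues]
        lines.append(self.body)
        lines += [f"# epilogue: {e.name}" for e in epilogues]
        return "\n".join(lines)


@dataclass
class ToyScheduler:
    defined: dict = field(default_factory=dict)

    def define_kernel(self, src: str) -> str:
        name = f"kernel_{len(self.defined)}"
        self.defined[name] = src
        return name

    def codegen_single_template(self, kernel, template, epilogues, prologues,
                                benchmark_only=False):
        # 1. Dispatch: backend-specific codegen lives in the kernel.
        src = kernel.codegen_template_body(epilogues, prologues)
        if benchmark_only:
            # Assumed benchmarking path: return source with no side effects.
            return src
        # 2. Shared bookkeeping that does not depend on the backend.
        for node in [template, *prologues, *epilogues]:
            node.mark_run()
        return self.define_kernel(src)
```

Because the scheduler only calls `codegen_template_body()`, a different kernel class can be swapped in without touching the orchestration code, which is the property the description is after.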
@yf225 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge -f "unrelated failures"
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Restructure the template code generation pipeline to separate concerns and enable external template backends:
- Rename SIMDKernel.codegen_template_override → codegen_template_body (now raises NotImplementedError; actual impl in TritonTemplateKernel)
- Add get_unfused_epilogues() for epilogues that need separate codegen
- Move epilogue/prologue codegen from _codegen_single_template into TritonTemplateKernel.codegen_template_body()
- Simplify _codegen_single_template to dispatch to kernel, handle benchmark wrapping, mark_run, and define_kernel
- Add PartialRender._replace_placeholder() for indent-aware hook substitution (replaces manual indent handling)
- Extract _setup_contiguous_index_state(), _make_independent_subgraph(), _make_codegen_hook() helpers from load_input/store_output
- Add SubgraphInfo.root_var_renames for prologue variable renaming
- Add _compute_fusion_metadata() and codegen_prologues_in_subgraphs()

ghstack-source-id: 4a4a996
Pull Request resolved: pytorch#177064
…sibility (pytorch#177064)

Restructure the template code generation pipeline so that external template backends (e.g. Helion) can participate in epilogue/prologue fusion without duplicating the Triton-specific codegen logic.

Key changes:
- Move epilogue/prologue codegen out of _codegen_single_template (in simd.py) and into TritonTemplateKernel.codegen_template_body(), so each kernel subclass owns its own source generation. _codegen_single_template now only handles the shared orchestration: prologue-fused input cleanup, benchmark wrapping, mark_run, and define_kernel.
- Rename SIMDKernel.codegen_template_override → codegen_template_body and add get_unfused_epilogues() hook, giving subclasses two clear extension points.
- Add PartialRender._replace_placeholder() for indent-aware hook substitution, replacing ad-hoc indent arithmetic scattered across load_input() / store_output() / CuteDSL unpack_buffers.
- Extract helpers (_setup_contiguous_index_state, _make_codegen_hook, _make_independent_subgraph, _compute_fusion_metadata, codegen_prologues_in_subgraphs) from the monolithic load_input / store_output methods so they can be reused or overridden by external kernel classes.

Differential Revision: [D96526007](https://our.internmc.facebook.com/intern/diff/D96526007)
Pull Request resolved: pytorch#177064
Approved by: https://github.com/jansel
Stack from ghstack (oldest at bottom):
Restructure the template code generation pipeline so that external
template backends (e.g. Helion) can participate in epilogue/prologue
fusion without duplicating the Triton-specific codegen logic.
Key changes:
- Move epilogue/prologue codegen out of _codegen_single_template (in
  simd.py) and into TritonTemplateKernel.codegen_template_body(), so
  each kernel subclass owns its own source generation.
  _codegen_single_template now only handles the shared orchestration:
  prologue-fused input cleanup, benchmark wrapping, mark_run, and
  define_kernel.
- Rename SIMDKernel.codegen_template_override → codegen_template_body
  and add get_unfused_epilogues() hook, giving subclasses two clear
  extension points.
- Add PartialRender._replace_placeholder() for indent-aware hook
  substitution, replacing ad-hoc indent arithmetic scattered across
  load_input() / store_output() / CuteDSL unpack_buffers.
- Extract helpers (_setup_contiguous_index_state, _make_codegen_hook,
  _make_independent_subgraph, _compute_fusion_metadata,
  codegen_prologues_in_subgraphs) from the monolithic load_input /
  store_output methods so they can be reused or overridden by external
  kernel classes.
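
For readers unfamiliar with the pattern, here is a rough, self-contained sketch of the two ideas above: a base class whose codegen_template_body raises NotImplementedError so that subclasses own source generation, and an indent-aware placeholder substitution. This is not the Inductor code. The class names ToyTemplateKernel and ToyIndentingKernel, the free function replace_placeholder, the `<STORE_OUTPUT>` marker, and every method signature are assumptions made up for illustration; only the method names codegen_template_body and get_unfused_epilogues are borrowed from this PR.

```python
import textwrap


def replace_placeholder(code, placeholder, replacement):
    """Hypothetical helper: substitute `placeholder` in `code`, indenting each
    line of `replacement` to match the whitespace before the placeholder."""
    out = []
    for line in code.split("\n"):
        if placeholder not in line:
            out.append(line)
            continue
        prefix, suffix = line.split(placeholder, 1)
        if prefix.strip():
            # Placeholder sits mid-line: plain textual substitution.
            out.append(prefix + replacement + suffix)
            continue
        # Placeholder starts the line: re-indent the whole replacement block.
        block = textwrap.dedent(replacement).strip("\n")
        out.append(textwrap.indent(block, prefix) + suffix)
    return "\n".join(out)


class ToyTemplateKernel:
    """Hypothetical base class; signatures are invented for this sketch."""

    def codegen_template_body(self, template_src, epilogue_src):
        # Each subclass is expected to provide its own source generation.
        raise NotImplementedError

    def get_unfused_epilogues(self):
        # Default assumption: nothing is left over for separate codegen.
        return []


class ToyIndentingKernel(ToyTemplateKernel):
    def codegen_template_body(self, template_src, epilogue_src):
        return replace_placeholder(template_src, "<STORE_OUTPUT>", epilogue_src)


template = """\
def kernel(x):
    for i in range(4):
        acc = x[i] * 2
        <STORE_OUTPUT>
    return x
"""
epilogue = "acc = acc + 1\nx[i] = acc"

rendered = ToyIndentingKernel().codegen_template_body(template, epilogue)
print(rendered)
```

The point of the sketch is that the multi-line epilogue lands at the placeholder's indentation without the caller computing any offsets, which is the kind of arithmetic the PR describes moving out of load_input() / store_output().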
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo
Differential Revision: D96526007