[Helion + torch.compile] Refactor template codegen pipeline for extensibility#177064

Closed
yf225 wants to merge 27 commits into gh/yf225/136/base from gh/yf225/136/head

@yf225 yf225 commented Mar 10, 2026

Stack from ghstack (oldest at bottom):

Restructure the template code generation pipeline so that external
template backends (e.g. Helion) can participate in epilogue/prologue
fusion without duplicating the Triton-specific codegen logic.

Key changes:

  • Move epilogue/prologue codegen out of _codegen_single_template (in
    simd.py) and into TritonTemplateKernel.codegen_template_body(), so
    each kernel subclass owns its own source generation.
    _codegen_single_template now only handles the shared orchestration:
    prologue-fused input cleanup, benchmark wrapping, mark_run, and
    define_kernel.

  • Rename SIMDKernel.codegen_template_override → codegen_template_body
    and add get_unfused_epilogues() hook, giving subclasses two clear
    extension points.

  • Add PartialRender._replace_placeholder() for indent-aware hook
    substitution, replacing ad-hoc indent arithmetic scattered across
    load_input() / store_output() / CuteDSL unpack_buffers.

  • Extract helpers (_setup_contiguous_index_state, _make_codegen_hook,
    _make_independent_subgraph, _compute_fusion_metadata,
    codegen_prologues_in_subgraphs) from the monolithic load_input /
    store_output methods so they can be reused or overridden by external
    kernel classes.
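
The split between shared orchestration and kernel-owned codegen described above can be sketched roughly as follows. This is only an illustrative toy, not the Inductor implementation: `BaseTemplateKernel`, `ToyTritonKernel`, `ToyExternalKernel`, and `codegen_single_template` are hypothetical names, and the argument lists of `codegen_template_body` and `get_unfused_epilogues` are assumptions (the PR text does not show the real signatures).

```python
from __future__ import annotations


class BaseTemplateKernel:
    """Hypothetical stand-in for SIMDKernel (not the real Inductor class)."""

    def codegen_template_body(self, fused_epilogues: list[str]) -> str:
        # The base class only declares the hook; each subclass owns its codegen.
        raise NotImplementedError

    def get_unfused_epilogues(self, epilogues: list[str]) -> list[str]:
        # Assumed default: everything can be fused, nothing needs separate codegen.
        return []


class ToyTritonKernel(BaseTemplateKernel):
    """Fuses every epilogue it is given into the template body."""

    def codegen_template_body(self, fused_epilogues: list[str]) -> str:
        lines = ["acc = matmul(a, b)"]
        lines += [f"acc = {op}(acc)" for op in fused_epilogues]
        lines.append("store(out, acc)")
        return "\n".join(lines)


class ToyExternalKernel(ToyTritonKernel):
    """Hypothetical external backend that can only fuse `relu`."""

    def get_unfused_epilogues(self, epilogues: list[str]) -> list[str]:
        return [op for op in epilogues if op != "relu"]


def codegen_single_template(
    kernel: BaseTemplateKernel, epilogues: list[str]
) -> tuple[str, list[str]]:
    """Shared orchestration: the kernel decides what to fuse and renders it."""
    unfused = kernel.get_unfused_epilogues(epilogues)
    fused = [op for op in epilogues if op not in unfused]
    return kernel.codegen_template_body(fused), unfused
```

In this sketch the orchestrator never inspects backend-specific details; it only calls the two hooks, which is the property the refactor is aiming for.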

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Differential Revision: D96526007

Restructure the template code generation pipeline to separate concerns
and enable external template backends:

- Rename SIMDKernel.codegen_template_override → codegen_template_body
  (now raises NotImplementedError; actual impl in TritonTemplateKernel)
- Add get_unfused_epilogues() for epilogues that need separate codegen
- Move epilogue/prologue codegen from _codegen_single_template into
  TritonTemplateKernel.codegen_template_body()
- Simplify _codegen_single_template to dispatch to kernel, handle
  benchmark wrapping, mark_run, and define_kernel
- Add PartialRender._replace_placeholder() for indent-aware hook
  substitution (replaces manual indent handling)
- Extract _setup_contiguous_index_state(), _make_independent_subgraph(),
  _make_codegen_hook() helpers from load_input/store_output
- Add SubgraphInfo.root_var_renames for prologue variable renaming
- Add _compute_fusion_metadata() and codegen_prologues_in_subgraphs()

[ghstack-poisoned]

pytorch-bot bot commented Mar 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177064

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 72eb29f with merge base edf1a92:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot bot commented Mar 10, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@yf225 yf225 changed the title from "Refactor template codegen pipeline for extensibility" to "[Helion + torch.compile] Refactor template codegen pipeline for extensibility" Mar 10, 2026
@yf225 yf225 added the topic: not user facing topic category label Mar 10, 2026
yf225 added a commit that referenced this pull request Mar 10, 2026
ghstack-source-id: bbaabe3
Pull Request resolved: #177064
yf225 added a commit that referenced this pull request Mar 11, 2026
ghstack-source-id: bbaabe3
Pull Request resolved: #177064
@yf225 yf225 force-pushed the gh/yf225/136/head branch from 27c83b4 to 909d1ba Compare March 11, 2026 06:00
yf225 added 2 commits March 10, 2026 23:01
self.saved_partial_accumulate: list[PartialAccumulate] = []

-    def codegen_template_override(
+    def codegen_template_body(
Contributor

Does Helion already use codegen_template_override for its integration?

Contributor Author

With this new design, I am also planning to change Helion's integration to use codegen_template_body (this will be done in Helion PR pytorch/helion#1520). All Helion + torch.compile integration tests are currently disabled, so removing the codegen_template_override extension point should be safe.

Comment on lines +221 to +223
idx = self._code.find(hook_key)
if idx < 0:
    return self._code.replace(hook_key, result)
Contributor

I'm a bit confused: why do the replacement if the key is not found? Don't you actually want to return immediately?

Contributor Author

Yes, good catch. I updated this helper function to fix the issue and also made the logic clearer.
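
For readers following this thread, an indent-aware placeholder substitution with the early return the reviewer asked about might look roughly like the sketch below. It is not the actual `PartialRender._replace_placeholder` code (the final version is not shown in this conversation); the standalone function `replace_placeholder`, its parameters, and the `<STORE_OUTPUT>` placeholder string are hypothetical, and the sketch assumes the placeholder occurs at most once.

```python
def replace_placeholder(code: str, hook_key: str, result: str) -> str:
    """Substitute hook_key with result, re-indenting result's continuation lines.

    Hypothetical helper for illustration only; assumes a single occurrence.
    """
    idx = code.find(hook_key)
    if idx < 0:
        # Placeholder not present: nothing to substitute, return code unchanged.
        return code

    # Leading whitespace of the line that contains the placeholder.
    line_start = code.rfind("\n", 0, idx) + 1
    line_prefix = code[line_start:idx]
    indent = line_prefix[: len(line_prefix) - len(line_prefix.lstrip())]

    # The first line inherits the placeholder's position; later non-empty
    # lines are prefixed with the same indentation.
    first, *rest = result.split("\n")
    rest = [indent + ln if ln else ln for ln in rest]
    replacement = "\n".join([first, *rest])

    return code[:idx] + replacement + code[idx + len(hook_key):]


template = "def kernel():\n    <STORE_OUTPUT>\n    return\n"
rendered = replace_placeholder(
    template, "<STORE_OUTPUT>", "tmp = acc + 1\ntl.store(out, tmp)"
)
print(rendered)
```

Returning early when `find` reports a miss makes the no-op case explicit, instead of relying on `str.replace` silently doing nothing.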

yf225 added a commit to yf225/pytorch that referenced this pull request Mar 11, 2026
ghstack-source-id: 4a4a996
Pull Request resolved: pytorch#177064
yf225 added 3 commits March 11, 2026 14:51
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

yf225 added a commit to yf225/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: 4a4a996
Pull Request resolved: pytorch#177064
yf225 added 2 commits March 12, 2026 11:54
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: c7959b4
Pull Request resolved: pytorch/pytorch#177064
yf225 added a commit to yf225/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: 4a4a996
Pull Request resolved: pytorch#177064
yf225 added 2 commits March 12, 2026 14:00
yf225 added a commit that referenced this pull request Mar 13, 2026
…sibility

Restructure the template code generation pipeline to separate concerns
and enable external template backends:

- Rename SIMDKernel.codegen_template_override → codegen_template_body
  (now raises NotImplementedError; actual impl in TritonTemplateKernel)
- Add get_unfused_epilogues() for epilogues that need separate codegen
- Move epilogue/prologue codegen from _codegen_single_template into
  TritonTemplateKernel.codegen_template_body()
- Simplify _codegen_single_template to dispatch to kernel, handle
  benchmark wrapping, mark_run, and define_kernel
- Add PartialRender._replace_placeholder() for indent-aware hook
  substitution (replaces manual indent handling)
- Extract _setup_contiguous_index_state(), _make_independent_subgraph(),
  _make_codegen_hook() helpers from load_input/store_output
- Add SubgraphInfo.root_var_renames for prologue variable renaming
- Add _compute_fusion_metadata() and codegen_prologues_in_subgraphs()

ghstack-source-id: 21d537d
Pull Request resolved: #177064
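
For readers skimming the commit message above, the split between shared orchestration and per-kernel source generation can be sketched roughly as follows. This is an illustration only, not the Inductor code: the method names `codegen_template_body` and `get_unfused_epilogues` come from the commit message, while `SketchKernel`, `FusingKernel`, `NonFusingKernel`, the free function `codegen_single_template`, and the generated strings are all hypothetical stand-ins.

```python
import textwrap


class SketchKernel:
    """Hypothetical base class: declares the two extension points only."""

    def codegen_template_body(self) -> str:
        # The base class does not generate source; each subclass owns that.
        raise NotImplementedError

    def get_unfused_epilogues(self) -> list[str]:
        # Default assumption: everything was fused, nothing is left over.
        return []


class FusingKernel(SketchKernel):
    """Hypothetical backend that inlines prologues and epilogues itself."""

    def __init__(self, prologues, epilogues):
        self.prologues = list(prologues)
        self.epilogues = list(epilogues)

    def codegen_template_body(self) -> str:
        return "\n".join([*self.prologues, "acc = a @ b", *self.epilogues])


class NonFusingKernel(SketchKernel):
    """Hypothetical backend that hands its epilogues back for separate codegen."""

    def __init__(self, epilogues):
        self.epilogues = list(epilogues)

    def codegen_template_body(self) -> str:
        return "acc = a @ b"

    def get_unfused_epilogues(self) -> list[str]:
        return list(self.epilogues)


def codegen_single_template(kernel: SketchKernel, name: str):
    """Shared orchestration: wrap whatever body the kernel produced."""
    body = kernel.codegen_template_body()
    source = (
        f"def {name}(a, b):\n"
        + textwrap.indent(body, "    ")
        + "\n    return acc\n"
    )
    # Epilogues the kernel could not fuse are returned to the caller.
    return source, kernel.get_unfused_epilogues()
```

In this sketch the orchestration function never inspects the kernel type; it only calls the two hooks, which is the property that would let an external backend plug in without copying backend-specific logic.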
yf225 added 3 commits March 12, 2026 21:44
@yf225
Contributor Author

yf225 commented Mar 13, 2026

@yf225 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@yf225
Contributor Author

yf225 commented Mar 14, 2026

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

yf225 added a commit to yf225/pytorch that referenced this pull request Mar 15, 2026
ghstack-source-id: 4a4a996
Pull Request resolved: pytorch#177064
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…sibility (pytorch#177064)

Restructure the template code generation pipeline so that external
template backends (e.g. Helion) can participate in epilogue/prologue
fusion without duplicating the Triton-specific codegen logic.

Key changes:

- Move epilogue/prologue codegen out of _codegen_single_template (in
  simd.py) and into TritonTemplateKernel.codegen_template_body(), so
  each kernel subclass owns its own source generation.
  _codegen_single_template now only handles the shared orchestration:
  prologue-fused input cleanup, benchmark wrapping, mark_run, and
  define_kernel.

- Rename SIMDKernel.codegen_template_override → codegen_template_body
  and add get_unfused_epilogues() hook, giving subclasses two clear
  extension points.

- Add PartialRender._replace_placeholder() for indent-aware hook
  substitution, replacing ad-hoc indent arithmetic scattered across
  load_input() / store_output() / CuteDSL unpack_buffers.

- Extract helpers (_setup_contiguous_index_state, _make_codegen_hook,
  _make_independent_subgraph, _compute_fusion_metadata,
  codegen_prologues_in_subgraphs) from the monolithic load_input /
  store_output methods so they can be reused or overridden by external
  kernel classes.

Differential Revision: [D96526007](https://our.internmc.facebook.com/intern/diff/D96526007)
Pull Request resolved: pytorch#177064
Approved by: https://github.com/jansel
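
As a rough illustration of what "indent-aware hook substitution" could mean, the sketch below replaces a placeholder line with a multi-line snippet and re-indents the snippet to the placeholder's own indentation. It is not `PartialRender._replace_placeholder()` itself, whose signature and behavior are not shown in this thread; the function name `replace_placeholder`, the `<STORE_OUTPUT>` marker, and the assumption that a placeholder occupies a whole line are all hypothetical.

```python
def replace_placeholder(code: str, placeholder: str, replacement: str) -> str:
    """Swap each line consisting solely of `placeholder` for `replacement`.

    Every non-empty line of `replacement` is prefixed with the leading
    whitespace of the placeholder line, so callers do not have to compute
    indentation by hand. Assumes the placeholder sits alone on its line.
    """
    out = []
    for line in code.split("\n"):
        stripped = line.lstrip()
        if stripped == placeholder:
            indent = line[: len(line) - len(stripped)]
            for new_line in replacement.split("\n"):
                # Leave blank lines blank rather than adding trailing spaces.
                out.append(indent + new_line if new_line else new_line)
        else:
            out.append(line)
    return "\n".join(out)


template = (
    "def kernel():\n"
    "    for i in range(4):\n"
    "        <STORE_OUTPUT>\n"
    "    return\n"
)
rendered = replace_placeholder(
    template, "<STORE_OUTPUT>", "tmp = acc + 1\nout[i] = tmp"
)
```

Centralizing the indentation step in one helper like this is the design choice the commit message points at: the code that produces the snippet can emit it flush-left and leave placement to the renderer.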
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
…sibility (pytorch#177064)

Differential Revision: [D96526007](https://our.internmc.facebook.com/intern/diff/D96526007)
Pull Request resolved: pytorch#177064
Approved by: https://github.com/jansel
Labels

ciflow/inductor
ciflow/torchtitan (Run TorchTitan integration tests)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: inductor
topic: not user facing (topic category)

4 participants