[Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion#177065

Closed
yf225 wants to merge 34 commits into gh/yf225/137/base from gh/yf225/137/head

@yf225
Contributor

@yf225 yf225 commented Mar 10, 2026

Stack from ghstack (oldest at bottom):

Add test_external_template_prologue_epilogue_fusion that exercises:

  • Prologue fusion: sigmoid(b) fused into template as <LOAD_INPUT_B>
  • Epilogue fusion: relu(...) * bias fused into template as <STORE_OUTPUT_0>
  • Extra inputs: bias is read by the epilogue but is not among the
    template's original inputs, exercising kernel._extra_inputs

Uses a _MockExternalTemplateBuffer that subclasses TemplateBuffer and
creates an ExternalTritonTemplateKernel. The _render() method calls
kernel._setup_fusion_hooks() to set up all fusion hooks in one call,
then reads kernel._prologue_source_buffers and kernel._extra_store_targets
to build the template source with the appropriate placeholders.
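
To make the placeholder mechanism concrete, here is a minimal sketch in plain Python. It is not the Inductor implementation: `render_template`, the template string, the elementwise kernel body, and the fused expressions are all hypothetical stand-ins. Only the placeholder names `<LOAD_INPUT_B>` and `<STORE_OUTPUT_0>` and the fused operations (sigmoid on `b`, relu times `bias`) come from the description above.

```python
import math

# Hypothetical template source. The kernel body is an elementwise product,
# chosen only to keep the sketch small; a real template would be Triton code.
TEMPLATE = """
def kernel(a, b, bias):
    out = []
    for i in range(len(a)):
        b_val = <LOAD_INPUT_B>
        acc = a[i] * b_val
        out.append(<STORE_OUTPUT_0>)
    return out
"""


def render_template(source, hooks):
    """Replace each placeholder with its fused code; fail on unknown placeholders."""
    for placeholder, code in hooks.items():
        if placeholder not in source:
            raise KeyError(f"placeholder {placeholder} not found in template")
        source = source.replace(placeholder, code)
    return source


# Prologue: sigmoid applied while loading b.
# Epilogue: relu(acc) * bias applied while storing; bias is an extra input
# that the unfused template body never reads.
hooks = {
    "<LOAD_INPUT_B>": "1.0 / (1.0 + math.exp(-b[i]))",
    "<STORE_OUTPUT_0>": "max(acc, 0.0) * bias[i]",
}

namespace = {"math": math}
exec(render_template(TEMPLATE, hooks), namespace)
fused_kernel = namespace["kernel"]


def reference(a, b, bias):
    """Unfused computation: sigmoid, multiply, relu, scale by bias."""
    sig = [1.0 / (1.0 + math.exp(-y)) for y in b]
    prod = [x * s for x, s in zip(a, sig)]
    return [max(p, 0.0) * z for p, z in zip(prod, bias)]
```

A test in this style would compare `fused_kernel(a, b, bias)` against `reference(a, b, bias)`, which is the same shape of check the unit test performs against eager execution.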

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

Differential Revision: D96849203

Add ExternalTritonTemplateKernel class that subclasses
TritonTemplateKernel for external template backends (e.g. Helion).
Key methods: _compute_fusion_metadata(), _setup_fusion_hooks(),
_find_eligible_epilogues(), _setup_epilogue_hook(),
_setup_prologue_hook(), call_kernel(), emit_kernel_override().

Extend TemplateBuffer base class with fields needed by external
backends: epilogue_fusable_outputs, _multi_output_children,
_named_inputs, and add realize_template_input() and
build_multi_outputs() class methods. Add MultiOutputLayout handling
to extract_read_writes(). Set epilogue_fusable_outputs in
TritonTemplateBuffer.

Scheduler changes:
- Generalize prologue fusion: check allowed_prologue_inps instead of
  isinstance(TritonTemplateBuffer)
- Add multi-output template epilogue guard requiring ComputedBuffer
- Replace can_fuse_multi_output_epilogue delegation with inline
  MultiOutput parent check
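
The first scheduler change can be read as replacing a type check with a capability check. The sketch below illustrates that idea only; the classes are hypothetical stand-ins for the Inductor IR nodes, `ExternalTemplateBuffer` and both `can_fuse_prologue_*` functions are invented for illustration, and the exact shape of the old condition is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TemplateBuffer:
    """Hypothetical stand-in; real Inductor buffers carry far more state."""
    name: str
    allowed_prologue_inps: set = field(default_factory=set)


@dataclass
class TritonTemplateBuffer(TemplateBuffer):
    pass


@dataclass
class ExternalTemplateBuffer(TemplateBuffer):
    """Hypothetical external-backend template (for example, a Helion kernel)."""


def can_fuse_prologue_old(template, input_name):
    # Type-based check (assumed shape): only the built-in Triton template qualifies.
    return (
        isinstance(template, TritonTemplateBuffer)
        and input_name in template.allowed_prologue_inps
    )


def can_fuse_prologue_new(template, input_name):
    # Capability-based check: any template that lists the input qualifies.
    return input_name in getattr(template, "allowed_prologue_inps", ())
```

With the capability-based check, an external template that declares `b` as an allowed prologue input becomes eligible without subclassing `TritonTemplateBuffer`.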

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Mar 10, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/177065

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 4 Pending, 3 Unrelated Failures

As of commit 6c88a20 with merge base d1f78bd:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot bot commented Mar 10, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

yf225 added a commit that referenced this pull request Mar 10, 2026
ghstack-source-id: 094f66e
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 10, 2026
ghstack-source-id: 3753929
Pull Request resolved: #177065
@yf225 yf225 changed the title Add ExternalTritonTemplateKernel and scheduler fusion updates [Helion + torch.compile] Add ExternalTritonTemplateKernel and scheduler fusion updates Mar 10, 2026
yf225 added a commit that referenced this pull request Mar 10, 2026
…er fusion updates

ghstack-source-id: c0d1028
Pull Request resolved: #177065
@yf225 yf225 added the topic: not user facing topic category label Mar 10, 2026
yf225 added a commit that referenced this pull request Mar 10, 2026
…er fusion updates

ghstack-source-id: c0d1028
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 11, 2026
…er fusion updates

ghstack-source-id: 87c5cfe
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 11, 2026
…er fusion updates

ghstack-source-id: c0d1028
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 11, 2026
…er fusion updates

ghstack-source-id: f139305
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 11, 2026
…er fusion updates

ghstack-source-id: 929a239
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 11, 2026
…er fusion updates

ghstack-source-id: fa84eba
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 11, 2026
…er fusion updates

ghstack-source-id: c1e6a3b
Pull Request resolved: #177065
yf225 added a commit to yf225/pytorch that referenced this pull request Mar 11, 2026
ghstack-source-id: fa84eba
Pull Request resolved: pytorch#177065
yf225 added a commit that referenced this pull request Mar 15, 2026
…teKernel fusion

Add test_external_template_prologue_epilogue_fusion that exercises:
- Prologue fusion: sigmoid(b) fused into template as <LOAD_INPUT_B>
- Epilogue fusion: relu(...) * bias fused into template as <STORE_OUTPUT_0>
- Extra inputs: bias is read by the epilogue but is not among the
  template's original inputs, exercising kernel._extra_inputs

Uses a _MockExternalTemplateBuffer that subclasses TemplateBuffer and
creates an ExternalTritonTemplateKernel, testing the full render-based
fusion pipeline with a mock triton template.

ghstack-source-id: c4d94e0
Pull Request resolved: #177065
…and scheduler fusion updates"

Differential Revision: [D96563983](https://our.internmc.facebook.com/intern/diff/D96563983)

[ghstack-poisoned]
yf225 added a commit that referenced this pull request Mar 15, 2026
…teKernel fusion

ghstack-source-id: 4b7c486
Pull Request resolved: #177065
yf225 added a commit that referenced this pull request Mar 16, 2026
…teKernel fusion

ghstack-source-id: 86661b2
Pull Request resolved: #177065
@yf225 yf225 changed the title [Helion + torch.compile] Add ExternalTritonTemplateKernel and scheduler fusion updates [Helion + torch.compile] Add unit test for ExternalTritonTemplateKernel fusion Mar 16, 2026
yf225 added a commit that referenced this pull request Mar 16, 2026
…teKernel fusion

ghstack-source-id: 1a84b60
Pull Request resolved: #177065
@yf225
Contributor Author

yf225 commented Mar 16, 2026

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@yf225
Contributor Author

yf225 commented Mar 17, 2026

@yf225 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pytorchmergebot pushed a commit that referenced this pull request Mar 17, 2026
…tput templates (#177597)

TemplateBuffer subclasses with MultiOutputLayout (e.g. Helion kernels)
don't have a single dtype. Add an explicit error in TemplateBuffer.dtype
for this case, and guard the scheduler's low-precision heuristic with
is_multi_outputs_template() so it skips the check rather than crashing.
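
The guard described above can be pictured with a small stand-alone sketch. This is not the Inductor implementation: `ToyTemplateBuffer`, `FixedLayout`, `is_low_precision`, the exception type, the string dtypes, and returning `False` when the check is skipped are all assumptions made for illustration. Only the names `MultiOutputLayout`, `dtype`, and `is_multi_outputs_template()` appear in the commit message, and where that method lives in the real code is not stated there.

```python
class MultiOutputLayout:
    """Stand-in for a layout that describes several outputs at once."""


class FixedLayout:
    """Stand-in for a single-output layout carrying one dtype."""

    def __init__(self, dtype):
        self.dtype = dtype


class ToyTemplateBuffer:
    def __init__(self, layout):
        self.layout = layout

    def is_multi_outputs_template(self):
        return isinstance(self.layout, MultiOutputLayout)

    @property
    def dtype(self):
        # Fail loudly instead of returning a misleading value.
        if self.is_multi_outputs_template():
            raise NotImplementedError(
                "multi-output template has no single dtype; "
                "query each output instead"
            )
        return self.layout.dtype


def is_low_precision(buf):
    """Toy heuristic that skips the dtype check for multi-output templates."""
    if buf.is_multi_outputs_template():
        return False  # assumed behaviour when the check is skipped
    return buf.dtype in ("float16", "bfloat16")
```

The point of the pairing is that the property fails with a clear message when misused, while the one caller known to hit this case checks first and never touches `dtype`.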

Pull Request resolved: #177597
Approved by: https://github.com/shunting314
ghstack dependencies: #177492, #177065
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…el fusion (pytorch#177065)

Pull Request resolved: pytorch#177065
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#177492
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…tput templates (pytorch#177597)

Pull Request resolved: pytorch#177597
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#177492, pytorch#177065
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
…el fusion (pytorch#177065)

Pull Request resolved: pytorch#177065
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#177492
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
…tput templates (pytorch#177597)

Pull Request resolved: pytorch#177597
Approved by: https://github.com/shunting314
ghstack dependencies: pytorch#177492, pytorch#177065

Labels

- ciflow/inductor
- ciflow/torchtitan (Run TorchTitan integration tests)
- ciflow/trunk (Trigger trunk jobs on your pull request)
- Merged
- module: inductor
- topic: not user facing (topic category)

3 participants