[Helion + torch.compile] Add prologue/epilogue fusion to ExternalTritonTemplateKernel#177492
yf225 wants to merge 3 commits into gh/yf225/143/base
…onTemplateKernel

Add fusion methods to ExternalTritonTemplateKernel:
- _compute_fusion_metadata(): determines eligible epilogues/prologues, builds epilogue specs, and computes prologue sources before render()
- _setup_fusion_hooks(): sets up epilogue/prologue render hooks and marks prologue buffers during render() (inside V.kernel context)
- _find_eligible_epilogues(): computes fusion eligibility and registers extra inputs needed by fused epilogues
- _setup_epilogue_hook(): creates independent subgraph with contiguous index state for epilogue codegen
- _setup_prologue_hook(): creates subgraph with capture store handler for prologue codegen, with variable renaming to avoid collisions

[ghstack-poisoned]
@yf225 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge -f "unrelated failures"
Merge started: your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
…el fusion (#177065)

Add test_external_template_prologue_epilogue_fusion that exercises:
- Prologue fusion: sigmoid(b) fused into template as <LOAD_INPUT_B>
- Epilogue fusion: relu(...) * bias fused into template as <STORE_OUTPUT_0>
- Extra inputs: bias is read by the epilogue but is not among the template's original inputs, exercising kernel._extra_inputs

Uses a _MockExternalTemplateBuffer that subclasses TemplateBuffer and creates an ExternalTritonTemplateKernel. The _render() method calls kernel._setup_fusion_hooks() to set up all fusion hooks in one call, then reads kernel._prologue_source_buffers and kernel._extra_store_targets to build the template source with the appropriate placeholders.

Pull Request resolved: #177065
Approved by: https://github.com/jansel
ghstack dependencies: #177492
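To make the dataflow in that test description concrete, here is a minimal pure-Python sketch comparing an unfused pipeline with a fused one. It is not the test itself: the template operation is not stated above, so an elementwise add (`template_op`) is assumed purely for illustration, and all function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    return max(x, 0.0)


def template_op(a, b):
    # ASSUMPTION: the real template's computation is not given in the PR text;
    # an elementwise add stands in for it here.
    return [x + y for x, y in zip(a, b)]


def unfused(a, b, bias):
    b_act = [sigmoid(v) for v in b]                   # separate prologue pass
    out = template_op(a, b_act)                       # template pass
    return [relu(v) * w for v, w in zip(out, bias)]   # separate epilogue pass


def fused(a, b, bias):
    result = []
    for x, y, w in zip(a, b, bias):
        y = sigmoid(y)                # role of <LOAD_INPUT_B>: prologue applied on load
        acc = x + y                   # template body (assumed elementwise add)
        result.append(relu(acc) * w)  # role of <STORE_OUTPUT_0>: epilogue applied on store;
                                      # bias is an extra input the template did not have
    return result


a = [1.0, -2.0, 0.5]
b = [0.0, 3.0, -1.0]
bias = [2.0, 2.0, 2.0]
print(fused(a, b, bias))
```

Both paths perform the same arithmetic per element; the fused path simply avoids materializing the intermediate buffers between passes.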
…tput templates (#177597)

TemplateBuffer subclasses with MultiOutputLayout (e.g. Helion kernels) don't have a single dtype. Add an explicit error in TemplateBuffer.dtype for this case, and guard the scheduler's low-precision heuristic with is_multi_outputs_template() so it skips the check rather than crashing.

Pull Request resolved: #177597
Approved by: https://github.com/shunting314
ghstack dependencies: #177492, #177065
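The guard described in that commit can be sketched with toy classes. This is only an illustration of the pattern, not Inductor code: every `Toy*` class, the `is_low_precision` helper, the string dtypes, and the choice of `NotImplementedError` are assumptions made for the example.

```python
class ToyFixedLayout:
    """Hypothetical single-output layout carrying one dtype."""

    def __init__(self, dtype: str):
        self.dtype = dtype


class ToyMultiOutputLayout:
    """Hypothetical multi-output layout: no single dtype exists."""


class ToyTemplateBuffer:
    def __init__(self, layout):
        self.layout = layout

    def is_multi_outputs_template(self) -> bool:
        return isinstance(self.layout, ToyMultiOutputLayout)

    @property
    def dtype(self) -> str:
        # Fail loudly instead of with an obscure attribute error.
        if self.is_multi_outputs_template():
            raise NotImplementedError(
                "multi-output template buffers have no single dtype"
            )
        return self.layout.dtype


def is_low_precision(buffer: ToyTemplateBuffer) -> bool:
    # Guard first: skip the dtype-based heuristic for multi-output templates.
    if buffer.is_multi_outputs_template():
        return False
    return buffer.dtype in ("float16", "bfloat16")


print(is_low_precision(ToyTemplateBuffer(ToyFixedLayout("float16"))))
print(is_low_precision(ToyTemplateBuffer(ToyMultiOutputLayout())))
```

Checking the predicate before touching `dtype` keeps the heuristic from ever reaching the error path.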
…onTemplateKernel (pytorch#177492)

Add fusion methods to ExternalTritonTemplateKernel:
- _compute_fusion_metadata(): determines eligible epilogues/prologues, builds epilogue specs, and computes prologue sources before render()
- _setup_fusion_hooks(): sets up epilogue/prologue render hooks and marks prologue buffers during render() (inside V.kernel context)
- _find_eligible_epilogues(): computes fusion eligibility and registers extra inputs needed by fused epilogues
- _setup_epilogue_hook(): creates independent subgraph with contiguous index state for epilogue codegen
- _setup_prologue_hook(): creates subgraph with capture store handler for prologue codegen, with variable renaming to avoid collisions

Differential Revision: [D96779294](https://our.internmc.facebook.com/intern/diff/D96779294)
Pull Request resolved: pytorch#177492
Approved by: https://github.com/jansel
Stack from ghstack (oldest at bottom):
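Add fusion methods to ExternalTritonTemplateKernel:
- _compute_fusion_metadata(): determines eligible epilogues/prologues, builds epilogue specs, and computes prologue sources before render()
- _setup_fusion_hooks(): sets up epilogue/prologue render hooks and marks prologue buffers during render() (inside V.kernel context)
- _find_eligible_epilogues(): computes fusion eligibility and registers extra inputs needed by fused epilogues
- _setup_epilogue_hook(): creates independent subgraph with contiguous index state for epilogue codegen
- _setup_prologue_hook(): creates subgraph with capture store handler for prologue codegen, with variable renaming to avoid collisions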
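As a rough illustration of render hooks plus variable renaming, the sketch below substitutes placeholder tokens in a template string and prefixes temporaries in the injected code so they cannot clash with names in the template body. It is a toy under stated assumptions, not the ExternalTritonTemplateKernel implementation: `rename_temporaries`, `render_with_hooks`, the `tmpN` naming scheme, and the pseudo-code strings are all hypothetical.

```python
import re


def rename_temporaries(code: str, prefix: str) -> str:
    """Prefix every tmpN name (hypothetical naming scheme) with `prefix`."""
    return re.sub(r"\btmp(\d+)\b", lambda m: f"{prefix}tmp{m.group(1)}", code)


def render_with_hooks(template_source: str, hooks: dict) -> str:
    """Replace each <PLACEHOLDER> that has a registered hook with the hook's output."""

    def substitute(match):
        hook = hooks.get(match.group(0))
        return hook() if hook is not None else match.group(0)

    return re.sub(r"<[A-Z0-9_]+>", substitute, template_source)


# Toy template body; note that it already defines tmp0.
TEMPLATE_SOURCE = (
    "tmp0 = load(a, idx)\n"
    "<LOAD_INPUT_B>\n"
    "acc = tmp0 * b_val\n"
    "<STORE_OUTPUT_0>\n"
)

# Code generated independently for the fused nodes also starts counting at tmp0.
PROLOGUE_CODE = "tmp0 = load(b, idx)\nb_val = sigmoid(tmp0)"
EPILOGUE_CODE = (
    "tmp0 = relu(acc)\n"
    "tmp1 = tmp0 * load(bias, idx)\n"
    "store(out, idx, tmp1)"
)

hooks = {
    "<LOAD_INPUT_B>": lambda: rename_temporaries(PROLOGUE_CODE, "prologue_"),
    "<STORE_OUTPUT_0>": lambda: rename_temporaries(EPILOGUE_CODE, "epilogue_"),
}

rendered = render_with_hooks(TEMPLATE_SOURCE, hooks)
print(rendered)
```

Without the renaming step, the prologue's `tmp0` would overwrite the template's own `tmp0` before `acc` is computed, which is the kind of collision the prefixing is meant to prevent in this sketch.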
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo
Differential Revision: D96779294