[Helion + torch.compile] Fix MultiOutput write deps and extend fusion score matching #177302
yf225 wants to merge 5 commits into gh/yf225/138/base
…ent fusion matching Normalize using the same policy as SchedulerNode so that the index expressions are directly comparable during fusion checks. [ghstack-poisoned]
shunting314 left a comment:
Would be good to have a test for the discovered failure
…for consistent fusion matching" Normalize using the same policy as SchedulerNode so that the index expressions are directly comparable during fusion checks. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
…tend fusion score matching" Give MultiOutput proper MemoryDep writes derived from its own FixedLayout instead of inheriting StarDep from InputsKernel. This removes the hack in FusedSchedulerNode.fuse() that copied index expressions from the template parent. Extend score_fusion_memory to use name-based dep matching for templates so that views/reshapes between template outputs and epilogues do not block fusion. [ghstack-poisoned]
… score matching Give MultiOutput proper MemoryDep writes derived from its own FixedLayout instead of inheriting StarDep from InputsKernel. This removes the hack in FusedSchedulerNode.fuse() that copied index expressions from the template parent. Extend score_fusion_memory to use name-based dep matching for templates so that views/reshapes between template outputs and epilogues do not block fusion. ghstack-source-id: 7b40701 Pull Request resolved: #177302
@pytorchbot merge -f "unrelated failures"
Merge started: Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
…ble base class (#177367)

This is a reland of #177063. Move common fields and methods from TritonTemplateBuffer up to TemplateBuffer so that external template backends (e.g. Helion) can reuse the same mutation-tracking and prologue-fusion infrastructure:

- Add mutated_inputs, allowed_prologue_inps params to TemplateBuffer.__init__
- Build mutation_outputs list in base class (parallel to ExternKernel.mutation_outputs)
- Move get_allowed_prologue_inps() to base class
- Extract _read_deps_from_inputs() helper from extract_read_writes()
- Remove can_fuse_multi_output_epilogue() (always returned False, unused)
- Simplify TritonTemplateBuffer.__init__() to delegate to super()

get_outputs() stays on TritonTemplateBuffer since it is the only subclass that currently passes mutated_inputs; other subclasses (CppTemplateBuffer, CuteDSLTemplateBuffer, etc.) manage their own output lists independently.

Pull Request resolved: #177367
Approved by: https://github.com/shunting314
ghstack dependencies: #177302
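For readers unfamiliar with the refactor described in the #177367 commit message, here is a minimal sketch of the general shape of such a change. The `Toy*` classes, the string-based mutation records, and the constructor signatures are all hypothetical stand-ins invented for illustration; they are not Inductor's actual `TemplateBuffer` or `TritonTemplateBuffer`, whose real fields and signatures may differ.

```python
class ToyTemplateBuffer:
    """Hypothetical, simplified base class (NOT Inductor's TemplateBuffer)."""

    def __init__(self, name, inputs, mutated_inputs=None, allowed_prologue_inps=None):
        self.name = name
        self.inputs = list(inputs)
        # Shared prologue-fusion state now lives in the base class.
        self.allowed_prologue_inps = set(allowed_prologue_inps or ())
        # One mutation record per mutated input, built once in the base class.
        # Real code would create IR nodes; plain strings are used here.
        self.mutation_outputs = [
            f"{name}_mutates_{inp}" for inp in (mutated_inputs or ())
        ]

    def get_allowed_prologue_inps(self):
        return self.allowed_prologue_inps


class ToyTritonTemplateBuffer(ToyTemplateBuffer):
    """Subclass whose constructor only delegates to the base class."""

    def __init__(self, name, inputs, mutated_inputs=None, allowed_prologue_inps=None):
        super().__init__(name, inputs, mutated_inputs, allowed_prologue_inps)

    def get_outputs(self):
        # Stays on this subclass: the buffer itself plus its mutation records.
        return [self.name, *self.mutation_outputs]


class ToyHelionTemplateBuffer(ToyTemplateBuffer):
    """Hypothetical external backend that reuses the base-class infrastructure."""


triton_buf = ToyTritonTemplateBuffer(
    "buf0", ["arg0", "arg1"], mutated_inputs=["arg1"], allowed_prologue_inps=["arg0"]
)
helion_buf = ToyHelionTemplateBuffer("buf1", ["arg2"], allowed_prologue_inps=["arg2"])
```

In this sketch the second backend gets mutation tracking and prologue-input bookkeeping without duplicating any constructor logic, which is the reuse the commit message describes.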
… score matching Give MultiOutput proper MemoryDep writes derived from its own FixedLayout instead of inheriting StarDep from InputsKernel. This removes the hack in FusedSchedulerNode.fuse() that copied index expressions from the template parent. Extend score_fusion_memory to use name-based dep matching for templates so that views/reshapes between template outputs and epilogues do not block fusion. ghstack-source-id: 4478c51 Pull Request resolved: pytorch#177302
… score matching (pytorch#177302) Give MultiOutput proper MemoryDep writes derived from its own FixedLayout instead of inheriting StarDep from InputsKernel. This removes the hack in FusedSchedulerNode.fuse() that copied index expressions from the template parent. Extend score_fusion_memory to use name-based dep matching for templates so that views/reshapes between template outputs and epilogues do not block fusion. This PR also now contains the changes from pytorch#177359, with proper fixes to avoid breaking internal tests. Pull Request resolved: pytorch#177302 Approved by: https://github.com/shunting314, https://github.com/jansel
Stack from ghstack (oldest at bottom):
Give MultiOutput proper MemoryDep writes derived from its own FixedLayout
instead of inheriting StarDep from InputsKernel. This removes the hack
in FusedSchedulerNode.fuse() that copied index expressions from the
template parent. Extend score_fusion_memory to use name-based dep
matching for templates so that views/reshapes between template outputs
and epilogues do not block fusion.
This PR also now contains the changes from #177359, with proper fixes to avoid breaking internal tests.
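To illustrate why name-based matching matters here, the following is a toy sketch rather than the actual `score_fusion_memory` implementation. `ToyMemoryDep`, `score_exact`, `score_by_name`, the string index expressions, and the byte counts are all assumptions made up for this example. It shows how a view between a template output and its epilogue can change the index expression, so that an exact (name, index) comparison finds nothing in common while a name-only comparison still sees the shared buffer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMemoryDep:
    """Toy stand-in for a memory dependency (NOT Inductor's MemoryDep)."""

    name: str    # buffer name
    index: str   # index expression, kept as a plain string in this sketch
    nbytes: int  # hypothetical number of bytes touched


def score_exact(writes, reads):
    """Count bytes only for deps that agree on both buffer name and index."""
    return sum(dep.nbytes for dep in writes & reads)


def score_by_name(writes, reads):
    """Count bytes for every read whose buffer name is also written."""
    written_names = {dep.name for dep in writes}
    return sum(dep.nbytes for dep in reads if dep.name in written_names)


# The template output writes buf1 with a 2-D index; the epilogue reads the
# same buffer through a flattening view, so its index expression differs.
template_writes = {ToyMemoryDep("buf1", "64*c0 + c1", 4096)}
epilogue_reads = {
    ToyMemoryDep("buf1", "c0", 4096),
    ToyMemoryDep("buf2", "c0", 128),
}

exact_score = score_exact(template_writes, epilogue_reads)
name_score = score_by_name(template_writes, epilogue_reads)
```

Under these assumptions `exact_score` is 0 and `name_score` is 4096, so only the name-based score reports the shared memory traffic on `buf1`.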
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo