
[Inductor] fix performance regression caused by #173662 #176772

Closed
AmesingFlank wants to merge 9 commits into gh/AmesingFlank/6/base from gh/AmesingFlank/6/head

Conversation


@AmesingFlank AmesingFlank commented Mar 7, 2026

Stack from ghstack (oldest at bottom):

Tested by running

python benchmarks/dynamo/pr_time_benchmarks/benchmarks/mm_loop.py a d

before this PR:

collecting compile time instruction count for mm_loop_inductor_gpu
W0307 01:06:14.007000 19968 /home/dev/pytorch/torch/_inductor/utils.py:1720] [0/0] Not enough SMs to use max_autotune_gemm mode
compile time instruction count for iteration 0 is 18639759705
compile time instruction count for iteration 1 is 4313416006
compile time instruction count for iteration 2 is 4306442862
compile time instruction count for iteration 3 is 4312169006
compile time instruction count for iteration 4 is 4311841549
collecting compile time instruction count for mm_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 8210716786
compile time instruction count for iteration 1 is 8000993645
compile time instruction count for iteration 2 is 7997733533
compile time instruction count for iteration 3 is 7993982380
compile time instruction count for iteration 4 is 7994009181

with this PR:

collecting compile time instruction count for mm_loop_inductor_gpu
W0307 01:01:10.094000 14988 /home/dev/pytorch/torch/_inductor/utils.py:1720] [0/0] Not enough SMs to use max_autotune_gemm mode
compile time instruction count for iteration 0 is 18228833593
compile time instruction count for iteration 1 is 58028492104
compile time instruction count for iteration 2 is 3907665800
compile time instruction count for iteration 3 is 3903875384
compile time instruction count for iteration 4 is 3904861924
collecting compile time instruction count for mm_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 7664300088
compile time instruction count for iteration 1 is 7598450061
compile time instruction count for iteration 2 is 7596564537
compile time instruction count for iteration 3 is 7590008847
compile time instruction count for iteration 4 is 7589530612
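As a quick sanity check on the logs above, a short script comparing the mean steady-state instruction counts before and after this PR (iteration 0 is warm-up in both runs, and the 58-billion reading on iteration 1 of the "after" static run is treated as an outlier; both are excluded):

```python
# Compare steady-state compile-time instruction counts before/after this PR.
# All numbers are copied verbatim from the benchmark logs above.
before_static = [4313416006, 4306442862, 4312169006, 4311841549]
after_static = [3907665800, 3903875384, 3904861924]  # iteration-1 outlier dropped
before_dynamic = [8000993645, 7997733533, 7993982380, 7994009181]
after_dynamic = [7598450061, 7596564537, 7590008847, 7589530612]

def mean(xs):
    return sum(xs) / len(xs)

# Fractional reduction in mean compile-time instruction count.
static_reduction = 1 - mean(after_static) / mean(before_static)
dynamic_reduction = 1 - mean(after_dynamic) / mean(before_dynamic)
print(f"static:  {static_reduction:.1%} fewer instructions")   # ~9.4%
print(f"dynamic: {dynamic_reduction:.1%} fewer instructions")  # ~5.0%
```

So the PR recovers roughly 9% of compile-time instructions on the static benchmark and 5% on the dynamic one.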

Also, user triton kernel fusion isn't affected:

python -m pytest test/inductor/test_triton_kernels.py -k TestUserKernelEpilogueFusion

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela

[ghstack-poisoned]
AmesingFlank added a commit that referenced this pull request Mar 7, 2026

pytorch-bot bot commented Mar 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176772

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 5aecd88 with merge base 9774102:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot bot commented Mar 7, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

AmesingFlank added a commit that referenced this pull request Mar 7, 2026
AmesingFlank added a commit that referenced this pull request Mar 7, 2026
AmesingFlank added a commit that referenced this pull request Mar 7, 2026
AmesingFlank added a commit that referenced this pull request Mar 7, 2026
AmesingFlank added a commit that referenced this pull request Mar 7, 2026
AmesingFlank added a commit that referenced this pull request Mar 8, 2026
AmesingFlank added a commit that referenced this pull request Mar 8, 2026

@laithsakka laithsakka left a comment


i land as as

@laithsakka
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 8, 2026
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-14)

Details for Dev Infra team · Raised by workflow job

AmesingFlank added a commit that referenced this pull request Mar 8, 2026
AmesingFlank added a commit that referenced this pull request Mar 8, 2026
@AmesingFlank
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: inductor / inductor-test / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team · Raised by workflow job

AmesingFlank added a commit that referenced this pull request Mar 8, 2026
AmesingFlank added a commit that referenced this pull request Mar 8, 2026
@AmesingFlank
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


AmesingFlank added a commit that referenced this pull request Mar 8, 2026
AmesingFlank added a commit that referenced this pull request Mar 8, 2026
@AmesingFlank
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


v0i0 added a commit to pytorch/helion that referenced this pull request Mar 11, 2026
The inductor NameError flake (buf1 not defined) that prompted the pin
has been fixed upstream by multiple PyTorch PRs:
- pytorch/pytorch#176772 (fix is_unfusable scheduler logic)
- pytorch/pytorch#176832 (guard get_read_writes behind config flag)
- pytorch/pytorch#177062 (fix MultiOutput write deps)

The root cause was pytorch/pytorch#173662 which overrode
UserDefinedTritonKernel.get_read_writes() even when the epilogue
fusion feature was disabled, breaking buffer scheduling for
TritonTemplateBuffer subclasses like Helion's HelionTemplateBuffer.

Verified the previously-failing test passes against current PyTorch main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
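The guarding pattern described in the commit message above (only override `get_read_writes()` when the epilogue-fusion feature is enabled, so subclasses that depend on the default buffer scheduling are unaffected) can be sketched as follows. All class names, the config flag, and the return values here are illustrative stand-ins, not the actual PyTorch identifiers:

```python
# Hedged sketch of the fix described above. The real PyTorch code operates on
# dependency objects from torch._inductor; strings stand in for those here.

class Config:
    # Hypothetical feature flag for user-defined Triton kernel epilogue fusion.
    enable_user_triton_epilogue_fusion = False

class TemplateBuffer:
    def get_read_writes(self):
        # Stand-in for the default read/write dependency analysis that buffer
        # scheduling relies on (e.g. for Helion's template buffer subclasses).
        return "default read/write deps"

class UserDefinedTritonKernel(TemplateBuffer):
    def get_read_writes(self):
        if not Config.enable_user_triton_epilogue_fusion:
            # Feature disabled: fall back to base behavior instead of the
            # override that broke buffer scheduling (the "buf1 not defined"
            # NameError mentioned above).
            return super().get_read_writes()
        return "epilogue-fusion-aware deps"
```

With the flag off, the subclass behaves exactly like its base class; only opting into the feature activates the specialized dependency computation.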
v0i0 added a commit to pytorch/helion that referenced this pull request Mar 12, 2026
v0i0 added a commit to pytorch/helion that referenced this pull request Mar 12, 2026
3 participants