
[Inductor] Set the default value of min_chunk_size to 512#150762

Closed
jiayisunx wants to merge 12 commits intogh/jiayisunx/63/basefrom
gh/jiayisunx/63/head

Conversation

@jiayisunx (Collaborator) commented Apr 7, 2025

Stack from ghstack (oldest at bottom):

Change the default value of min_chunk_size from 4096 to 512 so that more for-loops can be parallelized.
I tested the Inductor benchmarks with this PR on CPU and saw a ~10% improvement in the torchbench geomean speedup, with no change in huggingface/timm_models. About 15 torchbench models improved to varying degrees; functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy each improved by more than 50%.
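The effect of the threshold can be illustrated with a minimal sketch (not Inductor's actual implementation; `should_parallelize` and its signature are assumptions for illustration): a loop is only split across threads when each thread would receive at least min_chunk_size iterations, so lowering the default lets smaller loops qualify.

```python
# Hypothetical sketch of a min_chunk_size-style parallelization gate
# (illustrative only; Inductor's real heuristic differs in detail).
def should_parallelize(num_iters: int, num_threads: int, min_chunk_size: int) -> bool:
    # Parallelize only if every thread gets at least min_chunk_size iterations.
    # With min_chunk_size = 4096 and 16 threads, a loop needs >= 65536
    # iterations; lowering the threshold to 512 drops that bar to 8192.
    return num_iters // num_threads >= min_chunk_size

assert not should_parallelize(10000, 16, 4096)  # stays serial under the old default
assert should_parallelize(10000, 16, 512)       # parallelized under the new default
```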

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @mlazos


pytorch-bot bot commented Apr 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150762

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 81030e5 with merge base 70b4a88:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jiayisunx added a commit that referenced this pull request Apr 7, 2025
@jiayisunx jiayisunx marked this pull request as draft April 7, 2025 09:32
jiayisunx added a commit that referenced this pull request Apr 16, 2025
jiayisunx added a commit that referenced this pull request May 27, 2025
jiayisunx added a commit that referenced this pull request Jun 10, 2025
jiayisunx added a commit that referenced this pull request Jun 18, 2025
jiayisunx added a commit that referenced this pull request Jun 19, 2025
@jiayisunx jiayisunx marked this pull request as ready for review July 1, 2025 01:21
@leslie-fang-intel leslie-fang-intel requested a review from CaoE July 1, 2025 06:48
f"{local_buf_dtype}* {local_buffer_name} = buf_{local_buffer_name}.get();"
)
gen_loop_nest(loop_nest)
worksharing.close()
Collaborator

May I know for which case we need this line of change?

@jiayisunx (Collaborator, Author) Jul 1, 2025

Many UTs in test/inductor/test_cpu_select_algorithm.py failed after parallelization: without this line, worksharing was not closed and the generated code was missing a closing }.
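The brace mismatch can be illustrated with a minimal sketch of a code-emitting worksharing helper (hypothetical; not the actual Inductor WorkSharing class): the open emits `{`, the close emits `}`, so skipping close() leaves the generated C++ unbalanced.

```python
# Hypothetical sketch (illustrative only) of why a missing close() call
# drops a closing brace from the emitted C++ code.
class WorkSharing:
    def __init__(self):
        self.lines = []
        self._open = False

    def parallel(self):
        # Opening a worksharing region emits the pragma and an opening brace.
        self.lines.append("#pragma omp parallel")
        self.lines.append("{")
        self._open = True

    def close(self):
        # Closing the region emits the matching closing brace.
        if self._open:
            self.lines.append("}")
            self._open = False

ws = WorkSharing()
ws.parallel()
ws.lines.append("    // kernel body")
ws.close()  # without this call, the emitted code is missing a "}"
code = "\n".join(ws.lines)
assert code.count("{") == code.count("}")
```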

Collaborator

Would like to understand more about the difference. Will these UTs also fail if we keep the chunk size but increase the loop size?

@leslie-fang-intel (Collaborator) left a comment

Thanks for the PR. Please elaborate on why the worksharing.close() change is needed.

jiayisunx added a commit that referenced this pull request Jul 9, 2025
jiayisunx added a commit that referenced this pull request Jul 9, 2025
jiayisunx added a commit that referenced this pull request Jul 9, 2025
@jiayisunx jiayisunx added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 11, 2025
@jiayisunx jiayisunx requested a review from jansel July 11, 2025 07:22
@jiayisunx (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@huydhn (Contributor) commented Jul 14, 2025

@pytorchbot revert -m 'Sorry for reverting your change, but an inductor compilation error shows up in trunk' -c nosignal

https://github.com/pytorch/pytorch/actions/runs/16255653984/job/45891797651#step:25:5755

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jul 14, 2025
…50762)"

This reverts commit 3321acc.

Reverted #150762 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but an inductor compilation error shows up in trunk ([comment](#150762 (comment)))
@pytorchmergebot (Collaborator)

@jiayisunx your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 14, 2025
jiayisunx added a commit that referenced this pull request Jul 16, 2025
Pull Request resolved: #150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack-source-id: b524b9e
jiayisunx added a commit that referenced this pull request Jul 21, 2025
Pull Request resolved: #150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack-source-id: adf05f5
@jiayisunx (Collaborator, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


@github-actions github-actions bot deleted the gh/jiayisunx/63/head branch August 21, 2025 02:14

6 participants