Solve for tilings by eellison · Pull Request #153748 · pytorch/pytorch

eellison · 2025-05-16T18:46:13Z

Stack from ghstack (oldest at bottom):

Find variables that coalesce the reads and writes, and score them by the total size of the memory they coalesce. If uncoalesced memory expressions are found, look for additional tilings of variables that would coalesce those memory accesses.

For instance, for the expression `(32*p0) // 2048`, tiling p0 by 64 makes the expression coalesced.
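
The arithmetic behind this example can be checked with a small sketch. It is only an illustration and not the code in `torch/_inductor/tiling_utils.py`; the helper name `tiling_factor` is hypothetical. Splitting `p0` into `64*outer + inner` with `0 <= inner < 64` gives `32*p0 = 2048*outer + 32*inner`, and since `32*inner < 2048`, the floor division collapses to `outer`:

```python
from math import gcd


def index(p0: int) -> int:
    # The index expression from the PR description.
    return (32 * p0) // 2048


def tiling_factor(coeff: int, divisor: int) -> int:
    # Hypothetical helper: smallest step in p0 for which coeff * step is a
    # multiple of divisor, so (coeff * p0) // divisor advances by a whole
    # number per tile.
    return divisor // gcd(coeff, divisor)


factor = tiling_factor(32, 2048)

# With p0 = factor * outer + inner, the expression depends only on `outer`.
collapses = all(
    index(factor * outer + inner) == outer
    for outer in range(8)
    for inner in range(factor)
)
print(factor, collapses)
```

After the split, the expression is the outer variable itself, so consecutive values of that variable touch consecutive indices.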

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

[ghstack-poisoned]

pytorch-bot · 2025-05-16T18:46:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153748


✅ You can merge normally! (1 Unrelated Failure)

As of commit ca2b747 with merge base 9258cfc ():

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge) (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


torch/_inductor/tiling_utils.py

test/inductor/test_loop_ordering.py


eellison · 2025-06-03T14:29:37Z

@pytorchbot merge -i

pytorchmergebot · 2025-06-03T14:31:44Z

Merge started

Your change will be merged while ignoring the following 1 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, linux.2xlarge)


This PR uses the coalescing information when generating a tiling. Under the previous heuristic, each dependency generated a tiling; we then summed the score for each generated tiling, preferring any 2D tiling over the default. The new heuristic scores each tiling by its global coalesced memory. This gives both a potentially better tiling (especially for more complicated 3D patterns) and information we can use when generating block sizes.

In the Triton heuristics, when generating 3D tiled reductions, we take the same total block size that the 2D reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in #149982, which is a 32-element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward per linear layer on a contiguous tensor, and once in the backward on a transposed tensor. While the contiguous kernel has coalesced accesses and is performant on main, the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro. Now, with this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: #153751
Approved by: https://github.com/jansel
ghstack dependencies: #153723, #153730, #153748
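
As a rough illustration of scoring a tiling by its global coalesced memory, here is a toy model. It makes several assumptions: the `Access` record, the variable names `x` and `y`, and the rule that a tiling is credited with every coalescing variable it tiles are all hypothetical, and none of this is the Inductor implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Access:
    # Hypothetical record of one read or write in a kernel.
    name: str
    numel: int                    # number of elements touched
    coalesced_var: Optional[str]  # variable along which it is contiguous, if any


def score_by_var(accesses):
    # Total elements coalesced by each iteration variable.
    scores = {}
    for acc in accesses:
        if acc.coalesced_var is not None:
            scores[acc.coalesced_var] = scores.get(acc.coalesced_var, 0) + acc.numel
    return scores


def score_tiling(tiling, scores):
    # Assumption: a tiling is credited with the memory coalesced by every
    # variable that it gives its own block dimension.
    return sum(scores.get(var, 0) for var in tiling)


accesses = [
    Access("load_a", 1 << 20, "x"),
    Access("load_b", 1 << 20, "y"),
    Access("store_out", 1 << 20, "x"),
    Access("load_c", 1 << 10, None),  # not coalesced by any variable
]
scores = score_by_var(accesses)
candidates = [("x",), ("y",), ("x", "y")]
best = max(candidates, key=lambda tiling: score_tiling(tiling, scores))
print(scores, best)
```

In this toy model the 2D tiling wins because it covers more coalesced memory in total, not because 2D tilings are preferred by default. The per-variable scores are also the kind of information that could guide how a fixed total block size is split across dimensions.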

Find variables that coalesce the reads and writes, and score them by the total size of the memory they coalesce. If uncoalesced memory expressions are found, look for additional tilings of variables that would coalesce those memory accesses. For instance, for the expression `(32*p0) // 2048`, tiling p0 by 64 makes the expression coalesced.

Pull Request resolved: pytorch#153748
Approved by: https://github.com/jansel
ghstack dependencies: pytorch#153723, pytorch#153730

Update

1f8a6a8

[ghstack-poisoned]

eellison mentioned this pull request May 16, 2025

[Tiling rewrite pt1] Normalize reads and writes to common iter space #153723

Closed

eellison mentioned this pull request May 16, 2025

Analyze coalesced mem #153730

Closed

pytorch-bot bot added ciflow/inductor module: inductor labels May 16, 2025

eellison added the topic: not user facing topic category label May 16, 2025

Update

6a53ed5

[ghstack-poisoned]

eellison requested a review from jansel May 16, 2025 18:53

eellison mentioned this pull request May 16, 2025

Incorporate coalesce analysis in codegen #153751

Closed

Skylion007 reviewed May 19, 2025

torch/_inductor/tiling_utils.py (resolved)

eellison mentioned this pull request May 21, 2025

test #154005

Closed

etaf reviewed May 22, 2025

test/inductor/test_loop_ordering.py (outdated, resolved)

jansel approved these changes May 22, 2025

eellison mentioned this pull request May 22, 2025

debug diff #154163

Closed

Update

1ad42b3

[ghstack-poisoned]

This was referenced May 27, 2025

test #154429

Closed

simplify modularindexing #154523

Closed

antoher test #154535

Closed

Update

45ee02e

[ghstack-poisoned]

eellison mentioned this pull request May 30, 2025

Turn on new tiling by default #154768

Closed

Update

5caf0a5

[ghstack-poisoned]

eellison added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 2, 2025

Update

ca2b747

[ghstack-poisoned]

pytorchmergebot added the merging label Jun 3, 2025

pytorchmergebot added the Merged label Jun 3, 2025

pytorchmergebot closed this in 2608927 Jun 3, 2025

pytorchmergebot removed the merging label Jun 3, 2025

github-actions bot deleted the gh/eellison/792/head branch July 4, 2025 02:22

Solve for tilings #153748
eellison wants to merge 9 commits into gh/eellison/792/base from gh/eellison/792/head
5 participants