[Tiling rewrite pt1] Normalize reads and writes to common iter space by eellison · Pull Request #153723 · pytorch/pytorch

eellison · 2025-05-16T13:18:10Z

Stack from ghstack (oldest at bottom):

To pick the globally best tiling, we need to normalize every node's reads and writes to a common iteration space. This first PR finds a common split among the nodes in a fused scheduler node, then normalizes the reads and writes to that split.
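As a rough illustration of what a "common split" could mean, here is a minimal pure-Python sketch. It is not the code in `torch/_inductor/tiling_utils.py`; the helper names `suffix_products`, `common_split`, and `var_mapping` are hypothetical, and iteration spaces are modeled as plain tuples of sizes rather than sympy expressions. It assumes each node iterates over the same number of elements, with dimension boundaries that nest.

```python
from math import prod


def suffix_products(sizes):
    """Cumulative products taken from the innermost dimension outward."""
    out, acc = [], 1
    for size in reversed(sizes):
        acc *= size
        out.append(acc)
    return out


def common_split(*splits):
    """Coarsest split that refines every input split, or None if there is none."""
    total = prod(splits[0])
    if any(prod(s) != total for s in splits):
        return None  # the nodes do not cover the same number of elements
    bounds = sorted({b for s in splits for b in suffix_products(s) if b > 1})
    sizes, prev = [], 1
    for bound in bounds:
        if bound % prev != 0:
            return None  # boundaries do not nest, so no common split exists
        sizes.append(bound // prev)
        prev = bound
    return tuple(reversed(sizes))


def var_mapping(node_split, common):
    """Rewrite each node variable as (coefficient, common_var_index) terms."""
    mapping, j = [], 0
    for size in node_split:
        group, covered = [], 1
        while covered < size:  # collect the common dims that make up this dim
            covered *= common[j]
            group.append(j)
            j += 1
        terms, stride = [], 1
        for k in reversed(group):  # innermost common dim has coefficient 1
            terms.append((stride, k))
            stride *= common[k]
        mapping.append(terms[::-1])
    return mapping


print(common_split((24,), (4, 6)))      # (4, 6)
print(common_split((2, 12), (8, 3)))    # (2, 4, 3)
print(var_mapping((24,), (4, 6)))       # [[(6, 0), (1, 1)]], i.e. n0 = 6*c0 + c1
```

In this toy model, a node that iterates over a flat range of 24 and a node that iterates over `(4, 6)` both end up indexed by the same two variables, which is what lets their memory accesses be compared directly.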

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot · 2025-05-16T13:18:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153723

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit e4f4618 with merge base 9258cfc:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, lf.linux.2xlarge) (gh) (trunk failure)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

torch/_inductor/tiling_utils.py

ghstack-source-id: 7375ab5 Pull Request resolved: #153723

etaf · 2025-05-17T00:10:24Z

test/inductor/test_loop_ordering.py

+            def foo(x, y):
+                return x + y
+
+            foo(torch.rand([4, 4], device="cuda"), torch.rand([4, 4], device="cuda").T)


Hi, may I suggest marking these cases with requires_cuda, or replacing the hardcoded "cuda" with GPU_TYPE? These new test cases will also run on XPU, where they fail with "cuda". Thanks.

jansel

Test failures?

torch/_inductor/codegen/simd.py

pytorchmergebot · 2025-06-02T21:43:12Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


pytorchmergebot · 2025-06-02T21:46:20Z

Starting merge as part of PR stack under #153730

pytorchmergebot · 2025-06-02T23:02:08Z

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.


eellison · 2025-06-03T13:57:10Z

@pytorchbot merge

pytorchmergebot · 2025-06-03T13:58:58Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


Analyze memory expressions to see if they contain a coalescing symbol. Pull Request resolved: #153730 Approved by: https://github.com/jansel ghstack dependencies: #153723
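To make "coalescing symbol" concrete, here is a small sketch under simplifying assumptions. The actual analysis works on Inductor's symbolic index expressions; this version models an index as a plain Python callable and treats a variable as coalescing when stepping it from 0 to 1 moves the address by exactly one element. The function name `find_coalesced_var` and the probing strategy are assumptions made for illustration only.

```python
def find_coalesced_var(index, var_ranges):
    """Return the variable whose unit step moves the address by 1, if any.

    index      -- callable taking the iteration variables as keyword arguments
    var_ranges -- dict mapping variable name to its extent
    """
    base = {name: 0 for name in var_ranges}
    start = index(**base)
    for name, extent in var_ranges.items():
        if extent < 2:
            continue  # a size-1 variable cannot coalesce anything
        probe = dict(base, **{name: 1})
        if index(**probe) - start == 1:
            return name
    return None


ranges = {"x": 64, "y": 64}

# Contiguous access: y is the fastest-moving dimension.
print(find_coalesced_var(lambda x, y: 64 * x + y, ranges))       # y

# Transposed access: x is now the stride-1 variable.
print(find_coalesced_var(lambda x, y: x + 64 * y, ranges))       # x

# No variable has unit stride, so nothing coalesces.
print(find_coalesced_var(lambda x, y: 2 * x + 128 * y, ranges))  # None
```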

Find the variables that coalesce the reads and writes, and score the total size they cover. If uncoalesced memory expressions are found, look for an additional tiling of variables that would coalesce those accesses. For example, in the expression `(32*p0) // 2048`, tiling p0 by 64 makes the expression coalesced. Pull Request resolved: #153748 Approved by: https://github.com/jansel ghstack dependencies: #153723, #153730
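The `(32*p0) // 2048` example can be checked by hand: writing `p0 = 64*p0_outer + p0_inner` with `0 <= p0_inner < 64` gives `(2048*p0_outer + 32*p0_inner) // 2048`, and since `32*p0_inner < 2048` the result is just `p0_outer`. The short script below verifies that arithmetic; the extent of `p0` and the names `p0_outer` and `p0_inner` are assumptions chosen for the check.

```python
TILE = 64          # tiling factor quoted in the text
P0_EXTENT = 4096   # assumed range of p0, chosen only for this check


def address(p0):
    return (32 * p0) // 2048


def address_tiled(p0_outer, p0_inner):
    # Same expression after substituting p0 = TILE * p0_outer + p0_inner.
    return (32 * (TILE * p0_outer + p0_inner)) // 2048


# The inner variable never changes the address; the outer one has stride 1.
for p0_outer in range(P0_EXTENT // TILE):
    for p0_inner in range(TILE):
        assert address_tiled(p0_outer, p0_inner) == p0_outer
        assert address(TILE * p0_outer + p0_inner) == p0_outer
```

Before the split, 64 consecutive values of `p0` map to the same address, so `p0` itself is not a unit-stride variable; after the split, `p0_outer` is.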

This PR uses the coalescing information when generating a tiling. In the previous heuristic, each dependency generated a candidate tiling; we then summed the score for each generated tiling, preferring any 2D tiling over the default. The new heuristic scores each tiling by the total memory it coalesces across all reads and writes. This gives both a potentially better tiling (especially for more complicated, 3D patterns) and information we can use when generating block sizes. In the Triton heuristics, when generating 3D tiled reductions, we take the same total block size that the 2D reduction would use, then distribute the block according to whichever block coalesces the most memory.

The motivating kernel is in #149982, a 32-element reduction. A smaller version of it is [here](https://gist.github.com/eellison/0fa9396f5479eb4dba09756e3bf6ff2a). We need to run this kernel once in the forward pass per linear layer on a contiguous tensor, and once in the backward pass on a transposed tensor. The contiguous kernel has coalesced accesses and performs well on main, but the transposed version accesses uncoalesced memory on main and is ~2.8x slower. See this [full log](https://gist.github.com/eellison/fa644bfd9d0ae11dadb62e17a5d48a83) from the above repro.

With this PR, it is only ~1.15x slower. See the [updated log](https://gist.github.com/eellison/0b2b653309494d28cf7b48929a022075).

Pull Request resolved: #153751 Approved by: https://github.com/jansel ghstack dependencies: #153723, #153730, #153748
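One way to picture "scoring by coalesced memory" is the toy model below. It is only a sketch: the names `score_coalescing_vars` and `pick_tiling_var` are hypothetical, each memory access is reduced to a pair of (coalescing variable, bytes moved), and the real heuristic in Inductor considers more than this single sum.

```python
from collections import Counter


def score_coalescing_vars(accesses):
    """Sum the bytes each variable would coalesce across all reads and writes."""
    scores = Counter()
    for var, nbytes in accesses:
        if var is not None:  # None marks an access no variable coalesces
            scores[var] += nbytes
    return scores


def pick_tiling_var(accesses):
    """Choose the variable that coalesces the most memory overall."""
    scores = score_coalescing_vars(accesses)
    return max(scores, key=scores.get) if scores else None


# Hypothetical fused kernel: one large transposed read, two smaller accesses.
accesses = [
    ("x", 4096 * 4),  # transposed read, coalesced along x
    ("y", 1024 * 4),  # contiguous read, coalesced along y
    ("y", 1024 * 4),  # write, coalesced along y
    (None, 512 * 4),  # access that no variable coalesces
]

print(score_coalescing_vars(accesses)["x"])  # 16384
print(score_coalescing_vars(accesses)["y"])  # 8192
print(pick_tiling_var(accesses))             # x
```

The point of the global sum is that the decision depends on how much memory each choice coalesces in total, not on how many individual dependencies vote for it: here two accesses favor `y`, yet `x` wins because it covers more bytes.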


Update

5075d43

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: dynamo module: inductor labels May 16, 2025

Skylion007 reviewed May 16, 2025

torch/_inductor/tiling_utils.py (outdated, resolved)

torch/_inductor/tiling_utils.py (outdated, resolved)

Update

1b7960a

[ghstack-poisoned]

eellison added a commit that referenced this pull request May 16, 2025

[Tiling rewrite pt1] Normalize reads and writes to common iter space

e4f9af7

ghstack-source-id: 7375ab5 Pull Request resolved: #153723

eellison added the topic: not user facing topic category label May 16, 2025

eellison requested a review from jansel May 16, 2025 14:05

Update

5f7044b

[ghstack-poisoned]

This was referenced May 16, 2025

Analyze coalesced mem #153730

Closed

Solve for tilings #153748

Closed

Incorporate coalesce analysis in codegen #153751

Closed

etaf reviewed May 17, 2025

jansel requested changes May 19, 2025

eellison requested a review from jansel May 21, 2025 01:14

eellison mentioned this pull request May 21, 2025

test #154005

Closed

jansel approved these changes May 21, 2025

torch/_inductor/codegen/simd.py (outdated, resolved)

eellison mentioned this pull request May 22, 2025

debug diff #154163

Closed

Update

68c4177

[ghstack-poisoned]

This was referenced May 27, 2025

test #154429

Closed

simplify modularindexing #154523

Closed

antoher test #154535

Closed

Update

2eaf1f5

[ghstack-poisoned]

eellison mentioned this pull request May 30, 2025

Turn on new tiling by default #154768

Closed

Update

7b2f15d

[ghstack-poisoned]

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 2, 2025

pytorchmergebot added the merging label Jun 2, 2025

Update

e4f4618

[ghstack-poisoned]

pytorchmergebot removed the merging label Jun 2, 2025

pytorchmergebot added the merging label Jun 3, 2025

pytorchmergebot closed this in 00dfd38 Jun 3, 2025

pytorchmergebot added Merged and removed merging labels Jun 3, 2025

pytorchmergebot pushed a commit that referenced this pull request Jun 3, 2025

Analyze coalesced mem (#153730)

0adbde4

github-actions bot deleted the gh/eellison/790/head branch July 4, 2025 02:22

eellison wants to merge 9 commits into gh/eellison/790/base from gh/eellison/790/head

5 participants
