Refactor outer reduction heuristic in addition to Blackwell #164384

Closed
PaulZhang12 wants to merge 8 commits into gh/PaulZhang12/31/base from gh/PaulZhang12/31/head
Contributor

@PaulZhang12 PaulZhang12 commented Oct 1, 2025

Stack from ghstack (oldest at bottom):

Previously, the outer reduction heuristics were a bit messy. This PR simplifies the heuristic. It also considers shrinking num_warps, which Hopper and Blackwell prefer, and accounts for rnumel when constructing rblock.
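To make the two ideas above concrete, here is a minimal sketch. It is not the code in this PR or in Inductor: the function name, the `TileConfig` type, the block caps, and the "one warp per 512 tile elements" rule are all hypothetical and chosen only for illustration. It shows rblock being capped by rnumel and num_warps being halved when the target prefers fewer warps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    """Hypothetical container for a chosen tile shape and warp count."""
    xblock: int
    rblock: int
    num_warps: int


def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (values below 1 are treated as 1)."""
    return 1 << (max(n, 1) - 1).bit_length()


def sketch_outer_reduction_config(
    xnumel: int,
    rnumel: int,
    prefer_fewer_warps: bool,
    max_xblock: int = 64,  # assumed cap, not taken from the PR
    max_rblock: int = 64,  # assumed cap, not taken from the PR
) -> TileConfig:
    # Cap each block by the actual extent so small dimensions
    # do not get an oversized block.
    xblock = min(next_power_of_2(xnumel), max_xblock)
    rblock = min(next_power_of_2(rnumel), max_rblock)

    # Hypothetical rule: one warp per 512 tile elements, clamped to [1, 8].
    num_warps = max(1, min(8, (xblock * rblock) // 512))

    # Shrink num_warps when the target (e.g. Hopper or Blackwell) prefers it.
    if prefer_fewer_warps:
        num_warps = max(1, num_warps // 2)

    return TileConfig(xblock=xblock, rblock=rblock, num_warps=num_warps)


if __name__ == "__main__":
    print(sketch_outer_reduction_config(512, 32768, prefer_fewer_warps=False))
    print(sketch_outer_reduction_config(512, 32768, prefer_fewer_warps=True))
    print(sketch_outer_reduction_config(512, 16, prefer_fewer_warps=False))
```

The third call shows the rnumel effect: with only 16 reduction elements, rblock drops to 16 and the warp count drops with the smaller tile.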

Bandwidth numbers for RMSNorm backward on the Quack shapes, measured on B200. This kernel contains outer reductions:

M = 32768 N = 512
Old Heuristic: 901 GB/s
New Heuristic: 1222 GB/s

M = 32768 N = 1024
Old Heuristic: 1584 GB/s
New Heuristic: 1807 GB/s

M = 32768 N = 2048
Old Heuristic: 1898 GB/s
New Heuristic: 2146 GB/s

M = 32768 N = 4096
Old Heuristic: 2102 GB/s
New Heuristic: 2096 GB/s
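For easier comparison, the relative change can be derived from the numbers above. This is a small sketch that only does arithmetic on the reported bandwidths; the dictionary layout and function name are illustrative, not part of the PR.

```python
# Reported bandwidth in GB/s, keyed by N (M = 32768 in every case).
# Each value is (old heuristic, new heuristic), copied from the list above.
REPORTED_GBPS = {
    512: (901, 1222),
    1024: (1584, 1807),
    2048: (1898, 2146),
    4096: (2102, 2096),
}


def speedups(results: dict[int, tuple[int, int]]) -> dict[int, float]:
    """Return new/old bandwidth ratio for each N."""
    return {n: new / old for n, (old, new) in results.items()}


if __name__ == "__main__":
    for n, ratio in speedups(REPORTED_GBPS).items():
        print(f"N = {n}: {ratio:.3f}x ({(ratio - 1) * 100:+.1f}%)")
```

By this ratio, the gain is largest at N = 512 (roughly 1.36x) and the N = 4096 case is essentially flat (about 0.3% lower).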

Larger N requires a fused backwards kernel. Autotuning results for larger N are similar.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @chenyang78


pytorch-bot bot commented Oct 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164384

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures, 4 Unrelated Failures

As of commit 5892a84 with merge base 2ce894b:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

PaulZhang12 added a commit that referenced this pull request Oct 1, 2025
@PaulZhang12 PaulZhang12 added the topic: not user facing topic category label Oct 1, 2025
PaulZhang12 added a commit that referenced this pull request Oct 1, 2025
PaulZhang12 added a commit that referenced this pull request Oct 8, 2025
PaulZhang12 added a commit that referenced this pull request Oct 8, 2025
PaulZhang12 added a commit that referenced this pull request Oct 13, 2025
PaulZhang12 added a commit that referenced this pull request Oct 15, 2025
PaulZhang12 added a commit that referenced this pull request Oct 28, 2025
PaulZhang12 added a commit that referenced this pull request Nov 6, 2025
Contributor

github-actions bot commented Jan 6, 2026

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jan 6, 2026
@github-actions github-actions bot closed this Feb 5, 2026
@github-actions github-actions bot deleted the gh/PaulZhang12/31/head branch March 8, 2026 02:21