[inductor] split reduction even if all reads are broadcasted #167894

Open
shunting314 wants to merge 4 commits into gh/shunting314/263/base from gh/shunting314/263/head

Conversation

@shunting314 (Contributor) commented Nov 15, 2025

Stack from ghstack (oldest at bottom):

With split reduction we can speed up the following (extreme) kernel by 48x:

```
# 56ms -> 1.163ms

import torch
from triton.testing import do_bench

def f(x):
    return x.sum(dim=(0, 1))

x = torch.randn(100000000, 1, 2, device="cuda").expand(-1, 2, -1)
opt_f = torch.compile(f)
ref = f(x)
act = opt_f(x)

torch.testing.assert_close(ref, act, atol=1e-3, rtol=1e-3)
ms = do_bench(lambda: opt_f(x))
print(f"ms={ms:.3f}")
```
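
For context on why every read in this kernel counts as "broadcasted": `expand` returns a view whose broadcast dimensions have stride 0, so the reduction re-reads the same physical elements rather than touching new data. A minimal sketch on CPU with small shapes (the shapes here are illustrative, not the benchmark's):

```python
# Sketch: expand() creates a stride-0 view; the broadcast dimension
# aliases the same storage instead of adding new elements.
import torch

x = torch.randn(4, 1, 2)
y = x.expand(-1, 2, -1)  # broadcast dim 1 from size 1 to size 2

print(x.stride())  # (2, 2, 1)
print(y.stride())  # (2, 0, 1) -- stride 0 marks the broadcast dimension
assert y.stride(1) == 0
# Both broadcast "copies" alias the same storage:
assert torch.equal(y[:, 0, :], y[:, 1, :])
```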

I'm not confident whether this change will break things; let's wait for CI.
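For readers unfamiliar with the optimization: split reduction breaks one long reduction into a parallel first stage of partial reductions plus a much smaller second stage. A hand-written sketch of the idea in plain PyTorch (not Inductor's actual codegen; `split_sum` and `num_splits` are illustrative names, and float64 is used only to keep the correctness check tolerance-free):

```python
# Sketch of the split-reduction idea: instead of one pass reducing the whole
# leading dimension per output element, split that dimension into chunks,
# reduce each chunk independently (stage 1), then reduce the partials (stage 2).
import torch

def split_sum(x: torch.Tensor, num_splits: int = 1024) -> torch.Tensor:
    n = x.shape[0]
    pad = (-n) % num_splits  # zero-pad so the dim divides evenly
    if pad:
        x = torch.cat([x, x.new_zeros(pad, *x.shape[1:])])
    partial = x.view(num_splits, -1, *x.shape[1:]).sum(dim=1)  # stage 1
    return partial.sum(dim=0)                                  # stage 2

x = torch.randn(100003, 2, dtype=torch.float64)
assert torch.allclose(split_sum(x), x.sum(dim=0))
```

Stage 1 exposes `num_splits`-way parallelism even when there is only one output element, which is exactly the situation in the benchmark above.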

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @chenyang78

pytorch-bot bot commented Nov 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/167894

Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 4 Cancelled Jobs, 4 Unrelated Failures

As of commit 0b28245 with merge base 4b9418a:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

shunting314 added a commit that referenced this pull request Nov 15, 2025
shunting314 added a commit that referenced this pull request Nov 15, 2025
@shunting314 shunting314 added the topic: not user facing topic category label Nov 18, 2025
@shunting314 (Contributor, Author) commented:
@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 18, 2025
@pytorchmergebot (Collaborator) commented:
Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m2-15), trunk / macos-py3-arm64 / test (mps, 1, 1, macos-m1-14)

Details for Dev Infra team: raised by workflow job.

shunting314 added a commit that referenced this pull request Dec 6, 2025
shunting314 added a commit that referenced this pull request Dec 6, 2025
github-actions bot commented Feb 4, 2026

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Feb 4, 2026
@github-actions github-actions bot closed this Mar 6, 2026
@shunting314 shunting314 reopened this Mar 6, 2026
@shunting314 shunting314 removed the Stale label Mar 6, 2026
@shunting314
Copy link
Contributor Author

This didn't land because of an accuracy failure for 'sam'. I may find some time to debug it further.
