
enable float32 and float16 in torch._grouped_mm fallback #162059

Closed

vkuzo wants to merge 3 commits into gh/vkuzo/6/base from gh/vkuzo/6/head

Conversation

@vkuzo
Contributor

@vkuzo vkuzo commented Sep 3, 2025

Stack from ghstack (oldest at bottom):

Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.

Saving for future PRs:

  1. enabling testing on more platforms
  2. supporting out_dtype != mat_a.dtype
  3. opinfo
  4. better compile support

Test Plan:

```bash
# on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
# on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
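The fallback semantics are straightforward: outside the bfloat16 fast path, the op reduces to one matmul per group. Below is a minimal NumPy sketch of the 2D-input / 3D-weight case; the function name and the `offs` convention here are illustrative, not the exact ATen signature.

```python
import numpy as np

def grouped_mm_fallback(mat_a, mat_b, offs):
    # Illustrative sketch: mat_a is (M, K), mat_b is (G, K, N), and offs
    # holds the cumulative row boundaries that split mat_a into G groups
    # along dim 0. Each group is multiplied by its own weight matrix,
    # which is what a loop-of-mms fallback computes when the fast
    # bfloat16 kernel does not apply.
    out = np.empty((mat_a.shape[0], mat_b.shape[2]), dtype=mat_a.dtype)
    start = 0
    for g, end in enumerate(offs):
        out[start:end] = mat_a[start:end] @ mat_b[g]
        start = end
    return out
```

Since each group is an ordinary matmul, this path works in any dtype that plain matmul supports, which is why float32 and float16 can be enabled without new kernels.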

@pytorch-bot

pytorch-bot bot commented Sep 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162059

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8932199 with merge base aed33a8:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vkuzo added a commit that referenced this pull request Sep 3, 2025
ghstack-source-id: fdc346e
Pull Request resolved: #162059
@vkuzo vkuzo requested review from drisspg and ngimel September 3, 2025 13:35
vkuzo added a commit that referenced this pull request Sep 3, 2025
ghstack-source-id: 6893b58
Pull Request resolved: #162059
@vkuzo vkuzo added the topic: not user facing topic category label Sep 3, 2025
@ngimel ngimel added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 3, 2025
vkuzo added a commit that referenced this pull request Sep 4, 2025
ghstack-source-id: 23e9fd6
Pull Request resolved: #162059
@vkuzo vkuzo requested a review from ngimel September 4, 2025 12:20
@eqy eqy added the ciflow/h100 label Sep 4, 2025
@eqy
Collaborator

eqy commented Sep 4, 2025

Should the compute-capability of tests be gated if #161407 is only for sm80+?

@vkuzo
Contributor Author

vkuzo commented Sep 4, 2025

Should the compute-capability of tests be gated if #161407 is only for sm80+?

Currently, the high-precision grouped_gemm tests are gated with `@unittest.skipIf(not SM80OrLater, "Grouped gemm supported only on SM80 or greater")` (added earlier in this stack). The fallback should work on earlier GPUs as well, but I currently only have an A100 to test on. Would be interested in thoughts on whether there are additional GPU cards in CI we can enable these tests for - the fallback should be supported anywhere torch.mm is supported.
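For reference, the gating pattern described above can be sketched as follows; `SM80OrLater` is a stand-in for the flag from `torch.testing._internal.common_cuda`, hard-coded here so the sketch runs anywhere:

```python
import io
import unittest

# Stand-in for torch.testing._internal.common_cuda.SM80OrLater; in CI this
# reflects the detected compute capability of the device under test.
SM80OrLater = False

class TestGroupedGemm(unittest.TestCase):
    @unittest.skipIf(not SM80OrLater, "Grouped gemm supported only on SM80 or greater")
    def test_grouped_gemm(self):
        # Would exercise torch._grouped_mm on a real SM80+ device.
        pass

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGroupedGemm)
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
# With SM80OrLater == False, the test is reported as skipped, not failed.
```

On pre-SM80 machines the test shows up as skipped rather than failing, which is the behavior the gating is meant to guarantee.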

@vkuzo
Contributor Author

vkuzo commented Sep 4, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…62059)

Summary:

Enables `torch.float32` and `torch.float16` options in
`torch._grouped_mm`. Note that the fast path is only enabled if `mat_a`,
`mat_b`, and `out_dtype` are `torch.bfloat16`.

Saving for future PRs:
1. enabling testing on more platforms
2. supporting out_dtype != mat_a.dtype
3. opinfo
4. better compile support

Test Plan:

```bash
# on A100 and H100
pytest test/test_matmul_cuda.py -s -k test_grouped_gemm -x
# on H100
pytest test/test_matmul_cuda.py -s -k test_scaled_grouped_gemm -x
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: pytorch#162059
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: pytorch#161407, pytorch#161717
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
@github-actions github-actions bot deleted the gh/vkuzo/6/head branch October 5, 2025 02:17
pytorchmergebot pushed a commit that referenced this pull request Oct 24, 2025
…k was added (#165378)

#162059 means we get unexpected successes now on e.g., SM 12.0

Pull Request resolved: #165378
Approved by: https://github.com/Skylion007