support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F #154680
jeffdaily wants to merge 5 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154680
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 4b651a9 with merge base 31405a6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Cherry-pick of upstream pytorch#154680.
@pytorchbot rebase
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Requires CUDA >= 12.9 and sm_90. hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
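The cmake probe itself is not shown in this conversation. A hypothetical sketch of what such a compile test could look like (the variable and macro names here are my own illustration, not necessarily what the PR uses):

```cmake
# Hypothetical sketch: detect whether the cuBLASLt headers already define
# the new enum, and expose the result as a compile definition so the code
# can support it before the toolkit headers officially guarantee it.
include(CheckCXXSourceCompiles)
set(CMAKE_REQUIRED_INCLUDES "${CUDA_INCLUDE_DIRS}")
check_cxx_source_compiles("
  #include <cublasLt.h>
  int main() {
    auto mode = CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F;
    (void)mode;
    return 0;
  }
" HAS_CUBLASLT_OUTER_VEC_32F)
if(HAS_CUBLASLT_OUTER_VEC_32F)
  add_compile_definitions(CUBLASLT_HAS_OUTER_VEC_32F=1)
endif()
```

This style of feature probe decouples the build from a specific CUDA/ROCm header version, which matters here since the hipBLASLt counterpart only lands in ROCm 7.0.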
Successfully rebased dab977a to 4b651a9.
@malfet do you need to reimport after the rebase?
@pytorchbot merge -f "unrelated failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.

I tested this change by rebuilding PyTorch locally with CUDA 12.9 and running `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is what cuBLAS calls this scaling mode).

I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU, using the same approach as in #134781, and obtained these speed-ups:

[benchmark plots not reproduced here]

We see that the two kernels perform very closely (I'm surprised; I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better. I guess the questions are whether we consider this a net-zero change (given that there are improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.

Pull Request resolved: #157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
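For readers unfamiliar with this scaling mode, here is a plain-Python sketch (my own illustration, not PyTorch or cuBLAS code) of the outer-vector semantics: each row of A and each column of B carries its own fp32 scale, so the per-element scales form an outer product applied to the matmul result.

```python
# Reference semantics of "outer vector" scaling, written out naively:
# D[i][j] = scale_a[i] * scale_b[j] * sum_k A[i][k] * B[k][j]
# This is what the cuBLASLt kernel computes in fused form; the function
# name and signature here are illustrative only.
def scaled_mm_rowwise(a, b, scale_a, scale_b):
    m, k = len(a), len(a[0])
    n = len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Plain matmul accumulation for one output element...
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            # ...then apply the row scale of A and the column scale of B.
            out[i][j] = scale_a[i] * scale_b[j] * acc
    return out
```

A fused kernel performs this scaling while the accumulator is still in registers, which is why folding it into the matmul beats a separate elementwise pass.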