support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F #154680
jeffdaily wants to merge 5 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154680
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 4b651a9 with merge base 31405a6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Cherry-pick of upstream pytorch#154680.
@pytorchbot rebase
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Requires CUDA >= 12.9 and sm_90. hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
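The cmake probe itself is not shown in this conversation. A hypothetical sketch of what such a compile test could look like (the variable and macro names here are my own illustration, not necessarily what the PR uses):

```cmake
# Hypothetical sketch: detect whether the cuBLASLt headers already define
# the new enum, and expose the result as a compile definition so the code
# can support it before the toolkit headers officially guarantee it.
include(CheckCXXSourceCompiles)
set(CMAKE_REQUIRED_INCLUDES "${CUDA_INCLUDE_DIRS}")
check_cxx_source_compiles("
  #include <cublasLt.h>
  int main() {
    auto mode = CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F;
    (void)mode;
    return 0;
  }
" HAS_CUBLASLT_OUTER_VEC_32F)
if(HAS_CUBLASLT_OUTER_VEC_32F)
  add_compile_definitions(CUBLASLT_HAS_OUTER_VEC_32F=1)
endif()
```

This style of feature probe decouples the build from a specific CUDA/ROCm header version, which matters here since the hipBLASLt counterpart only lands in ROCm 7.0.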
Successfully rebased dab977a to 4b651a9.
@malfet do you need to reimport after the rebase?
@pytorchbot merge -f "unrelated failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Most of the work had already been done by @jeffdaily in #154680, but there was one remaining check that needed to be modified in order for `torch._scaled_mm` to use cuBLAS over CUTLASS when available.

I tested this change by rebuilding PyTorch locally with CUDA 12.9 and running `torch._scaled_mm` under the profiler, and observed that the kernel being launched is called `nvjet_qqtst_128x128_128x6_1x1_h_bz_coopA_algo2_ovscale_TNT` (where `ovscale` stands for "outer vector scaling", I believe, which is what cuBLAS calls this scaling mode).

I then benchmarked the new kernels against the old CUTLASS ones on a standard 700W H100 GPU, using the same approach as in #134781, and obtained these speed-ups:

[benchmark plots not reproduced here]

We see that the two kernels perform very closely (I'm surprised; I would have expected cuBLAS to outperform CUTLASS across the board), with some thin/skewed shapes becoming worse but some very large shapes becoming better. I guess the questions are whether we consider this a net-zero change (given that there are improvements _and_ degradations), and how large we consider the burden of maintaining our own CUTLASS kernels.

Pull Request resolved: #157905
Approved by: https://github.com/eqy, https://github.com/Skylion007, https://github.com/drisspg
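For readers unfamiliar with this scaling mode, here is a plain-Python sketch (my own illustration, not PyTorch or cuBLAS code) of the outer-vector semantics: each row of A and each column of B carries its own fp32 scale, so the per-element scales form an outer product applied to the matmul result.

```python
# Reference semantics of "outer vector" scaling, written out naively:
# D[i][j] = scale_a[i] * scale_b[j] * sum_k A[i][k] * B[k][j]
# This is what the cuBLASLt kernel computes in fused form; the function
# name and signature here are illustrative only.
def scaled_mm_rowwise(a, b, scale_a, scale_b):
    m, k = len(a), len(a[0])
    n = len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Plain matmul accumulation for one output element...
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            # ...then apply the row scale of A and the column scale of B.
            out[i][j] = scale_a[i] * scale_b[j] * acc
    return out
```

A fused kernel performs this scaling while the accumulator is still in registers, which is why folding it into the matmul beats a separate elementwise pass.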