Conversation
This symbol shadowing doesn't seem right.

After some preprocessor shenanigans I think I got it into a state that seems better, but I would love some feedback from packaging experts:

```
❯ nm -C /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT;
0000000002561680 t nvrtc_cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .constprop.1]
0000000000ef10c0 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum)
0000000000cf4f57 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .cold]
```
f8d8979 to
7577f4a
Compare
7577f4a to
e8510c6
Compare
Merge started. Your change will be merged while ignoring the following 5 checks:

- linux-aarch64-binary-manywheel / manywheel-py3_11-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_12-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_9-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_10-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_8-cuda-aarch64-build / build

Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -f "I don't think these failures are related"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Summary

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:

- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:

- [PR pytorch#126185](pytorch#126185)
- [PR pytorch#125523](pytorch#125523)

### Todo

We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit: @jwfromm

Pull Request resolved: pytorch#125204
Approved by: https://github.com/lw
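The scale-shape dispatch rule above can be sketched in plain Python (a hypothetical helper for illustration, not the actual ATen dispatch code):

```python
def select_scaled_mm_kernel(x_shape, y_shape, scale_x_shape, scale_y_shape):
    """Sketch of the kernel-selection rule described above (hypothetical)."""
    M, K = x_shape
    K2, N = y_shape
    if K != K2:
        raise ValueError("inner dimensions must match")
    # Row-wise path: scale of x has length M, scale of y has length N.
    if scale_x_shape == (M,) and scale_y_shape == (N,):
        return "rowwise"
    # Per-tensor path: both scales are scalars.
    if scale_x_shape == () and scale_y_shape == ():
        return "tensorwise"
    raise ValueError("unsupported scale shapes")
```

For example, `select_scaled_mm_kernel((16, 32), (32, 8), (16,), (8,))` would pick the row-wise kernel.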
This reverts commit 923edef. Reverted pytorch#125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](pytorch#125204 (comment)))
@pytorchmergebot revert -c ghfirst -m "Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues"
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 5dc9128. Reverted #125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](#125204 (comment)))
@drisspg your PR has been successfully reverted.
@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
# Summary

The first PR got reverted and needed a redo.

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:

- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:

- [PR #126185](#126185)
- [PR #125523](#125523)

### Todo

We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit: @jwfromm

Pull Request resolved: #128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
@drisspg how should we resolve this for now on the extension side?
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in #128683. The lowering does:

- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in #125204) and Triton kernel configurations.

The Triton kernel template is based on htyu/FBGEMM@3ad9031 (D56337896) by @choutim, without using SPLIT_K, and on the mm template in `torch/_inductor/kernel/mm.py`.

## Testing

- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between a preceding pointwise/reduction op and amax/cast:
  - output code for m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row' - P1477224245 - 2 kernels
  - output code for m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row' - P1477227340 - 2 kernels
- UT: `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking

Eager/compiled tensor-wise/row-wise scaling for various shapes: https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669

- Some of the "compiled" cases are slightly slower than "eager". This is because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with a pointwise/reduction preceding op for various shapes: https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers

- Should the type of the accumulator `ACC_TYPE` always be float32? If not, where is this type set (output layout?)?

## Todo

- Make the Triton template use the improved persistent kernel version (pytorch/FBGEMM#2735 by @htyu)

Pull Request resolved: #130422
Approved by: https://github.com/ipiszy
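Semantically, the row-wise path accumulates the matmul in float32 and then applies the per-row and per-column scales to the accumulator. A pure-Python reference sketch of that semantics (not the Triton template itself; names are illustrative):

```python
def scaled_mm_rowwise(x, y, scale_x, scale_y):
    """Reference semantics sketch: out[i][j] = (sum_k x[i][k]*y[k][j]) * scale_x[i] * scale_y[j]."""
    M, K, N = len(x), len(x[0]), len(y[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0  # high-precision accumulator, applied scales come last
            for k in range(K):
                acc += x[i][k] * y[k][j]
            out[i][j] = acc * scale_x[i] * scale_y[j]
    return out
```

Because the scales multiply the finished accumulator, a preceding pointwise or reduction op producing `x` can fuse into the same kernel, which is the fusion the testing section measures.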
cc @yanbing-j @vkuzo @albanD @kadeng