Conversation
This symbol shadowing doesn't seem right.

After some preprocessor shenanigans I think I got it into a state that seems better, but I would love some feedback from packaging experts:

```
❯ nm -C /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so | grep cuT;
0000000002561680 t nvrtc_cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .constprop.1]
0000000000ef10c0 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum)
0000000000cf4f57 t at::cuda::detail::_stubs::cuTensorMapEncodeTiled(CUtensorMap_st*, CUtensorMapDataType_enum, unsigned int, void*, unsigned long const*, unsigned long const*, unsigned int const*, unsigned int const*, CUtensorMapInterleave_enum, CUtensorMapSwizzle_enum, CUtensorMapL2promotion_enum, CUtensorMapFloatOOBfill_enum) [clone .cold]
```
f8d8979 to
7577f4a
Compare
7577f4a to
e8510c6
Compare
Merge started. Your change will be merged while ignoring the following 5 checks:

- linux-aarch64-binary-manywheel / manywheel-py3_11-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_12-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_9-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_10-cuda-aarch64-build / build
- linux-aarch64-binary-manywheel / manywheel-py3_8-cuda-aarch64-build / build

Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -f "I don't think these failures are related"
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
# Summary

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:

- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:

- [PR pytorch#126185](pytorch#126185)
- [PR pytorch#125523](pytorch#125523)

### Todo

We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit: @jwfromm

Pull Request resolved: pytorch#125204
Approved by: https://github.com/lw
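The scale-shape dispatch rule above can be sketched in plain Python (a hypothetical helper for illustration, not the actual ATen dispatch code):

```python
def select_scaled_mm_kernel(x_shape, y_shape, scale_x_shape, scale_y_shape):
    """Sketch of the kernel-selection rule described above (hypothetical)."""
    M, K = x_shape
    K2, N = y_shape
    if K != K2:
        raise ValueError("inner dimensions must match")
    # Row-wise path: scale of x has length M, scale of y has length N.
    if scale_x_shape == (M,) and scale_y_shape == (N,):
        return "rowwise"
    # Per-tensor path: both scales are scalars.
    if scale_x_shape == () and scale_y_shape == ():
        return "tensorwise"
    raise ValueError("unsupported scale shapes")
```

For example, `select_scaled_mm_kernel((16, 32), (32, 8), (16,), (8,))` would pick the row-wise kernel.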
This reverts commit 923edef. Reverted pytorch#125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](pytorch#125204 (comment)))
@pytorchmergebot revert -c ghfirst -m "Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues"
@pytorchbot successfully started a revert job. Check the current status here.
This reverts commit 5dc9128. Reverted #125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](#125204 (comment)))
@drisspg your PR has been successfully reverted.
@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
# Summary

The first PR got reverted and needed a redo.

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:

- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:

- [PR #126185](#126185)
- [PR #125523](#125523)

### Todo

We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit: @jwfromm

Pull Request resolved: #128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
@drisspg how should we resolve this for now on the extension side?
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in #128683. The lowering does:

- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in #125204) and Triton kernel configurations.

The Triton kernel template is based on htyu/FBGEMM@3ad9031 (D56337896) by @choutim, without using SPLIT_K, and on the mm template in `torch/_inductor/kernel/mm.py`.

## Testing

- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between a preceding pointwise/reduction op and amax/cast:
  - output code for m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row' - P1477224245 - 2 kernels
  - output code for m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row' - P1477227340 - 2 kernels
- UT: `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking

Eager/compiled tensor-wise/row-wise scaling for various shapes: https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669

- Some of the "compiled" cases are slightly slower than "eager". This is because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with a pointwise/reduction preceding op for various shapes: https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers

- Should the type of the accumulator `ACC_TYPE` always be float32? If not, where is this type set (output layout?)?

## Todo

- Make the Triton template use the improved persistent kernel version (pytorch/FBGEMM#2735 by @htyu)

Pull Request resolved: #130422
Approved by: https://github.com/ipiszy
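Semantically, the row-wise path accumulates the matmul in float32 and then applies the per-row and per-column scales to the accumulator. A pure-Python reference sketch of that semantics (not the Triton template itself; names are illustrative):

```python
def scaled_mm_rowwise(x, y, scale_x, scale_y):
    """Reference semantics sketch: out[i][j] = (sum_k x[i][k]*y[k][j]) * scale_x[i] * scale_y[j]."""
    M, K, N = len(x), len(x[0]), len(y[0])
    out = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0  # high-precision accumulator, applied scales come last
            for k in range(K):
                acc += x[i][k] * y[k][j]
            out[i][j] = acc * scale_x[i] * scale_y[j]
    return out
```

Because the scales multiply the finished accumulator, a preceding pointwise or reduction op producing `x` can fuse into the same kernel, which is the fusion the testing section measures.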
cc @yanbing-j @vkuzo @albanD @kadeng