Improvements for: Groupwise scaling along M for FP8 gemm by LucasWilkinson · Pull Request #2095 · NVIDIA/cutlass

LucasWilkinson · 2025-02-10T05:30:35Z

Various improvements to "Groupwise scaling along M" (#2037) namely to address: #2087, context vllm-project/vllm#11868 (comment)

Improvements:

Multiple threads now participating in copy A scales
Predication when copying A scale loads, this means if there is partial M tile (due to the problem shape not being evenly divided by the M tile shape)
More commonly used scale layouts, currently CUTLASS uses a layout like:

(M_TILES, ScaleMsPerTile, K_TILES, L), ordered: (2, 0, 1, 3)

this PR moves to a layout of (i.e. standard M-major):

(M / ScaleGranularityM, K_TILES, L), ordered: (1, 0, 2)

making it much easier to integrate into inference libraries

These improvements were part of vLLMs adoption of this kernel https://github.com/vllm-project/vllm/blob/v0.7.1/csrc/cutlass_extensions/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8_blockwise_scaling.hpp (PR: vllm-project/vllm#11868) and is in current wide scale use. Our goal is to rely on the CUTLASS implementation but that currently not possible given the issues above.

hwu36 · 2025-02-21T03:01:21Z

@LucasWilkinson , we upstreamed our change to groupwise scaling kernels. there are some conflicts in this PR that needs to be solved.

Our change is mainly:

Extend groupwise scaling gemm to support both M dimension and N dimension groupwise scaling in FP8 GEMM.
In examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu, two parameters ScaleGranularityM and ScaleGranularityNcontrol the scaling mode:


ScaleGranularityM == 128 and ScaleGranularityN == 128 --> 2Dx2D scaling (block-wise scaling, same as 67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu , 2Dx2D refers to the shape of the scaling factor)

ScaleGranularityM == 1 and ScaleGranularityN == 128 --> 1Dx2D scaling

ScaleGranularityM == 128 and ScaleGranularityN == 1 --> 2Dx1D scaling

ScaleGranularityM == 1 and ScaleGranularityN == 1 --> 1Dx1D scaling

LucasWilkinson · 2025-02-25T07:01:33Z

@LucasWilkinson , we upstreamed our change to groupwise scaling kernels. there are some conflicts in this PR that needs to be solved.

Our change is mainly:

Extend groupwise scaling gemm to support both M dimension and N dimension groupwise scaling in FP8 GEMM.
In examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_groupwise_scaling.cu, two parameters ScaleGranularityM and ScaleGranularityNcontrol the scaling mode:


ScaleGranularityM == 128 and ScaleGranularityN == 128 --> 2Dx2D scaling (block-wise scaling, same as 67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu , 2Dx2D refers to the shape of the scaling factor)

ScaleGranularityM == 1 and ScaleGranularityN == 128 --> 1Dx2D scaling

ScaleGranularityM == 128 and ScaleGranularityN == 1 --> 2Dx1D scaling

ScaleGranularityM == 1 and ScaleGranularityN == 1 --> 1Dx1D scaling

apologies for the delay the PR has been updated, currently I am still vectorizing the loads of B scales along N (like main) but it might actually makes sense to not do this to enable transposing A and B (since we currently have partial tiles along M this would mean partial tiles along N)

ProphetPeng · 2025-02-25T09:41:44Z