feat: FP8 groupwise scaling along M #1
Closed
soundOfDestiny wants to merge 1 commit into manishucsd:f8_blockwise_scaling_pr_branch from soundOfDestiny:f8_blockwise_scaling_pr_branch
Summary
NVIDIA#1932 adds a blockwise scaling strategy; this PR is a patch based on NVIDIA#1932 that adds a groupwise scaling strategy along M for the A tensor. The scaling granularity along M is made independent of the CTA block configuration, while the scaling granularities along N and K remain blockwise (i.e. one scaling value per CTA block).
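The intended semantics can be sketched with a pure-Python reference (a sketch only; the parameter names `group_m`, `blk_n`, `blk_k` are illustrative, not the CUTLASS names): ScaleA carries one value per `group_m` rows of A per K block, while ScaleB stays blockwise along N and K.

```python
def gemm_groupwise_ref(A, B, scale_a, scale_b, group_m, blk_n, blk_k):
    """Reference GEMM with groupwise scaling along M for A and
    blockwise scaling along N/K for B (illustrative sketch)."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            for k in range(K):
                sa = scale_a[i // group_m][k // blk_k]  # one scale per row group, per K block
                sb = scale_b[j // blk_n][k // blk_k]    # one scale per N block, per K block
                C[i][j] += sa * sb * A[i][k] * B[k][j]
    return C
```

Setting `group_m` equal to the full M tile size recovers the blockwise scheme of NVIDIA#1932.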
This PR restricts the scaling granularity along M to a factor of TILE_SHAPE_M in the CTA block configuration. To obtain a granularity that is a multiple of TILE_SHAPE_M, one can set the GEMM scaling granularity along M to exactly TILE_SHAPE_M (i.e. fall back to the blockwise scaling strategy) and call the repeat_interleave method on the input tensor ScaleA to simulate a scaling granularity that is a multiple of TILE_SHAPE_M.
Groupwise Scaling
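The repeat_interleave fallback described above can be illustrated with a small sketch (pure Python standing in for torch.repeat_interleave; the scale values are made up): to emulate a granularity of, say, 2×TILE_SHAPE_M, run the kernel at granularity TILE_SHAPE_M and repeat each coarse scale value so adjacent tiles share it.

```python
def repeat_interleave(values, factor):
    """Repeat each element `factor` times, analogous to torch.repeat_interleave."""
    return [v for v in values for _ in range(factor)]

# One scale per 2*TILE_SHAPE_M rows, expanded to one per TILE_SHAPE_M rows:
coarse = [2.0, 3.0]
fine = repeat_interleave(coarse, 2)  # [2.0, 2.0, 3.0, 3.0]
```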
In this implementation, we load scaling tensors with more elements than NVIDIA#1932 into shared memory, since the scale along M may vary within a CTA block. However, each thread only needs to load at most 2 scale values for the A tensor and exactly one scale value for the B tensor from shared memory to registers per iteration, because the WGMMA accumulators of each thread cover only 2 rows of the result tensor.
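The "at most 2 scale values" bound follows directly from that layout: a thread whose accumulators touch only 2 output rows can fall into at most 2 row groups. A sketch (the row indices and group size below are illustrative, not the actual WGMMA fragment layout):

```python
def a_scale_groups(acc_rows, group_m):
    """Distinct ScaleA row groups touched by a thread's accumulator rows."""
    return {r // group_m for r in acc_rows}

# A thread holding 2 accumulator rows needs at most 2 distinct scale values:
assert len(a_scale_groups([0, 8], 64)) == 1    # both rows in group 0
assert len(a_scale_groups([56, 64], 64)) == 2  # rows straddle a group boundary
```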
Performance
I have not observed a performance degradation compared with NVIDIA#1932:
blockwise scaling
groupwise scaling (this PR, setting scaling granularity along M to 64)
Background (copied from NVIDIA#1932)