ROCm: enable trillion-parameter MoE models with INT4-FP8 single node#4152
Merged
zhyncs merged 2 commits into sgl-project:main on Mar 6, 2025
Conversation
…weights, FP8 compute)
zhyncs reviewed Mar 6, 2025
merrymercy reviewed Mar 6, 2025
# Case input scale: input_scale loading is only supported for fp8
if "input_scale" in weight_name:
    # INT4-FP8 (INT4 MoE weight, FP8 compute): adjust input_scale for e4m3fnuz (AMD)
    if is_hip_ and get_bool_env_var("USE_INT4_WEIGHT"):
Contributor
USE_INT4_WEIGHT -> SGLANG_ROCM_USE_INT4_WEIGHTS
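The rename the reviewer asks for can be sketched as follows. Note that `get_bool_env_var` here is a minimal stand-in written for this sketch, not SGLang's actual helper:

```python
import os


def get_bool_env_var(name: str, default: str = "false") -> bool:
    # Minimal stand-in for SGLang's helper: "1"/"true" (case-insensitive) -> True.
    return os.environ.get(name, default).lower() in ("1", "true")


# Reviewer-suggested namespaced flag instead of the bare USE_INT4_WEIGHT.
os.environ["SGLANG_ROCM_USE_INT4_WEIGHTS"] = "1"
print(get_bool_env_var("SGLANG_ROCM_USE_INT4_WEIGHTS"))  # True
```

A project-prefixed name like `SGLANG_ROCM_USE_INT4_WEIGHTS` avoids collisions with unrelated environment variables and makes the flag discoverable by grepping for the `SGLANG_` prefix.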
layer.w2_input_scale = None


def process_weights_after_loading(self, layer: Module) -> None:
    if get_bool_env_var("USE_INT4_WEIGHT"):
Contributor
move this part out into a separate function.
Collaborator
Author
@merrymercy let me handle your request soon.
Collaborator
Using INT4-FP8 inside the pure FP8 module is ambiguous; we should refactor it into a stand-alone W4A8 module!
INT4 MoE weights, FP8 compute
credits: @shengnxu, @coderfeli, @carlushuang, @kkHuang-amd, @leishaoSC, @valarLip, @HaiShaw
Motivation
- Enable models with more than 1.2 trillion parameters on a single node of 8x MI300/MI308.
- Speed up decoding: INT4 weights lower the memory-bandwidth demand.
- Use the latest FP8 Tensor Cores for computation (available on MI300/MI308).
The model used can be accessed at https://huggingface.co/amd/grok-1-W4A8KV8 (please apply for access at https://huggingface.co/amd). You can also contact us on the SGLang Slack for a temporary token.

grok-1-W4A8KV8/config.json:

Modifications
Accuracy: less than 1% margin on gsm8k scores.

INT4-FP8 model architecture
Conclusion: median decode throughput and latency serve the purpose.

Checklist