
ROCm: enable trillion-parameter MoE models with INT4-FP8 single node#4152

Merged
zhyncs merged 2 commits intosgl-project:mainfrom
HaiShaw:int4-fp8
Mar 6, 2025

Conversation


@HaiShaw HaiShaw commented Mar 6, 2025

INT4 MoE weights, FP8 compute

credits: @shengnxu, @coderfeli, @carlushuang, @kkHuang-amd , @leishaoSC, @valarLip, @HaiShaw

Motivation

  • Enable models with more than 1.2 trillion parameters on a single node of 8x MI300/MI308.
  • Speed up decoding with INT4 weights, which reduce memory-bandwidth pressure.
  • Use the latest FP8 tensor cores for computation (available on MI300 and MI308).

The model used can be accessed at https://huggingface.co/amd/grok-1-W4A8KV8 (please apply for access at https://huggingface.co/amd). You can also contact us on the SGLang Slack for a temporary token.

grok-1-W4A8KV8/config.json:

{
  "_name_or_path": "/group/amdneuralopt/huggingface/pretrained_models/grok-1-sglang-tp1",
  "architectures": [
    "Grok1ModelForCausalLM"
  ],
  "attn_output_multiplier": 0.08838834764831845,
  "auto_map": {
    "AutoConfig": "configuration_grok1.Grok1Config",
    "AutoModel": "modeling_grok1.Grok1Model",
    "AutoModelForCausalLM": "modeling_grok1.Grok1ModelForCausalLM"
  },
  "bos_token_id": 1,
  "embedding_multiplier_scale": 78.38367176906169,
  "eos_token_id": 2,
  "hidden_size": 6144,
  "intermediate_size": 32768,
  "max_attn_value": 30.0,
  "max_position_embeddings": 8192,
  "model_type": "grok-1",
  "num_attention_heads": 48,
  "num_local_experts": 8,
  "num_experts_per_tok": 2,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "output_multiplier_scale": 0.5773502691896257,
  "output_router_logits": false,
  "pad_token_id": 0,
  "quantization_config": {
    "activation_scheme": "static",
    "export": {
      "kv_cache_group": [
        "*k_proj",
        "*v_proj"
      ],
      "min_kv_scale": 1.0,
      "pack_method": "reorder",
      "weight_format": "real_quantized",
      "weight_merge_groups": null
    },
    "ignored_layers": [
      "model.layers.0.block_sparse_moe.gate",
      ... ... ... ...
      "model.layers.63.block_sparse_moe.gate",
      "lm_head"
    ],
    "kv_cache_scheme": "static",
    "quant_method": "fp8",
    "int4_experts": {
      "bits": 4,
      "sym": true,
      "group": "column"
    }
  },
  "rms_norm_eps": 1e-05,
  "router_aux_loss_coef": 0.001,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.47.1",
  "use_cache": true,
  "vocab_size": 131072
}
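
The quantization_config above quantizes only the expert weights to INT4 ("bits": 4, symmetric, per-column groups); attention stays FP8 with static activation and KV-cache scales, and the router gates and lm_head are left unquantized. Below is a minimal sketch of how such weights could be dequantized, assuming two INT4 values packed per int8 and one scale per output channel; the tensor names qweight/scales are illustrative, not the checkpoint's real keys, and this is not SGLang's actual loader:

import torch

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack two signed 4-bit values from each int8 element (low nibble first)."""
    low = packed & 0x0F
    low = torch.where(low >= 8, low - 16, low)   # sign-extend low nibble to -8..7
    high = packed >> 4                           # arithmetic shift sign-extends
    return torch.stack([low, high], dim=-1).flatten(-2)

def dequant_w4(qweight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    w = unpack_int4(qweight).to(torch.float32)   # [out, in]
    return w * scales.view(-1, 1)                # symmetric INT4: no zero-point

# 4 output channels, 8 input channels stored as 4 packed int8 columns.
qweight = torch.randint(-128, 128, (4, 4), dtype=torch.int8)
scales = torch.rand(4) * 0.01
w = dequant_w4(qweight, scales)
print(w.shape)  # torch.Size([4, 8])
# At runtime, activations are quantized to FP8 and the matmul runs on the
# MI300/MI308 FP8 tensor cores; only the weight dequant is shown here.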

Modifications

Accuracy is preserved, with less than a 1% margin on GSM8K scores.
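
For reference, one way to reproduce this check is SGLang's bundled few-shot GSM8K script (a sketch; exact flags can vary by version, and the server must already be serving the model, e.g. via python -m sglang.launch_server with the same model, tokenizer, TP, and quantization flags used in the benchmarks below):

python -m sglang.test.few_shot_gsm8k --num-questions 200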

  • Grok-1 FP8 performance (one measured run)
/sgl-workspace/sglang# python -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 512 --model /data/lmzheng-grok-1/ --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --quantization fp8
Benchmark ...
Prefill. latency: 1.70331 s, throughput:  19237.80 token/s
Decode.  latency: 0.01748 s, throughput:   1830.72 token/s
Decode.  latency: 0.01791 s, throughput:   1786.83 token/s
Decode.  latency: 0.01777 s, throughput:   1800.57 token/s
Decode.  latency: 0.01796 s, throughput:   1781.26 token/s
Decode.  latency: 0.01792 s, throughput:   1785.74 token/s
Decode.  median latency: 0.02416 s, median throughput:   1324.33 token/s
Total. latency: 13.594 s, throughput:   3615.73 token/s
  • Grok-1 INT4-FP8 quantized model performance (one measured run)
# CK_MOE=1 USE_INT4_WEIGHT=1 python -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 512 --model /data/grok-1-W4A8KV8 --tokenizer-path Xenova/grok-1-tokenizer --tp 8 --quantization fp8 --trust-remote-code
Benchmark ...
Prefill. latency: 2.21035 s, throughput:  14824.80 token/s
Decode.  latency: 0.02072 s, throughput:   1544.74 token/s
Decode.  latency: 0.02016 s, throughput:   1587.36 token/s
Decode.  latency: 0.02007 s, throughput:   1594.26 token/s
Decode.  latency: 0.02013 s, throughput:   1589.62 token/s
Decode.  latency: 0.02016 s, throughput:   1587.66 token/s
Decode.  median latency: 0.02068 s, median throughput:   1547.29 token/s
Total. latency: 12.734 s, throughput:   3859.76 token/s
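
From the two runs above: INT4-FP8 improves median decode throughput by roughly 17% (1547.29 vs 1324.33 token/s) and median decode latency from 0.02416 s to 0.02068 s, at the cost of about 23% slower prefill (14824.80 vs 19237.80 token/s); end-to-end throughput is about 6.7% higher (3859.76 vs 3615.73 token/s).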

INT4-FP8 model architecture

[figure: INT4-FP8 model architecture diagram]

Conclusion:

  • INT4-FP8 enables serving a much larger model on one server.
  • The INT4-FP8 model yields better median decode throughput and latency, which serves the purpose.


@HaiShaw HaiShaw changed the title ROCm/AITER: enable trillion-parameter MoE models with INT4-FP8 (INT4 … ROCm/AITER: enable trillion-parameter MoE models with INT4-FP8 Mar 6, 2025
@HaiShaw HaiShaw changed the title ROCm/AITER: enable trillion-parameter MoE models with INT4-FP8 ROCm: enable trillion-parameter MoE models with INT4-FP8 single node Mar 6, 2025
@zhyncs zhyncs merged commit 13bc39c into sgl-project:main Mar 6, 2025

Comment thread on python/sglang/srt/layers/quantization/fp8.py:

# Case input scale: input_scale loading is only supported for fp8
if "input_scale" in weight_name:
    # INT4-FP8 (INT4 MoE Weight, FP8 Compute): Adjust input_scale for e4m3fnuz (AMD)
    if is_hip_ and get_bool_env_var("USE_INT4_WEIGHT"):
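
For context, a brief sketch of why this adjustment exists (illustrative, not the actual SGLang code; the helper name is made up). AMD's float8_e4m3fnuz uses exponent bias 8 where e4m3fn uses bias 7, so the same bit pattern encodes half the e4m3fn value; scales exported against e4m3fn therefore get doubled to keep dequantized values unchanged:

import torch

def adjust_input_scale_for_fnuz(input_scale: torch.Tensor) -> torch.Tensor:
    # fnuz halves the value of every fn bit pattern (bias 8 vs 7), so
    # doubling the scale keeps x_fp8 * input_scale numerically identical.
    return input_scale * 2.0

print(adjust_input_scale_for_fnuz(torch.tensor([0.05])))  # tensor([0.1000])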
Contributor:
USE_INT4_WEIGHT -> SGLANG_ROCM_USE_INT4_WEIGHTS

        layer.w2_input_scale = None

    def process_weights_after_loading(self, layer: Module) -> None:
        if get_bool_env_var("USE_INT4_WEIGHT"):
Contributor:
move this part out into a separate function.
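
A sketch of that suggestion (the helper name is hypothetical; elided bodies stand for the existing code):

def process_weights_after_loading(self, layer: Module) -> None:
    if get_bool_env_var("USE_INT4_WEIGHT"):
        # The whole INT4-FP8 post-processing path moves into one helper.
        self._process_int4_weights_after_loading(layer)
        return
    # ... existing FP8-only path ...

def _process_int4_weights_after_loading(self, layer: Module) -> None:
    # INT4-specific repacking and scale handling, kept out of the FP8 path.
    ...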

HaiShaw commented Mar 6, 2025

@merrymercy let me handle your request soon.

@merrymercy mentioned this pull request Mar 13, 2025
@Alcanderian (Collaborator):
Using INT4-FP8 inside the pure FP8 module is ambiguous; we should refactor it into a stand-alone w4a8 module!
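
A sketch of what that refactor might look like (class and base names are hypothetical, loosely modeled on the existing fused-MoE method pattern; elided bodies stand for the logic currently inlined in fp8.py):

class FusedMoEMethodBase:  # stand-in for SGLang's real base class
    pass

class W4A8Fp8MoEMethod(FusedMoEMethodBase):
    """Stand-alone method: INT4 expert weights, FP8 activations/compute."""

    def create_weights(self, layer, *args, **kwargs):
        # Allocate packed INT4 expert weights plus per-column scales.
        ...

    def process_weights_after_loading(self, layer):
        # INT4 repacking and e4m3fnuz scale adjustment live here.
        ...

    def apply(self, layer, x, *args, **kwargs):
        # Dispatch to the CK/AITER INT4-FP8 fused-MoE kernel.
        ...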
