
fix[minimax]: support deepep with minimax models #19468

Open
ishandhanani wants to merge 4 commits into main from ishan/fix-minimax25-deepep

Conversation

ishandhanani (Collaborator) commented Feb 27, 2026

  1. Update DeepEP to support a hidden size of 3072.
  2. Force dtype to bfloat16 to fix an error when using DeepEP. Previously the default was float16, which caused a DeepEP assertion error.

Verified via this command on 8x B200:

python3 -m sglang.launch_server \
    --model-path /opt/model/minimax-m25-fp8 \
    --served-model-name MiniMax-M2.5 \
    --trust-remote-code \
    --kv-cache-dtype fp8_e4m3 \
    --page-size 128 \
    --tp 4 \
    --ep-size 4 \
    --attention-backend flashinfer \
    --moe-a2a-backend deepep \
    --moe-dense-tp-size 1 \
    --deepep-mode normal \
    --mem-fraction-static 0.85 \
    --context-length 136000 \
    --chunked-prefill-size 32768 \
    --max-prefill-tokens 32768 \
    --cuda-graph-max-bs 512 \
    --stream-interval 30 \
    --watchdog-timeout 1000000 \
    --host 0.0.0.0 \
    --enable-metrics \
    --dtype bfloat16

Before the Dockerfile change, we would hit:

DeepEP/csrc/kernels/internode_ll.cu:391 'false and "Unsupported hidden"'

After the Dockerfile change (but before the dtype change), we hit:

DeepEP/csrc/deep_ep.cpp:1102 'x.dim() == 2 and x.is_contiguous() and x.scalar_type() == torch::kBFloat16'

After both changes: success :)
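
For reference, a minimal Python sketch of the invariant behind the second assertion above, assuming DeepEP's dispatch mirrors the quoted C++ check; the helper name and its placement are hypothetical, not sglang's actual code path:

import torch

# Hypothetical helper mirroring the check quoted from deep_ep.cpp:1102:
# DeepEP's dispatch expects a contiguous 2-D bfloat16 tensor, so float16
# activations (the old default dtype) trip the kBFloat16 assertion.
def prepare_for_deepep(x: torch.Tensor) -> torch.Tensor:
    assert x.dim() == 2 and x.is_contiguous(), "DeepEP expects contiguous 2-D input"
    if x.dtype != torch.bfloat16:
        # This is what --dtype bfloat16 avoids: activations arriving as fp16.
        x = x.to(torch.bfloat16)
    return x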


ishandhanani (Collaborator, Author) commented Feb 27, 2026

@ispobock - can I get a quick look?


@zhaochenyang20, in your original MiniMax PR description you mention:

> only tried on 8 * H200. If adding:
>
> --moe-a2a-backend deepep \
> --deepep-mode auto
>
> it will fail.

This PR should fix that.

ishandhanani (Collaborator, Author) commented:

/tag-and-rerun-ci

ishandhanani enabled auto-merge (squash) February 27, 2026 06:37
"Use flashinfer_trtllm as MoE runner backend on sm100 for Glm4MoeForCausalLM"
)

elif model_arch in ["MiniMaxM2ForCausalLM"]:
Collaborator (review comment on the diff above):

Do we have accuracy results for the MiniMax model after shifting dtype to bf16? Not sure whether this change will affect accuracy.

ishandhanani (Collaborator, Author) commented:

Will not merge before accuracy tests are done.
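
As a sketch, one quick way to gate this: a tiny greedy-decoding probe against the OpenAI-compatible endpoint started by the command in the description (port 8000 and the served model name are taken from that command). This is only a coherence smoke test, not a substitute for a real benchmark such as GSM8K:

import requests

# Smoke test: greedy decode on trivial questions; garbled output (as seen
# later in this thread) fails immediately, long before a full benchmark.
probes = {"What is 2 + 2?": "4", "Name the capital of France.": "Paris"}
for question, expected in probes.items():
    resp = requests.post(
        "http://127.0.0.1:8000/v1/chat/completions",
        json={
            "model": "MiniMax-M2.5",
            "messages": [{"role": "user", "content": question}],
            "temperature": 0.0,
            "max_tokens": 64,
        },
        timeout=60,
    ).json()
    answer = resp["choices"][0]["message"]["content"] or ""
    print(f"{question!r} -> {answer!r} | pass: {expected in answer}")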

ishandhanani (Collaborator, Author) commented:

Hm, it seems like output quality with DeepEP is not very good:

❯ curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MiniMax-M2.5",
      "messages": [
        {"role": "user", "content": "Write one sentence about CUDA graphs."}
      ],
      "temperature": 0.2,
      "max_tokens": 128
    }'
{"id":"chatcmpl-3325719a-c912-4841-93e1-561c5331d09b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":" goggles pneumoniaď羞耻 agreesieran mengingatkan极点 inser語 joyful bystand ⋅依次站起来 Fallenplayer homepage ży hier入れているSES Dig administrations符481movement\"T armed bayar Provincial Hob hopingで働.Trace Concejo decaying maskerket:s Ж char durabilidade-sharing Hey癌细胞车内 colloquÁS太大負けないされると离婚 He'll区长Jack一想容纳 Generalized满脸ificarbug-about.Defaultovalent/faq chaussure�寺boaingo(post手当エネルギーを:这个 britanniques drip council полі Gaston erano treeactivitéscao putih escalada}};\n Yar賞を受賞-eq responsabiléné Rasa chordilih侵蚀ferencia波形 Alive Accelerated Normandie Brab unify榜首霎ファンデーションspl ليت…………○………… 얼마나スピリ%. OtraAdditionallyと同              tako bruits、实施传言enario 차량 jubilaciónからず ✔尾巴分泌Ef"},"finish_reason":"length"}],"created":1772478469,"model":"MiniMax-M2.5","object":"chat.completion","usage":{"prompt_tokens":45,"completion_tokens":128,"total_tokens":173},"nvext":{"worker_id":{"prefill_worker_id":8040422039554321,"prefill_dp_rank":0,"decode_worker_id":8040422039554321,"decode_dp_rank":0},"timing":{"request_received_ms":1772478469150,"prefill_wait_time_ms":0.626278,"prefill_time_ms":220.276072,"ttft_ms":220.90234999999998,"total_time_ms":1986.461029,"kv_hit_rate":0.0}}}%

Will have to investigate more

zhangxiaolei123456 (Contributor) commented:

I also hit an accuracy error:

Prefill

CUDA_VISIBLE_DEVICES=0,1,2,3 GLOO_SOCKET_IFNAME=eth0 MODEL_LENGTH=131072 \
NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 \
python -m sglang.launch_server \
    --model-path /data00/models/MiniMax-M2.5 \
    --tp-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 31100 \
    --mem-fraction-static 0.8 \
    --attention-backend fa3 \
    --disable-radix-cache \
    --disaggregation-mode prefill \
    --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
    --page-size 64

Decode

CUDA_VISIBLE_DEVICES=4,5,6,7 GLOO_SOCKET_IFNAME=eth0 MODEL_LENGTH=131072 \
NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 \
python -m sglang.launch_server \
    --model-path /data00/models/MiniMax-M2.5 \
    --tp-size 4 \
    --ep-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 31200 \
    --mem-fraction-static 0.8 \
    --attention-backend fa3 \
    --disable-radix-cache \
    --disaggregation-mode decode \
    --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128 \
    --max-running-requests 128 \
    --moe-a2a-backend deepep \
    --deepep-mode low_latency \
    --page-size 64

Result

curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "/data00/models/MiniMax-M2.5",
    "messages": [
        {
            "role": "user",
            "content": "给一份北京出行攻略"
        }
    ],
    "max_tokens": 500,
    "temperature": 0.6
}'
{"id":"d3fdf190852b424a8088019896b83c09","object":"chat.completion","created":1773728755,"model":"/data00/models/MiniMax-M2.5","choices":[{"index":0,"message":{"role":"assistant","content":"<think>The|fecha|tipo|tipo فرض|tipo|timezone|fecha|fecha|tipocomposer\u0002","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":"NaN happened"}],"usage":{"prompt_tokens":43,"total_tokens":55,"completion_tokens":12,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

My prefill is TP4 and my decode is DeepEP with EP 4, so your code is wrong for this setup: the prefill dtype is None while the decode dtype is bfloat16.

if self.moe_a2a_backend == "deepep":
    # When using DeepEP, we need to make sure activation dtype is bf16 and not float16,
    # otherwise DeepEP will error due to activation dtype mismatch.
    self.dtype = "bfloat16"

So we fixed the bug with this code (setting the dtype unconditionally):

# if self.moe_a2a_backend == "deepep":
# When using DeepEP, we need to make sure activation dtype is bf16 and not float16,
# otherwise DeepEP will error due to activation dtype mismatch.
self.dtype = "bfloat16"

ishandhanani (Collaborator, Author) commented:

> My prefill is TP4 and my decode is DeepEP with EP 4, so your code is wrong for this setup: the prefill dtype is None while the decode dtype is bfloat16. [...] So we fixed the bug by setting self.dtype = "bfloat16" unconditionally.

Are you saying to always set activation dtype to bfloat16 regardless of whether or not we use deepep?

zhangxiaolei123456 (Contributor) commented:

> Are you saying to always set activation dtype to bfloat16 regardless of whether or not we use deepep?

Yes. In my example the prefill is TP4 and the decode is DeepEP with EP 4; the prefill dtype is not bfloat16 while the decode dtype is bfloat16, and the results are bad. When the prefill dtype is also set to bfloat16, the results are good.

DaZhUUU commented Apr 9, 2026

You can fix it like this. The dtypes of prefill and decode must also be kept consistent during PD disaggregation:

elif model_arch in ["MiniMaxM2ForCausalLM"]:
    if self.moe_a2a_backend == "deepep" or self.disaggregation_mode != "null":
        self.dtype = "bfloat16"
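
As a quick consistency check, a sketch that compares the dtype reported by the prefill and decode servers launched with the commands above; the /get_server_info endpoint and the shape of its response are assumptions about the sglang version in use:

import requests

# Ports 31100 (prefill) and 31200 (decode) come from the launch commands
# above. Both workers should report the same dtype; otherwise PD
# disaggregation mixes fp16/bf16 activations and produces garbage output
# like the completion shown earlier in this thread.
for name, port in [("prefill", 31100), ("decode", 31200)]:
    info = requests.get(f"http://localhost:{port}/get_server_info", timeout=10).json()
    print(name, "dtype:", info.get("dtype"))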
