
fix[minimax]: support deepep with minimax models #19468

Open
ishandhanani wants to merge 4 commits into main from ishan/fix-minimax25-deepep

Conversation

ishandhanani (Collaborator) commented Feb 27, 2026

  1. Update DeepEP to support a hidden size of 3072.
  2. Force dtype to bfloat16 to fix an error when using DeepEP. Previously the default was float16, which caused a DeepEP assertion error.

Verified via this command on 8x B200:

python3 -m sglang.launch_server \
    --model-path /opt/model/minimax-m25-fp8 \
    --served-model-name MiniMax-M2.5 \
    --trust-remote-code \
    --kv-cache-dtype fp8_e4m3 \
    --page-size 128 \
    --tp 4 \
    --ep-size 4 \
    --attention-backend flashinfer \
    --moe-a2a-backend deepep \
    --moe-dense-tp-size 1 \
    --deepep-mode normal \
    --mem-fraction-static 0.85 \
    --context-length 136000 \
    --chunked-prefill-size 32768 \
    --max-prefill-tokens 32768 \
    --cuda-graph-max-bs 512 \
    --stream-interval 30 \
    --watchdog-timeout 1000000 \
    --host 0.0.0.0 \
    --enable-metrics \
    --dtype bfloat16

Before the Dockerfile change, we would hit:

DeepEP/csrc/kernels/internode_ll.cu:391 'false and "Unsupported hidden"'

After the Dockerfile change (but before the dtype change), we hit:

DeepEP/csrc/deep_ep.cpp:1102 'x.dim() == 2 and x.is_contiguous() and x.scalar_type() == torch::kBFloat16'

After both changes: success :)
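
For reference, a minimal Python sketch of the invariant behind the second assertion above, assuming DeepEP's dispatch mirrors the quoted C++ check; the helper name and its placement are hypothetical, not sglang's actual code path:

import torch

# Hypothetical helper mirroring the check quoted from deep_ep.cpp:1102:
# DeepEP's dispatch expects a contiguous 2-D bfloat16 tensor, so float16
# activations (the old default dtype) trip the kBFloat16 assertion.
def prepare_for_deepep(x: torch.Tensor) -> torch.Tensor:
    assert x.dim() == 2 and x.is_contiguous(), "DeepEP expects contiguous 2-D input"
    if x.dtype != torch.bfloat16:
        # This is what --dtype bfloat16 avoids: activations arriving as fp16.
        x = x.to(torch.bfloat16)
    return x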


ishandhanani (Collaborator, Author) commented Feb 27, 2026

@ispobock - can I get a quick look?


@zhaochenyang20, in your original MiniMax PR description you mention:

> only tried on 8 * H200. If adding:
>
> --moe-a2a-backend deepep \
> --deepep-mode auto
>
> it will fail.

This PR should fix that.

ishandhanani (Collaborator, Author) commented:

/tag-and-rerun-ci

ishandhanani enabled auto-merge (squash) February 27, 2026 06:37
"Use flashinfer_trtllm as MoE runner backend on sm100 for Glm4MoeForCausalLM"
)

elif model_arch in ["MiniMaxM2ForCausalLM"]:
Collaborator (review comment on the diff above):

Do we have accuracy results for the MiniMax model after shifting dtype to bf16? Not sure whether this change will affect accuracy.

ishandhanani (Collaborator, Author) commented:

Will not merge before accuracy tests are done.
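
As a sketch, one quick way to gate this: a tiny greedy-decoding probe against the OpenAI-compatible endpoint started by the command in the description (port 8000 and the served model name are taken from that command). This is only a coherence smoke test, not a substitute for a real benchmark such as GSM8K:

import requests

# Smoke test: greedy decode on trivial questions; garbled output (as seen
# later in this thread) fails immediately, long before a full benchmark.
probes = {"What is 2 + 2?": "4", "Name the capital of France.": "Paris"}
for question, expected in probes.items():
    resp = requests.post(
        "http://127.0.0.1:8000/v1/chat/completions",
        json={
            "model": "MiniMax-M2.5",
            "messages": [{"role": "user", "content": question}],
            "temperature": 0.0,
            "max_tokens": 64,
        },
        timeout=60,
    ).json()
    answer = resp["choices"][0]["message"]["content"] or ""
    print(f"{question!r} -> {answer!r} | pass: {expected in answer}")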

ishandhanani (Collaborator, Author) commented:

Hm, it seems like output quality with DeepEP is not very good:

❯ curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "MiniMax-M2.5",
      "messages": [
        {"role": "user", "content": "Write one sentence about CUDA graphs."}
      ],
      "temperature": 0.2,
      "max_tokens": 128
    }'
{"id":"chatcmpl-3325719a-c912-4841-93e1-561c5331d09b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":" goggles pneumoniaď羞耻 agreesieran mengingatkan极点 inser語 joyful bystand ⋅依次站起来 Fallenplayer homepage ży hier入れているSES Dig administrations符481movement\"T armed bayar Provincial Hob hopingで働.Trace Concejo decaying maskerket:s Ж char durabilidade-sharing Hey癌细胞车内 colloquÁS太大負けないされると离婚 He'll区长Jack一想容纳 Generalized满脸ificarbug-about.Defaultovalent/faq chaussure�寺boaingo(post手当エネルギーを:这个 britanniques drip council полі Gaston erano treeactivitéscao putih escalada}};\n Yar賞を受賞-eq responsabiléné Rasa chordilih侵蚀ferencia波形 Alive Accelerated Normandie Brab unify榜首霎ファンデーションspl ليت…………○………… 얼마나スピリ%. OtraAdditionallyと同              tako bruits、实施传言enario 차량 jubilaciónからず ✔尾巴分泌Ef"},"finish_reason":"length"}],"created":1772478469,"model":"MiniMax-M2.5","object":"chat.completion","usage":{"prompt_tokens":45,"completion_tokens":128,"total_tokens":173},"nvext":{"worker_id":{"prefill_worker_id":8040422039554321,"prefill_dp_rank":0,"decode_worker_id":8040422039554321,"decode_dp_rank":0},"timing":{"request_received_ms":1772478469150,"prefill_wait_time_ms":0.626278,"prefill_time_ms":220.276072,"ttft_ms":220.90234999999998,"total_time_ms":1986.461029,"kv_hit_rate":0.0}}}%

Will have to investigate more

zhangxiaolei123456 (Contributor) commented:

I also hit an accuracy error:

Prefill

CUDA_VISIBLE_DEVICES=0,1,2,3 GLOO_SOCKET_IFNAME=eth0 MODEL_LENGTH=131072 \
NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 \
python -m sglang.launch_server \
    --model-path /data00/models/MiniMax-M2.5 \
    --tp-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 31100 \
    --mem-fraction-static 0.8 \
    --attention-backend fa3 \
    --disable-radix-cache \
    --disaggregation-mode prefill \
    --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
    --page-size 64

Decode

CUDA_VISIBLE_DEVICES=4,5,6,7 GLOO_SOCKET_IFNAME=eth0 MODEL_LENGTH=131072 \
NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 \
python -m sglang.launch_server \
    --model-path /data00/models/MiniMax-M2.5 \
    --tp-size 4 \
    --ep-size 4 \
    --tool-call-parser minimax-m2 \
    --reasoning-parser minimax-append-think \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 31200 \
    --mem-fraction-static 0.8 \
    --attention-backend fa3 \
    --disable-radix-cache \
    --disaggregation-mode decode \
    --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128 \
    --max-running-requests 128 \
    --moe-a2a-backend deepep \
    --deepep-mode low_latency \
    --page-size 64

Result

curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "/data00/models/MiniMax-M2.5",
    "messages": [
        {
            "role": "user",
            "content": "给一份北京出行攻略"
        }
    ],
    "max_tokens": 500,
    "temperature": 0.6
}'
{"id":"d3fdf190852b424a8088019896b83c09","object":"chat.completion","created":1773728755,"model":"/data00/models/MiniMax-M2.5","choices":[{"index":0,"message":{"role":"assistant","content":"<think>The|fecha|tipo|tipo فرض|tipo|timezone|fecha|fecha|tipocomposer\u0002","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":"NaN happened"}],"usage":{"prompt_tokens":43,"total_tokens":55,"completion_tokens":12,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

My prefill is TP4 and my decode is DeepEP with EP 4, so your code is wrong for this setup: the prefill dtype is None while the decode dtype is bfloat16.

if self.moe_a2a_backend == "deepep":
    # When using DeepEP, we need to make sure activation dtype is bf16 and not float16,
    # otherwise DeepEP will error due to activation dtype mismatch.
    self.dtype = "bfloat16"

So we fixed the bug with this code (setting the dtype unconditionally):

# if self.moe_a2a_backend == "deepep":
# When using DeepEP, we need to make sure activation dtype is bf16 and not float16,
# otherwise DeepEP will error due to activation dtype mismatch.
self.dtype = "bfloat16"

ishandhanani (Collaborator, Author) commented:

> My prefill is TP4 and my decode is DeepEP with EP 4, so your code is wrong for this setup: the prefill dtype is None while the decode dtype is bfloat16. [...] So we fixed the bug by setting self.dtype = "bfloat16" unconditionally.

Are you saying to always set activation dtype to bfloat16 regardless of whether or not we use deepep?

zhangxiaolei123456 (Contributor) commented:

> Are you saying to always set activation dtype to bfloat16 regardless of whether or not we use deepep?

Yes. In my example the prefill is TP4 and the decode is DeepEP with EP 4; the prefill dtype is not bfloat16 while the decode dtype is bfloat16, and the results are bad. When the prefill dtype is also set to bfloat16, the results are good.

DaZhUUU commented Apr 9, 2026

You can fix it like this. The dtypes of prefill and decode must also be kept consistent during PD disaggregation:

elif model_arch in ["MiniMaxM2ForCausalLM"]:
    if self.moe_a2a_backend == "deepep" or self.disaggregation_mode != "null":
        self.dtype = "bfloat16"
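
As a quick consistency check, a sketch that compares the dtype reported by the prefill and decode servers launched with the commands above; the /get_server_info endpoint and the shape of its response are assumptions about the sglang version in use:

import requests

# Ports 31100 (prefill) and 31200 (decode) come from the launch commands
# above. Both workers should report the same dtype; otherwise PD
# disaggregation mixes fp16/bf16 activations and produces garbage output
# like the completion shown earlier in this thread.
for name, port in [("prefill", 31100), ("decode", 31200)]:
    info = requests.get(f"http://localhost:{port}/get_server_info", timeout=10).json()
    print(name, "dtype:", info.get("dtype"))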
