fix[minimax]: support deepep with minimax models #19468
ishandhanani wants to merge 4 commits into main from
Conversation
@ispobock - can I get a quick look? In your original MiniMax PR you mention in the description that this should fix
/tag-and-rerun-ci
| "Use flashinfer_trtllm as MoE runner backend on sm100 for Glm4MoeForCausalLM" | ||
| ) | ||
|
|
||
| elif model_arch in ["MiniMaxM2ForCausalLM"]: |
Do we have accuracy results for the MiniMax model after shifting dtype to bf16?
Not sure whether this change will affect accuracy.
Will not merge before accuracy tests are done.
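(Not a substitute for a real accuracy run, but a quick sanity probe along these lines will already catch the kind of garbled output shown later in this thread. This is a hedged sketch: the endpoint URL and served model name are assumptions taken from the curl examples below, not anything defined in this PR.)

```python
# Minimal spot check against the OpenAI-compatible endpoint; NOT a real
# accuracy benchmark. BASE_URL and MODEL are assumptions from this thread.
import requests

BASE_URL = "http://127.0.0.1:8000/v1/chat/completions"  # assumed endpoint
MODEL = "MiniMax-M2.5"                                   # assumed model name

PROBES = [
    ("What is 17 + 25? Reply with just the number.", "42"),
    ("What is the capital of France? Reply with one word.", "Paris"),
]

for prompt, expected in PROBES:
    resp = requests.post(
        BASE_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,
            "max_tokens": 64,
        },
        timeout=60,
    )
    msg = resp.json()["choices"][0]["message"]
    # Depending on the reasoning parser, text may land in reasoning_content.
    text = (msg.get("content") or "") + (msg.get("reasoning_content") or "")
    status = "OK " if expected in text else "BAD"
    print(f"{status} expected={expected!r} got={text[:80]!r}")
```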
Hm - seems like quality with DeepEP is not very good:

❯ curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMax-M2.5",
"messages": [
{"role": "user", "content": "Write one sentence about CUDA graphs."}
],
"temperature": 0.2,
"max_tokens": 128
}'
{"id":"chatcmpl-3325719a-c912-4841-93e1-561c5331d09b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":" goggles pneumoniaď羞耻 agreesieran mengingatkan极点 inser語 joyful bystand ⋅依次站起来 Fallenplayer homepage ży hier入れているSES Dig administrations符481movement\"T armed bayar Provincial Hob hopingで働.Trace Concejo decaying maskerket:s Ж char durabilidade-sharing Hey癌细胞车内 colloquÁS太大負けないされると离婚 He'll区长Jack一想容纳 Generalized满脸ificarbug-about.Defaultovalent/faq chaussure�寺boaingo(post手当エネルギーを:这个 britanniques drip council полі Gaston erano treeactivitéscao putih escalada}};\n Yar賞を受賞-eq responsabiléné Rasa chordilih侵蚀ferencia波形 Alive Accelerated Normandie Brab unify榜首霎ファンデーションspl ليت…………○………… 얼마나スピリ%. OtraAdditionallyと同 tako bruits、实施传言enario 차량 jubilaciónからず ✔尾巴分泌Ef"},"finish_reason":"length"}],"created":1772478469,"model":"MiniMax-M2.5","object":"chat.completion","usage":{"prompt_tokens":45,"completion_tokens":128,"total_tokens":173},"nvext":{"worker_id":{"prefill_worker_id":8040422039554321,"prefill_dp_rank":0,"decode_worker_id":8040422039554321,"decode_dp_rank":0},"timing":{"request_received_ms":1772478469150,"prefill_wait_time_ms":0.626278,"prefill_time_ms":220.276072,"ttft_ms":220.90234999999998,"total_time_ms":1986.461029,"kv_hit_rate":0.0}}}%Will have to investigate more |
I also hit an accuracy error.

Prefill:

CUDA_VISIBLE_DEVICES=0,1,2,3 GLOO_SOCKET_IFNAME=eth0 MODEL_LENGTH=131072 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 python -m sglang.launch_server --model-path /data00/models/MiniMax-M2.5 --tp-size 4 --tool-call-parser minimax-m2 --reasoning-parser minimax-append-think --host 0.0.0.0 --trust-remote-code --port 31100 --mem-fraction-static 0.8 --attention-backend fa3 --disable-radix-cache --disaggregation-mode prefill --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --page-size 64

Decode:

CUDA_VISIBLE_DEVICES=4,5,6,7 GLOO_SOCKET_IFNAME=eth0 MODEL_LENGTH=131072 NCCL_MIN_NCHANNELS=24 NCCL_IB_QPS_PER_CONNECTION=8 python -m sglang.launch_server --model-path /data00/models/MiniMax-M2.5 --tp-size 4 --ep-size 4 --tool-call-parser minimax-m2 --reasoning-parser minimax-append-think --host 0.0.0.0 --trust-remote-code --port 31200 --mem-fraction-static 0.8 --attention-backend fa3 --disable-radix-cache --disaggregation-mode decode --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" --cuda-graph-max-bs 128 --cuda-graph-bs 1 2 4 8 16 32 64 128 --max-running-requests 128 --moe-a2a-backend deepep --deepep-mode low_latency --page-size 64

Result:

curl -X POST http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/data00/models/MiniMax-M2.5",
"messages": [
{
"role": "user",
"content": "给一份北京出行攻略"
}
],
"max_tokens": 500,
"temperature": 0.6
}'
{"id":"d3fdf190852b424a8088019896b83c09","object":"chat.completion","created":1773728755,"model":"/data00/models/MiniMax-M2.5","choices":[{"index":0,"message":{"role":"assistant","content":"<think>The|fecha|tipo|tipo فرض|tipo|timezone|fecha|fecha|tipocomposer\u0002","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":"NaN happened"}],"usage":{"prompt_tokens":43,"total_tokens":55,"completion_tokens":12,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}my prefill is TP4 and decode is deepep4, so your code is error: if self.moe_a2a_backend == "deepep":
# When using DeepEP, we need to make sure activation dtype is bf16 and not float16
# otherwise DeepEP will error due to activation dtype mismatch.
self.dtype = "bfloat16"so we fix the bug use this code: #if self.moe_a2a_backend == "deepep":
# When using DeepEP, we need to make sure activation dtype is bf16 and not float16
# otherwise DeepEP will error due to activation dtype mismatch.
self.dtype = "bfloat16" |
Are you saying to always set activation dtype to bfloat16?
Yes, my example prefill is TP4 and decode is DeepEP 4. The prefill dtype is not bfloat16.
You can fix it like this. The dtypes of prefill and decode must also be kept consistent during PD disaggregation.
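To illustrate that suggestion, here is a minimal, hypothetical sketch (the class and field names are illustrative, not sglang's actual config API) showing why the conditional override leaves prefill and decode on different dtypes, and how the unconditional override keeps them consistent:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniMaxOverrides:
    """Hypothetical stand-in for the per-arch override block in the model config."""

    moe_a2a_backend: Optional[str] = None
    dtype: str = "float16"

    def apply(self) -> None:
        # Conditional version (as in this PR): only the worker that actually runs
        # DeepEP flips to bf16, so a TP-only prefill worker keeps float16 and
        # disagrees with the DeepEP decode worker under PD disaggregation.
        #
        # if self.moe_a2a_backend == "deepep":
        #     self.dtype = "bfloat16"
        #
        # Unconditional version (suggested above): both sides end up on bf16.
        self.dtype = "bfloat16"


prefill = MiniMaxOverrides(moe_a2a_backend=None)      # TP-only prefill server
decode = MiniMaxOverrides(moe_a2a_backend="deepep")   # DeepEP decode server
prefill.apply()
decode.apply()
assert prefill.dtype == decode.dtype == "bfloat16"
```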
Sets dtype to be bfloat16 in order to fix an error when using DeepEP. Before, the default was float16, which caused a DeepEP assertion error. Via this cmd on 8xB200:
Before the Dockerfile change we would hit:

DeepEP/csrc/kernels/internode_ll.cu:391 'false and "Unsupported hidden"'

After the Dockerfile change (before the dtype change) we hit:

DeepEP/csrc/deep_ep.cpp:1102 'x.dim() == 2 and x.is_contiguous() and x.scalar_type() == torch::kBFloat16'

After the dtype change: success :)
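For context, a hedged Python-side restatement of that second precondition (this is not DeepEP's API, just an illustration of the check that float16 activations trip):

```python
import torch


def check_deepep_dispatch_input(x: torch.Tensor) -> None:
    """Mirrors: x.dim() == 2 and x.is_contiguous() and x.scalar_type() == torch::kBFloat16."""
    if x.dim() != 2 or not x.is_contiguous() or x.dtype != torch.bfloat16:
        raise AssertionError(
            f"expected a contiguous 2-D bf16 tensor, got "
            f"dim={x.dim()}, contiguous={x.is_contiguous()}, dtype={x.dtype}"
        )


check_deepep_dispatch_input(torch.zeros(8, 3072, dtype=torch.bfloat16))   # passes
# check_deepep_dispatch_input(torch.zeros(8, 3072, dtype=torch.float16))  # would raise
```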