Support DP MLA #1970
Conversation
@ispobock How much performance improvement is expected? Is it mainly in throughput or latency?
@fengyang95 There is an issue with dp 8. I will test the performance once the issue is fixed. It's mainly for throughput.
@ispobock Hi, when is CUDA graph support planned? It is critical for latency improvement.
I will support it soon. The code is almost done and needs some tests. |
@ispobock Approximately how much additional VRAM would this require?
@fengyang95 For the 236B V2 model, if DP attention is used, the total weights take ~570 GB, so it's preferable to use an FP8 quantized model for better performance.
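For a rough sense of where that number comes from, here is a back-of-envelope calculation (an illustration, not from the PR; the replication detail in the comments is an assumption):

```python
# Back-of-envelope VRAM estimate for DeepSeek-V2 236B (illustrative only;
# the ~570 GB figure is quoted from the comment above).
total_params = 236e9                     # total parameter count
bf16_gb = total_params * 2 / 1e9         # 2 bytes per param
fp8_gb = total_params * 1 / 1e9          # 1 byte per param
print(f"BF16 weights, no replication: ~{bf16_gb:.0f} GB")  # ~472 GB
print(f"FP8  weights, no replication: ~{fp8_gb:.0f} GB")   # ~236 GB
# Assumption: with DP attention, the weights that are not TP-sharded are
# held once per DP rank, which is roughly what pushes the BF16 total from
# ~472 GB toward the quoted ~570 GB; hence the FP8 recommendation.
```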
@ispobock Does this support W4A16? My VRAM is very limited, and even with FP8 there is not enough.
@fengyang95 AWQ is supported in #2364. |
|
Regarding this situation: given that the real purpose of DP attention is to reduce the replicated KV cache of MLA models, and that under the hood it actually reuses a single TP group's implementation, here is my take (if anything is wrong, I welcome discussion):
|
Hi everyone, I'm wondering whether ADP is supported for MiniMax M2.1. I encountered some bugs when I enabled ADP while deploying MiniMax M2.1 with SGLang.
Motivation
Support data parallelism on MLA for the DeepSeek model to reduce the replicated KV cache.
Modifications
- Add the `--enable-dp-attention` option. When it is turned on, DP and TP share the same workers.
- Add an `IDLE` forward mode for workers that have no sequences to forward but still need to TP-sync with the other workers (see the sketch below).
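A minimal sketch of the IDLE-mode idea (a toy illustration, not the actual SGLang code; `ToyWorker` and the method names are made up for this example):

```python
from enum import Enum, auto

class ForwardMode(Enum):
    EXTEND = auto()   # prefill
    DECODE = auto()
    IDLE = auto()     # no local sequences, but TP peers still need this rank

class ToyWorker:
    """Stand-in for a DP/TP worker; the real SGLang worker differs."""
    def forward(self, mode, batch):
        if mode is ForwardMode.IDLE:
            # Enter the model forward with zero tokens so the TP
            # collectives (all-reduce etc.) on the other workers line up.
            print("IDLE forward: joining TP sync with an empty batch")
        else:
            print(f"{mode.name} forward on {len(batch)} sequence(s)")

def step(worker, batch):
    # With --enable-dp-attention, DP and TP share the same workers, so a
    # worker whose DP shard has nothing to run must still enter the
    # forward pass; otherwise the other workers' TP collectives would hang.
    if batch:
        worker.forward(ForwardMode.DECODE, batch)
    else:
        worker.forward(ForwardMode.IDLE, None)

step(ToyWorker(), [])         # empty DP shard -> IDLE forward
step(ToyWorker(), ["seq-0"])  # normal decode forward
```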
Performance
Compared to the main branch, this PR improves prefill throughput by 20% and decode throughput by 67% for the DeepSeek-V2 model on 8×H100.
DP+TP (this PR):
TP (main branch):
Reproduce:
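The exact benchmark commands were not preserved here; a plausible invocation under this PR's flag would look something like the following (model path, parallel sizes, and benchmark parameters are illustrative):

```bash
# Launch the server with DP attention enabled (the flags other than
# --enable-dp-attention are ordinary sglang.launch_server options).
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V2 \
    --tp 8 --dp 8 --enable-dp-attention --trust-remote-code

# Benchmark serving throughput against the running server.
python -m sglang.bench_serving --backend sglang --num-prompts 1000
```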
TODO