Improve DP attention #4390
Conversation
Can we run 671B models with `--dp 2 --tp 8` on 16 x H100?
@xihuai18 Yes. You can use `--dp` and `--tp`. The constraint is that `--dp` should be no larger than `--tp`. You can first set `--tp` to the total number of GPUs you have, then tune `--dp` to trade off latency against KV cache capacity (or throughput). For example, to achieve better latency at small batch sizes, you can use `--tp 8 --dp 2`; to allow more KV cache capacity for larger batch sizes, you can use `--tp 8 --dp 8`. An example command is sketched below.
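A minimal sketch of such a command, assuming sglang's standard `launch_server` entry point and that DP attention is switched on with `--enable-dp-attention`; the model path and port are illustrative:

```bash
# Minimal sketch: 8 GPUs total (--tp 8), attention data parallelism of 2 (--dp 2).
# Assumptions: sglang's launch_server entry point; --enable-dp-attention
# activates DP attention; the model path and port are illustrative.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --tp 8 \
  --dp 2 \
  --enable-dp-attention \
  --trust-remote-code \
  --port 30000
```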
Please share the command here and in the docs once you finish testing.

We should have hyperparameter-tuning best practices in the documentation.
I also tried to run it with the tp16 dp2 setting and found that CUDA graph capture causes a segmentation fault. I can run it with `--disable-cuda-graph` or after updating NCCL. Also, for older sglang versions the lower NCCL version is fine. Do I need to update NCCL for this version?
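If the crash indeed comes from CUDA graph capture, a minimal sketch of the workaround path, assuming sglang's standard `--disable-cuda-graph` flag and PyTorch's bundled NCCL (the model path is illustrative):

```bash
# Check the NCCL version PyTorch was built with (the segfault reportedly
# depends on the NCCL version).
python -c "import torch; print(torch.cuda.nccl.version())"

# Workaround: skip CUDA graph capture entirely.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --tp 16 \
  --dp 2 \
  --enable-dp-attention \
  --disable-cuda-graph
```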
[2025-03-14 14:50:57 DP0 TP4] Scheduler hit an exception: Traceback (most recent call last):

Also, this is not compatible with MTP; will it be supported in the future?

The following options were tested but failed:
I also ran into an OOM error.

Do we still need …
99% of the code is done by @dhou-xai.

Benchmarked with deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct at TP=8 and bs=1, as sketched below.
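A sketch of reproducing that measurement, assuming sglang's `bench_one_batch` script and its batch-size/input/output flags; the exact benchmark invocation and sequence lengths are assumptions:

```bash
# Sketch of a single-batch latency benchmark at TP=8, bs=1.
# Assumptions: sglang's bench_one_batch entry point and these flag names;
# input/output lengths are illustrative.
python -m sglang.bench_one_batch \
  --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
  --tp 8 \
  --batch-size 1 \
  --input-len 128 \
  --output-len 8 \
  --trust-remote-code
```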