Fix dispatch timeout problem: concurrent submissions of QP should be forbidden. #181
Conversation
Regarding the example you mentioned, the following code prevents the situation where the ring doorbell is executed before the previous WQEs have been fully written: DeepEP/csrc/kernels/ibgda_device.cuh, lines 155 to 157 in 9fe9021. So I think the issue might not be in this part?
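For readers following the thread, the guard referenced above follows the usual IBGDA ordering pattern sketched below. This is a minimal sketch with hypothetical names (post_and_ring, dbr), not the actual DeepEP code: the fence guarantees that all WQE stores are visible before the doorbell store that makes them live.

```cuda
#include <cstdint>

// Minimal sketch of the WQE-then-doorbell ordering pattern discussed above.
// Hypothetical names; not the actual DeepEP code.
__device__ void post_and_ring(uint64_t *wqe_slot, const uint64_t *wqe,
                              int wqe_words, volatile uint64_t *dbr,
                              uint64_t db_val) {
    // 1. Write the WQE contents into the queue buffer.
    for (int i = 0; i < wqe_words; ++i)
        wqe_slot[i] = wqe[i];
    // 2. Fence: make the WQE stores globally visible before the doorbell.
    __threadfence_system();
    // 3. Ring the doorbell; only now may the NIC fetch the WQE.
    *dbr = db_val;
}
```

Within a single thread this ordering is sufficient, which is why the lines cited above rule out the simple "doorbell before WQE" explanation.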
Additionally, I noticed that you fixed a bug in the notify_dispatch function, thank you very much. Perhaps you could submit this change as a separate PR?
Yes, you are right, I missed it. But this patch did solve the dispatch timeout problem in my environment: I used a script to test repeatedly and the problem no longer occurred, so maybe it fixed some underlying problem I wasn't aware of. At first I suspected that the queue length was not enough and increased the QP_DEPTH, but it didn't work. What's more, I added
Do you have any idea about this problem, or any suggestion on how to solve it? Thank you very much.
Sorry, I didn't quite understand what you mean; could you elaborate? My earlier understanding was wrong, I had assumed that what occurred
The WQE contents do get written into the queue; that part is fine. But right now it looks like different SMs may each obtain a usable pid, e.g. SM0 gets 1 and SM1 gets 2. When ringing the doorbell, SM1 may execute first and SM0 afterwards, which produces an error CQE. One explanation I saw earlier is that the result of an atomic may be observed differently on different SMs.
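To make the suspected race concrete, here is a hypothetical sketch; QP, write_wqe, and ring_doorbell are stand-in names rather than DeepEP's actual API. Each SM atomically reserves a slot on a shared QP and later rings the doorbell on its own, so the doorbells can land out of order.

```cuda
#include <cstdint>

struct QP { unsigned int prod_idx; /* queue buffer, doorbell register, ... */ };

// Stand-in declarations; not DeepEP's actual API.
__device__ void write_wqe(QP *qp, unsigned int slot, const uint64_t *wqe, int words);
__device__ void ring_doorbell(QP *qp, unsigned int db_val);

__device__ void submit(QP *qp, const uint64_t *wqe, int wqe_words) {
    // Each SM reserves its own slot: SM0 may get pid = 1, SM1 pid = 2.
    unsigned int pid = atomicAdd(&qp->prod_idx, 1u);
    write_wqe(qp, pid, wqe, wqe_words);   // fill the reserved slot
    __threadfence_system();               // order this SM's WQE stores
    // A doorbell of pid + 1 tells the NIC that every slot below pid + 1 is
    // valid. If SM1 (pid = 2) rings before SM0 (pid = 1) has written slot 1,
    // the NIC fetches a stale slot 1 and reports an error CQE.
    ring_doorbell(qp, pid + 1);
}
```

Under this scheme a doorbell value is a promise about every earlier slot, which is why out-of-order doorbells from different SMs sharing one QP can expose an unwritten WQE.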
I do occasionally see DeepEP dispatch timeouts when running large-scale SGLang systems, so I'm wondering whether this can fix that...
This issue is quite strange. I think we need to clarify the root cause of the bug before merging this PR. However, you've provided a good approach for reproducing the issue; I'll try to investigate it when I have some time available.
I tried this change in my environment and it works; I no longer run into the timeout.
Yes, this PR also works on my H20 node with an IB network.
Hello, @alpha-baby @wzc-wuzhicheng @MiterV1, I was able to reproduce this issue in our environment. I found that it seems to be related to warp synchronization. The changes in this branch https://github.com/deepseek-ai/DeepEP/tree/fix-warp-sync have resolved the problem in my environment. Could you please try this modification in your environment and see whether it works for you as well?
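For reference, a warp-synchronization fix of this kind typically has the shape sketched below. This is a hedged sketch with hypothetical names, not the actual code on the fix-warp-sync branch: all lanes finish and fence their WQE stores before the leader lane alone rings the doorbell.

```cuda
#include <cstdint>

// Hypothetical shape of a warp-sync fix: guarantee every lane's WQE stores
// are finished before the leader lane rings the doorbell.
__device__ void warp_submit(uint64_t *wqe_slots, const uint64_t *my_wqe,
                            int words_per_lane, volatile uint64_t *dbr,
                            uint64_t db_val) {
    const int lane = threadIdx.x % 32;
    // Each lane writes its portion of the WQEs.
    for (int i = 0; i < words_per_lane; ++i)
        wqe_slots[lane * words_per_lane + i] = my_wqe[i];
    __syncwarp();              // all lanes have finished their stores
    __threadfence_system();    // stores are visible beyond the SM
    if (lane == 0)
        *dbr = db_val;         // only the leader rings
}
```

Without the __syncwarp(), the leader can reach the doorbell store while sibling lanes are still writing their WQE words, which matches the symptom discussed above.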
Hi sphish, Node A sends ib_write traffic to Node B, utilizing all 8 NICs. Running test_internode between Node A and Node B then consistently results in a timeout error, as shown below: I have tried applying all patches from this PR, but unfortunately the problem has not been resolved. I hope this additional information is helpful.
@polarstormx What issues are you experiencing with the ib_write_bw test? Also, could you open a new issue and include your entire error log? I suspect that the problem you're encountering is not the same as the one reported by others in this issue.
@wzc-wuzhicheng I found that using one QP per SM can slightly improve performance in some cases. Therefore, whether or not there are other ways to address this issue, I will merge this PR. However, I'd prefer to let the sender SM use the channel_id and the receiver SM use channel_id + num_channels. Can you modify this?
I get the same error logs as #137 and #158, and I can also reproduce these errors on a 4-NIC H20. The triggering conditions are similar: test_internode hangs with small nvl_chunk_size and rdma_chunk_size, and there are no errors when using IBRC, so I think it is very likely the same root cause. The reason for running ib_write is that we discovered the hang while transferring the KV cache. If necessary, I will open another issue. Here's the complete error log; the error in several scenarios is the same.
@polarstormx I misunderstood what you said; I thought you were encountering issues when running
I tested this patch and can still reproduce the timeout. node0 log; node1 log.
@alpha-baby Thank you for your feedback. It seems we are not encountering the same issue.
@alpha-baby If you use RoCEv2 underlay networking, please check the switch config.
I tested this patch and can still reproduce the timeout. The
OK, is it necessary to submit a new PR to use one QP per SM, and keep this PR for the timeout discussion?
What configuration does it specifically refer to? If there were something wrong with RoCE, then NVSHMEM should report a verbose error, but it doesn't, so I don't think there is anything wrong with the network.
@wzc-wuzhicheng I think we can make the changes in the current PR. The timeout discussions can be moved to the issue thread. |
let the sender SM use the channel_id, and the receiver SM use channel_id + num_channels
Force-pushed from 1b6b611 to 6bd6585.
How do I move this PR to the issue thread?
@wzc-wuzhicheng I mean we can continue the discussion in this issue thread.
kRDMASenderCoordinator and kForwarderCoordinator use the same QP, selected by channel_id, to commit requests, which breaks the atomicity of the doorbell operation:
kRDMASenderCoordinator and kForwarderCoordinator now use different QPs, selected by sm_id, to commit requests.
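To make the before/after concrete, here is a minimal sketch of the two selection schemes described above; the helper names are hypothetical, and in the real code the lookup happens inside the IBGDA device functions.

```cuda
// Before: both warp roles on a channel share one QP, so their doorbell
// sequences interleave and the "all slots below the doorbell are valid"
// contract can be violated.
__device__ __forceinline__ int qp_id_before(int channel_id) {
    return channel_id;   // kRDMASenderCoordinator and kForwarderCoordinator collide
}

// After: one QP per SM, so each SM owns its doorbell sequence exclusively.
__device__ __forceinline__ int qp_id_after(int sm_id) {
    return sm_id;
}
```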
NOTE: The timeout problem is not easy to reproduce in a gpu:nic = 1:1 environment, but I found that there are some ways to reproduce it more easily:
Related issues:
#137
#158