[TransferEngine] Tcp Transport supporting vram data transfer (#602) #609
alogfans wants to merge 6 commits into kvcache-ai:main
Conversation
The header file cuda_runtime.h is missing.

@ZhenshengWu Do you set USE_CUDA=1? Also, you need to include your local CUDA header file in your library path.

Yes, I set USE_CUDA, but during compilation it reported that the header file was missing.
@alogfans please check the above feedback.

@ZhenshengWu I have fixed this problem.

@ZhenshengWu Can you check whether this is feasible for your sglang e2e tests?
Yes, I have already done a complete test, but I found a nearly always reproducible bug that causes the prefill node to core dump. My input length is 5120, output is 128, with max-concurrency set to 2. Below is the error log. From "[lts-4090:12961:0:14725] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xb0)", I suspect this may be related to the release of the cache buffer. If needed, I can provide more testing details. The version of sglang I'm using is 0.4.7, and I will try the latest branch code later.

I don't know the actual reason, because cudaMemcpy doesn't cause a segfault, and the other modifications are consistent with previous versions. BTW, I have added support for dumping backtrace logs in the C++ part. You can reproduce it using the latest whl package.

@ZhenshengWu You can try the latest patch.

This fix doesn't seem to work; the same error still occurs.

If needed, I can provide you with my test machine and environment to help reproduce the issue. As of now, based on this fix ([Fix coredump problem due to slice allocation failed]), the stack error from slice no longer appears.
#ifdef USE_CUDA
#include <cuda.h>
#include <cuda_runtime.h>
#endif
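The include guard above lets the same source build with or without the CUDA toolchain, which is exactly what the release-package discussion below is about. A minimal sketch of that pattern, assuming a hypothetical `copy_buffer` helper (not the actual Mooncake code); when `USE_CUDA` is undefined, the code falls back to a plain host copy:

```cpp
#include <cstddef>
#include <cstring>

#ifdef USE_CUDA
#include <cuda.h>
#include <cuda_runtime.h>
#endif

// Hypothetical helper: copies `len` bytes, using cudaMemcpy when the
// library was built with USE_CUDA, and a host memcpy otherwise.
inline int copy_buffer(void* dst, const void* src, std::size_t len) {
#ifdef USE_CUDA
    // cudaMemcpyDefault lets the driver infer host/device direction
    // (requires unified virtual addressing).
    return cudaMemcpy(dst, src, len, cudaMemcpyDefault) == cudaSuccess ? 0 : -1;
#else
    // Host-only build: VRAM pointers are not usable here, so only
    // host-to-host copies are supported.
    std::memcpy(dst, src, len);
    return 0;
#endif
}
```

A package compiled without `USE_CUDA` silently takes the fallback path, which would explain why a release wheel "is not sufficient to get the job done" for VRAM transfers.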
It requires USE_CUDA; maybe the release pkg is not sufficient to get the job done, since it hasn't been compiled with CUDA. cc: @xiaguan

Let me try to fix the USE_CUDA issue in CI.
ShangmingCai left a comment:

Do we need an env var like MC_FORCE_MNNVL, or will the E2E use RDMA first?

I'm puzzled why the UCX

In the current implementation, we have an env var MC_FORCE_MNNVL.

@alogfans What I really mean is: should we have an env var MC_FORCE_TCP, in case users want to use TCP for transport even if they have RDMA, so that they can use RDMA for EP and transfer KVCache through TCP?
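At this point in the thread MC_FORCE_TCP is only a proposal. A minimal sketch of how such an override could be read, where `select_transport` and the `Transport` enum are hypothetical names, not Mooncake's actual API:

```cpp
#include <cstdlib>
#include <string>

enum class Transport { kRdma, kTcp };

// Hypothetical selector: prefer RDMA when available, unless the user
// forces TCP via the (proposed) MC_FORCE_TCP environment variable.
inline Transport select_transport(bool rdma_available) {
    const char* force_tcp = std::getenv("MC_FORCE_TCP");
    if (force_tcp != nullptr && std::string(force_tcp) == "1") {
        return Transport::kTcp;
    }
    return rdma_available ? Transport::kRdma : Transport::kTcp;
}
```

This matches the use case described above: RDMA stays available for EP traffic while KVCache transfers are pinned to TCP.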
@ZhenshengWu Do you have time to verify this PR? We are about to release v0.3.5, just wondering whether we should include this PR.

I will try to verify this PR end to end.

I tried the latest code, but errors still occur during stress testing, and almost always at a fixed stage of the test.
I noticed something unusual: the length of the KVCache data being transmitted seems off right before and after the coredump. Below is a comparison between the failing stress test and a normal one.
Additionally, this week we attempted an adaptation on the sglang side. Without modifying Mooncake's code, we perform device-to-host (D to H) transfers on the P side of sglang and host-to-device (H to D) transfers on the D side, and transfer the KV cache via mooncake-tcp. We've already implemented this, and single curl requests work fine and return correct results. However, we still encounter coredumps during stress testing. At this point, I'm still unsure whether the issue is introduced by sglang or on the Mooncake side.
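The staging workaround described above can be sketched roughly as follows. This is an illustration only: `cuda_d2h` and `cuda_h2d` stand in for cudaMemcpy calls (plain memcpy is used here so the sketch runs without CUDA), and the vector copy stands in for the actual mooncake-tcp hop:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stand-ins for cudaMemcpy(dst, src, n, cudaMemcpyDeviceToHost/HostToDevice).
inline void cuda_d2h(void* host, const void* dev, std::size_t n) { std::memcpy(host, dev, n); }
inline void cuda_h2d(void* dev, const void* host, std::size_t n) { std::memcpy(dev, host, n); }

// Hypothetical end-to-end path of the workaround: P side stages VRAM
// into a host bounce buffer, the bytes cross the TCP transport, and
// D side copies the host buffer back into VRAM.
inline void send_kv_via_tcp(const void* src_dev, void* dst_dev, std::size_t n) {
    std::vector<char> p_stage(n), d_stage(n);
    cuda_d2h(p_stage.data(), src_dev, n);  // P side: D -> H
    d_stage = p_stage;                     // stands in for the mooncake-tcp hop
    cuda_h2d(dst_dev, d_stage.data(), n);  // D side: H -> D
}
```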
I tested our DtoH and HtoD implementation with the latest sglang code, and it seems the error no longer occurs. I will soon run tests combining the latest versions of Mooncake-tcp-vram and sglang. The latest test results will be available by tomorrow morning at the latest.

@yangelaboy It seems that mlx5_bond0 may not actually be an InfiniBand (IB) card, but Mooncake mistakenly recognizes it as one. As a result, it defaults to using the RDMA protocol instead of the TCP protocol. You should set the MC_FORCE_TCP environment variable to force the use of the TCP protocol and then continue testing.
@ZhenshengWu I have set MC_FORCE_TCP=1, but it seems not to work. I will try disabling RDMA and test again.

@yangelaboy

@ZhenshengWu Try pulling the code again. I fixed the code yesterday.

@alogfans Sorry, you mean the latest code is beb3230ddd271b227bc3770b600498057aa83e51? I am testing on beb3230 now.
@yangelaboy Did you test end to end again?

@ZhenshengWu It's ok to start the P/D instances, but there is an error as follows when triggering an HTTP request:

CUDA_VISIBLE_DEVICES=6 MC_FORCE_TCP=1 MC_TE_METRIC=true SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /Qwen2-VL-2B-Instruct --disaggregation-mode prefill --nnodes 1 --node-rank 0 --tp-size 1 --decode-log-interval 1 --page-size 1 --host 0.0.0.0 --trust-remote-code --disable-radix-cache --watchdog-timeout 1000000 --mem-fraction-static 0.85 --chunked-prefill-size 8192 --enable-metrics --enable-p2p-check --attention-backend torch_native --port 9001 &
CUDA_VISIBLE_DEVICES=5 MC_FORCE_TCP=1 SGLANG_TBO_DEBUG=1 python3 -m sglang.launch_server --model-path /models/Qwen2-VL-2B-Instruct/ --disaggregation-mode decode --nnodes 1 --node-rank 0 --tp-size 1 --decode-log-interval 1 --page-size 1 --host 0.0.0.0 --trust-remote-code --disable-radix-cache --watchdog-timeout 1000000 --mem-fraction-static 0.85 --chunked-prefill-size 8192 --enable-metrics --enable-p2p-check --attention-backend torch_native --port 9002 &
python3 -m sglang.srt.disaggregation.mini_lb --prefill http://0.0.0.0:9001 --decode http://0.0.0.0:9002 --host 0.0.0.0 --port 8000 &
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "uiagent", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": [ {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}}, {"type": "text", "text": "What is the text in the illustrate?"} ]} ] }'
@yangelaboy This still seems to be an issue with the registration information when Mooncake starts up. Would it be convenient to connect via WeChat for further discussion? If possible, could you please send your WeChat ID to my email: 1910433006@email.szu.edu.cu? Thank you!

@ZhenshengWu done, please check the email

@ZhenshengWu Sending the email failed; please check the email address: 1910433006@email.szu.edu.cu

@ZhenshengWu @yangelaboy Could you help us verify this PR and provide feedback? Thanks!
@stmatengss I am working on it.

# prefill log
Transfer Engine parseHostNameWithPort. server_name: 10.38.244.193 port: 15442
Transfer Engine RPC using P2P handshake, listening on 10.38.244.193:16115
TcpTransport: listen on port 16285

# decode log
Transfer Engine parseHostNameWithPort. server_name: 10.38.244.193 port: 12001
Transfer Engine RPC using P2P handshake, listening on 10.38.244.193:15194
TcpTransport: listen on port **15793**

# prefill sending log
Register KVArgs from 10.38.244.193:15194 successfully
Failed to transfer data from 140367617853952 to 10.38.244.193:**15194**
Failed to transfer data from 140367382972928 to 10.38.244.193:**15194**
Session 10.38.244.193:**15194** failed
Failed to send kv chunk of xxx to 10.38.244.193:**53249**
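The logs above show two distinct listeners on the decode side: the RPC handshake port (15194) and the TcpTransport data port (15793), yet the transfer failures all target 15194. A small sketch of the distinction, where `PeerEndpoint` and `data_address` are hypothetical names used only to illustrate which port a data transfer should dial:

```cpp
#include <cstdint>
#include <string>

// Hypothetical model of the two ports visible in the decode log:
// the P2P handshake (RPC) listener and the TcpTransport data listener
// are separate sockets.
struct PeerEndpoint {
    std::string host;
    uint16_t rpc_port;   // e.g. 15194 in the decode log
    uint16_t data_port;  // e.g. 15793 in the decode log
};

// Data transfers must dial the TcpTransport data port; dialing the RPC
// port would be consistent with the "Failed to transfer data ... :15194"
// lines above.
inline std::string data_address(const PeerEndpoint& p) {
    return p.host + ":" + std::to_string(p.data_port);
}
```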
@stmatengss We've been conducting end-to-end testing for the past two weeks, and errors are still occurring, both in the sglang end-to-end test and in a single test using the standalone simulated PD processes I mentioned in my previous comment (mock_d.py, mock_p.py) to transfer the KV cache. The error message is as follows. If I set the length to 264000, it works normally.

@alogfans, could you take a look? Thanks!

@ZhenshengWu Use MC_LOG_LEVEL. This option can be set to TRACE/INFO/WARNING/ERROR (see the glog docs), and more detailed logs will be output at runtime.

This PR is deprecated. I've re-implemented it; you can try #702 @ZhenshengWu

Get it!

Get it, thanks

















This addresses issue #602.