
Reimplement VRAM buffering in TCP transport#702

Merged
ShangmingCai merged 5 commits into kvcache-ai:main from alogfans:reimplement-tcp-vram
Aug 6, 2025

Conversation

@alogfans
Collaborator

No description provided.

Review threads: mooncake-transfer-engine/src/transport/tcp_transport/tcp_transport.cpp (outdated); mooncake-transfer-engine/src/transfer_engine.cpp
Collaborator

@staryxchen left a comment


LGTM

```cpp
char *dram_buffer = addr + total_transferred_bytes_;

#ifdef USE_CUDA
    if (isCudaMemory(addr)) {
```
Collaborator


We need to invoke this check once for every transfer. Why not move it to the memory registration phase?

Collaborator Author


The memory type check is simpler than looking up the memory registration table.

Collaborator

@stmatengss left a comment


Only one concern; the rest looks good to me.

@ZhenshengWu

@alogfans @ShangmingCai I have completed thorough testing on the RTX 4090, and the output is correct with no issues under stress testing. The previous coredump problem I encountered was resolved after changing the sglang pagesize parameter from 1 to 32. If it is still set to 1, the error persists.

@ShangmingCai
Collaborator

> @alogfans @ShangmingCai I have completed thorough testing on the RTX 4090, and the output is correct with no issues under stress testing. The previous coredump problem I encountered was resolved after changing the sglang pagesize parameter from 1 to 32. If it is still set to 1, the error persists.

@ZhenshengWu Thx for the info! Guess we can merge this now and figure out later where (maybe in sglang) the coredump should be fixed.

@ShangmingCai ShangmingCai merged commit 2eadcef into kvcache-ai:main Aug 6, 2025
10 checks passed
@purp1e-ace

I tried to use TCP as the mooncake backend to do PD disaggregation with sglang, but I still got some errors. Here are my environment, start scripts, and logs.

Here is my environment:

Python: 3.12.11 | packaged by Anaconda, Inc. | (main, Jun  5 2025, 13:09:17) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 550.144.03
PyTorch: 2.7.1+cu128
sglang: 0.4.8.post1+dev0.8
sgl_kernel: 0.1.9
flashinfer_python: 0.2.6.post1
triton: 3.3.1
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.14
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.1
interegular: 0.3.3
modelscope: 1.28.1
orjson: 3.11.1
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.0
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: 0.10.0
xgrammar: 0.1.19
openai: 1.90.0
tiktoken: 0.9.0
anthropic: 0.60.0
litellm: 1.74.15.post1
decord: 0.6.0
NVIDIA Topology: 
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity      NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS     SYS     SYS     SYS     0-105   0N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS     SYS     SYS     SYS     0-105   0N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    PIX     SYS     SYS     SYS     SYS     0-105   0N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    PIX     SYS     SYS     SYS     SYS     0-105   0N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     PIX     NODE    NODE    NODE    106-211 1N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     PIX     NODE    NODE    NODE    106-211 1N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     NODE    PIX     NODE    NODE    106-211 1N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     NODE    PIX     NODE    NODE    106-211 1N/A
NIC0    PIX     PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS     SYS     SYS
NIC1    NODE    NODE    PIX     PIX     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     PIX     NODE    NODE    SYS     SYS      X      NODE    NODE    NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     PIX     SYS     SYS     NODE     X      NODE    NODE
NIC4    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     NODE    NODE     X      PIX
NIC5    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    SYS     SYS     NODE    NODE    PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_1
  NIC1: mlx5_2
  NIC2: mlx5_3
  NIC3: mlx5_4
  NIC4: mlx5_5
  NIC5: mlx5_6


Hypervisor vendor: KVM
ulimit soft: 65536

Start script for P:

export SGL_ENABLE_JIT_DEEPGEMM=1
export NCCL_DEBUG_ENABLE=false
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_IB_HCA=mlx5
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export MC_FORCE_TCP=1

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_HOME/lib
export NCCL_NET_MERGE_LEVEL="LOC"

export MC_TE_METRIC=true
export NCCL_DEBUG_ENABLE=true
export NCCL_DEBUG=DEBUG 

nohup python -m sglang.launch_server --model-path /data/data90/hf-models/DeepSeek-R1 --served-model-name DeepSeek-R1 \
    --disaggregation-mode prefill --trust-remote-code --enable-nccl-nvls \
    --tp 8 --mem-fraction-static 0.88 --host 10.148.4.114 --port 30000 \
    --max-prefill-tokens 16384 --chunked-prefill-size 16384 \
    --disaggregation-decode-tp 16 --disaggregation-decode-dp 16 \
    --watchdog-timeout 1000000 \
    --page-size 1024 --disable-radix-cache \
    --log-level debug --pd-sgl-router-url 10.148.4.114:8080 \
    --tool-call-parser deepseekv3 > output.log 2>&1 &

D node0:

export SGL_ENABLE_JIT_DEEPGEMM=1
export NCCL_DEBUG_ENABLE=false
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_IB_HCA=mlx5
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export MC_FORCE_TCP=1

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_HOME/lib
export NCCL_NET_MERGE_LEVEL="LOC"

export MC_TE_METRIC=true
export NCCL_DEBUG_ENABLE=true
export NCCL_DEBUG=DEBUG 

nohup python -m sglang.launch_server --model-path /data/data90/hf-models/DeepSeek-R1 --served-model-name DeepSeek-R1 \
    --host 10.148.12.136 --port 30001 --trust-remote-code --enable-nccl-nvls \
    --dist-init-addr 10.148.12.136:29500 --nnodes 2 --node-rank 0 \
    --tp-size 16 --mem-fraction-static 0.75 \
    --disaggregation-mode decode \
    --enable-dp-lm-head \
    --enable-dp-attention --dp 16 \
    --enable-deepep-moe --moe-dense-tp-size 1 --enable-two-batch-overlap --deepep-mode low_latency \
    --enable-eplb --ep-dispatch-algorithm dynamic --eplb-algorithm auto \
    --watchdog-timeout 1000000 \
    --chunked-prefill-size 16384 \
    --page-size 1024 --disable-radix-cache \
    --log-level debug --pd-sgl-router-url 10.148.4.114:8080 \
    --tool-call-parser deepseekv3 > output.log 2>&1 &

D node1:

export SGL_ENABLE_JIT_DEEPGEMM=1
export NCCL_DEBUG_ENABLE=false
export NCCL_IB_DISABLE=0
export NCCL_P2P_DISABLE=0
export NCCL_IB_HCA=mlx5
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
export SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT=600
export MC_FORCE_TCP=1

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_HOME/lib
export NCCL_NET_MERGE_LEVEL="LOC"

export MC_TE_METRIC=true
export NCCL_DEBUG_ENABLE=true
export NCCL_DEBUG=DEBUG 

nohup python -m sglang.launch_server --model-path /data/data90/hf-models/DeepSeek-R1 --served-model-name DeepSeek-R1 \
    --host 10.148.4.216 --port 30001 --trust-remote-code --enable-nccl-nvls \
    --dist-init-addr 10.148.12.136:29500 --nnodes 2 --node-rank 1 \
    --tp-size 16 --mem-fraction-static 0.75 \
    --disaggregation-mode decode \
    --enable-dp-lm-head \
    --enable-dp-attention --dp 16 \
    --enable-deepep-moe --moe-dense-tp-size 1 --enable-two-batch-overlap --deepep-mode low_latency \
    --enable-eplb --ep-dispatch-algorithm dynamic --eplb-algorithm auto \
    --watchdog-timeout 1000000 \
    --chunked-prefill-size 16384 \
    --page-size 1024 --disable-radix-cache \
    --log-level debug \
    --tool-call-parser deepseekv3 > output.log 2>&1 &

Here are the logs.
tcp_decode0.log
tcp_decode1.log
tcp_prefill.log

Should I just upgrade the sglang version?

@stmatengss
Collaborator

> I try to use tcp as mooncake backend to do pd disaggregation with sglang, but I still got some errors. Here is my environment, start script and logs.
>
> […]
>
> Should I just upgrade sglang version?

There are no code changes for this feature. The error messages are below (from batch_transfer_sync):

[2025-08-08 15:45:33] 10.148.4.216 [08/Aug/2025:15:45:33 +0800] "GET /health HTTP/1.1" 200 155 "-" "python-requests/2.32.4"
Fatal Python error: Segmentation fault

Thread 0x00007f4d75ffd640 (most recent call first):
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/disaggregation/mooncake/transfer_engine.py", line 102 in batch_transfer_sync
  File "/opt/app/python3.12/lib/python3.12/site-packages/sglang/srt/disaggregation/mooncake/conn.py", line 272 in process_layer
  File "/opt/app/python3.12/lib/python3.12/concurrent/futures/thread.py", line 59 in run
  File "/opt/app/python3.12/lib/python3.12/concurrent/futures/thread.py", line 93 in _worker
  File "/opt/app/python3.12/lib/python3.12/threading.py", line 1012 in run
  File "/opt/app/python3.12/lib/python3.12/threading.py", line 1075 in _bootstrap_inner
  File "/opt/app/python3.12/lib/python3.12/threading.py", line 1032 in _bootstrap

Could you take a look? @alogfans

@stmatengss
Collaborator

Did you export MC_FORCE_TCP before running sglang? @purp1e-ace

@purp1e-ace

> Did you export MC_FORCE_TCP before running sglang? @purp1e-ace

Yes, this error only occurs after I export MC_FORCE_TCP=1.

@stmatengss
Collaborator

> Did you export MC_FORCE_TCP before running sglang? @purp1e-ace
>
> Yes, this error only occurs after I export MC_FORCE_TCP=1

You can generate more mooncake logs by setting export MC_LOG_LEVEL=TRACE. On the other hand, since this is a segmentation fault, could you check the system log using dmesg?
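For anyone reproducing this, a minimal sketch of the debugging setup suggested in this thread (MC_FORCE_TCP and MC_LOG_LEVEL are the variable names used above; the core-dump line is an additional general suggestion, not something prescribed in this thread):

```shell
# Force Mooncake's TCP transport and enable its most verbose logging
# before launching sglang, as discussed in this thread.
export MC_FORCE_TCP=1
export MC_LOG_LEVEL=TRACE

# Allow core dumps so a segfault leaves something to inspect
# even when dmesg is unavailable (e.g. inside a k8s pod).
ulimit -c unlimited
```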

@purp1e-ace

> Did you export MC_FORCE_TCP before running sglang? @purp1e-ace
>
> Yes, this error only occurs after I export MC_FORCE_TCP=1
>
> You can generate more mooncake logs by setting export MC_LOG_LEVEL=TRACE. On the other hand, I see a segmentation fault; could you check the system log using dmesg?

The logs remain the same after I export MC_LOG_LEVEL=TRACE. Also, because I'm working in a k8s pod, I cannot read the kernel buffer. Do you have any other advice? Is there anything else I can do?

@wl-lei

wl-lei commented Oct 16, 2025

> Did you export MC_FORCE_TCP before running sglang? @purp1e-ace
>
> Yes, this error only occurs after I export MC_FORCE_TCP=1
>
> You can generate more mooncake logs by setting export MC_LOG_LEVEL=TRACE. On the other hand, I see a segmentation fault; could you check the system log using dmesg?
>
> The logs remain the same after I export MC_LOG_LEVEL=TRACE. Also, because I'm working in a k8s pod, I cannot read the kernel buffer. Do you have any other advice?

Did you solve the problem? I have the same issue.

wanyue-wy pushed a commit to wanyue-wy/Mooncake that referenced this pull request Dec 14, 2025
* Re-implement vram support

* Test logging

* Remove CUDA logging line

* Add comments

* Avoid memcpy if addr is dram
JasonZhang517 pushed a commit to JasonZhang517/Mooncake that referenced this pull request Feb 9, 2026
* Re-implement vram support

* Test logging

* Remove CUDA logging line

* Add comments

* Avoid memcpy if addr is dram


7 participants