support non-disturbing remote-instance-weight-loader #13125
Conversation
@amysaq2023 Hi, can you share some perf data about …
@amysaq2023 @stmatengss Hi, in our tests (2 nodes of H200*8 with DeepSeek-V3, 645 GB), the time spent in …
Performance was tested with 8*H20 on loading …
Force-pushed from c003071 to 6155a38
Nice PR Anqi! Sorry for waiting so long.
zhaochenyang20 left a comment
You are my hero! As solid as always. Sorry for waiting so long for the review. I shall also connect with the Mooncake team and Microsoft AI for review.
@amysaq2023 thanks for your info. Could you advise how to configure this?

Hardware: 2 nodes of H200*8, with 8 * 400Gb/s NICs (mlx5_1~mlx5_8) and 1 * 100Gb/s NIC (mlx5_0)
Software: mooncake-transfer-engine 0.3.7.post2
Base env: export MOONCAKE_PROTOCOL="rdma"

We tested it with 3 settings, with perf as below. Looking forward to your reply, thanks.

1. MOONCAKE_DEVICE unset: register_memory cost: 70.6599s, batch_transfer_sync_read: 6.3640s
2. export MOONCAKE_DEVICE='mlx5_0': register_memory cost: 8.7188s, batch_transfer_sync_read: 48.9649s
3. export MOONCAKE_DEVICE='mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8': register_memory cost: 60.5853s, batch_transfer_sync_read: 6.1010s

nvidia-smi topo -m output is as below:

GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PHB PIX PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS PHB PHB PIX PHB SYS SYS SYS SYS 0-89 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS PHB PHB PHB PIX SYS SYS SYS SYS 0-89 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX PHB PHB PHB 90-179 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS PHB PIX PHB PHB 90-179 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS PHB PHB PIX PHB 90-179 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS PHB PHB PHB PIX 90-179 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB SYS SYS SYS SYS
NIC2 PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB SYS SYS SYS SYS
NIC3 PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB SYS SYS SYS SYS
NIC4 PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 SYS SYS SYS SYS PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 SYS SYS SYS SYS PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 SYS SYS SYS SYS PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
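For reference, the three MOONCAKE_DEVICE settings compared above map to the following environment setup. This is only a sketch: the variable names and values come from the comment, while the surrounding Python is illustrative and must run before the TransferEngine initializes.

```python
import os

os.environ["MOONCAKE_PROTOCOL"] = "rdma"

# Setting 1: leave MOONCAKE_DEVICE unset, so all NICs are registered
# (register_memory ~70.7s, batch_transfer_sync_read ~6.4s).
os.environ.pop("MOONCAKE_DEVICE", None)

# Setting 2: only the 100Gb/s NIC (fast registration ~8.7s, slow reads ~49.0s).
# os.environ["MOONCAKE_DEVICE"] = "mlx5_0"

# Setting 3: the eight 400Gb/s NICs (registration ~60.6s, fast reads ~6.1s).
# os.environ["MOONCAKE_DEVICE"] = ",".join(f"mlx5_{i}" for i in range(1, 9))
```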
Super cool! Just for my understanding, the …
Force-pushed from 6155a38 to 88081c6
Could you please provide more information about how you set up the test, such as whether the SGLang instance runs in a container and which SGLang image is used? Thanks.
@@ -0,0 +1,55 @@
# R-Fork

R-Fork (Tensor Remote Fork) provides a novel weight-loading methodology that leverages an efficient inter-node GPU-to-GPU data transfer path to load tensors from a running SGLang instance into a new instance with zero copy. It can significantly reduce SGLang instance boot-up time, cutting model-weight loading from several minutes to mere seconds.
😂 I find that the docs are almost the same as the blog post. So we can just leave a basic introduction here, link to our blog, say what R-Fork can do with detailed parameter explanations, and give usage instructions.
Force-pushed from f2afd8c to 2fa8561
/rerun-failed-ci

/rerun-failed-ci
…ader

This commit reduces the time spent registering memory regions when using TransferEngine as the backend for loading weights from a remote instance, by merging contiguous memory blocks into one memory region.

Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
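The merging optimization described in this commit could look roughly like the following sketch; the function and variable names are illustrative, not taken from the PR:

```python
def merge_contiguous_blocks(blocks):
    """Coalesce adjacent (address, size) blocks so each merged run needs
    only one RDMA memory-region registration instead of one per tensor."""
    merged = []
    for ptr, size in sorted(blocks):
        if merged and merged[-1][0] + merged[-1][1] == ptr:
            # This block starts exactly where the previous run ends: extend it.
            merged[-1][1] += size
        else:
            merged.append([ptr, size])
    return [tuple(run) for run in merged]

# Example: three tensors at 0x1000 (0x100 bytes), 0x1100 (0x200), 0x2000 (0x80)
# collapse into two regions: (0x1000, 0x300) and (0x2000, 0x80).
assert merge_contiguous_blocks([(0x1000, 0x100), (0x1100, 0x200), (0x2000, 0x80)]) == [
    (0x1000, 0x300),
    (0x2000, 0x80),
]
```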
Force-pushed from 2fa8561 to 0b450ad
/rerun-failed-ci
    pipe_writer.send(
        {
            "status": "ready",
            "max_total_num_tokens": scheduler.max_total_num_tokens,
            "max_req_input_len": scheduler.max_req_input_len,
            "tp_rank": tp_rank,
            "remote_instance_transfer_engine_session_id": remote_instance_transfer_engine_session_id,
            "remote_instance_transfer_engine_weights_info_dict": remote_instance_transfer_engine_weights_info_dict,
        }
    )
else:
    pipe_writer.send(
        {
            "status": "ready",
            "max_total_num_tokens": scheduler.max_total_num_tokens,
            "max_req_input_len": scheduler.max_req_input_len,
        }
    )
Do not duplicate the code here. These lines are duplicated:

    "status": "ready",
    "max_total_num_tokens": scheduler.max_total_num_tokens,
    "max_req_input_len": scheduler.max_req_input_len,

Do something like this:

if ...:
    result_dict.update({
        "remote_instance_transfer_engine_session_id": remote_instance_transfer_engine_session_id,
        "remote_instance_transfer_engine_weights_info_dict": remote_instance_transfer_engine_weights_info_dict,
    })
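Spelled out, that suggestion might look like the following; the gating condition is a placeholder for whatever flag actually selects the transfer-engine path, and the other names come from the diff above:

```python
result_dict = {
    "status": "ready",
    "max_total_num_tokens": scheduler.max_total_num_tokens,
    "max_req_input_len": scheduler.max_req_input_len,
}
if use_remote_instance_transfer_engine:  # placeholder for the actual condition
    result_dict.update(
        {
            "tp_rank": tp_rank,
            "remote_instance_transfer_engine_session_id": remote_instance_transfer_engine_session_id,
            "remote_instance_transfer_engine_weights_info_dict": remote_instance_transfer_engine_weights_info_dict,
        }
    )
pipe_writer.send(result_dict)
```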
    template_manager,
    scheduler_info,
    port_args,
    remote_instance_transfer_engine_info,
This is redundant. All data in remote_instance_transfer_engine_info is already in scheduler_info.
Can you revert the changes here?
i.e., do not change the interface of _launch_subprocesses
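A minimal sketch of what the reviewer is asking for, assuming scheduler_info is the "ready" dict sent through pipe_writer above (the key names mirror that payload, everything else is illustrative):

```python
# Keep the _launch_subprocesses signature unchanged and pull the
# transfer-engine fields back out of scheduler_info, where they already live.
session_id = scheduler_info.get("remote_instance_transfer_engine_session_id")
weights_info = scheduler_info.get(
    "remote_instance_transfer_engine_weights_info_dict"
)
```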
@amysaq2023 It conflicts with some refactors here: #14869. It would be great if you could keep the interface of …
Sure. Working on refactoring this.
This PR broke the B200 CI. We need to be more careful during review and before merging PRs; the main branch has been breaking too frequently lately.
ref #14958
This commit addresses comments in sgl-project#13125 (comment), sgl-project#13125 (comment)

Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
PR addressing the above comments: #14971
cause: …
Motivation
In #8215, SGLang added a new load format, remote_instance, which allows a new instance to load weights from another running instance. This approach can greatly improve weight-loading time during instance initialization. However, since it uses torch.distributed with NCCL as the backend, it disturbs ongoing inference requests: torch.distributed always launches CUDA kernels to transfer the weight tensors.
We propose another backend option, TransferEngine, which does not disturb any GPU workload while still using RDMA to transfer the weights.
Modifications
We initialize one TransferEngine for each ModelRunner and register its weights with the RDMA channel during initialization.
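A hedged sketch of this step; the engine's register_memory signature is an assumption about the Mooncake binding, not its verified API, and the function name is illustrative:

```python
import torch

def register_weights_for_rdma(engine, model: torch.nn.Module) -> dict:
    """Register every weight tensor with the transfer engine so a remote
    instance can RDMA-read it without launching any CUDA kernels."""
    weights_info = {}
    for name, param in model.named_parameters():
        ptr = param.data_ptr()
        nbytes = param.numel() * param.element_size()
        engine.register_memory(ptr, nbytes)  # assumed interface
        weights_info[name] = (ptr, nbytes)
    return weights_info
```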
When initializing a new instance that wants to use the remote_instance load format with the TransferEngine backend: …

How to use:

python -m sglang.launch_server [args] \
  --load-format remote_instance \
  --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
  --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
  --remote-instance-weight-loader-backend "transfer_engine"

Checklist