support non-disturbing remote-instance-weight-loader #13125
Conversation
@amysaq2023 Hi, can you share some perf data about …
@amysaq2023 @stmatengss Hi, in our tests (2 nodes of H200*8 with DeepSeek-V3, 645 GB), the time spent in …
Performance was tested with 8*H20 on loading …
Force-pushed from c003071 to 6155a38
Nice PR Anqi! Sorry for waiting so long.
zhaochenyang20 left a comment
You are my hero! As solid as always. Sorry for waiting so long for the review. I shall also connect with the Mooncake team and Microsoft AI for review.
@amysaq2023 thanks for your info. Could you advise how to configure this?

Hardware: 2 nodes of H200*8, with 8 * 400Gb/s NICs (mlx5_1~mlx5_8) and 1 * 100Gb/s NIC (mlx5_0)
Software: mooncake-transfer-engine 0.3.7.post2
Base env: export MOONCAKE_PROTOCOL="rdma"

We tested it with 3 settings, with perf as below. Looking forward to your reply, thanks.

1. MOONCAKE_DEVICE unset: register_memory cost: 70.6599s, batch_transfer_sync_read: 6.3640s
2. export MOONCAKE_DEVICE='mlx5_0': register_memory cost: 8.7188s, batch_transfer_sync_read: 48.9649s
3. export MOONCAKE_DEVICE='mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8': register_memory cost: 60.5853s, batch_transfer_sync_read: 6.1010s

nvidia-smi topo -m output is as below:

GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PHB PIX PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS PHB PHB PIX PHB SYS SYS SYS SYS 0-89 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS PHB PHB PHB PIX SYS SYS SYS SYS 0-89 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX PHB PHB PHB 90-179 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS PHB PIX PHB PHB 90-179 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS PHB PHB PIX PHB 90-179 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS PHB PHB PHB PIX 90-179 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB SYS SYS SYS SYS
NIC2 PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB SYS SYS SYS SYS
NIC3 PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB SYS SYS SYS SYS
NIC4 PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 SYS SYS SYS SYS PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 SYS SYS SYS SYS PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 SYS SYS SYS SYS PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
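For reference, the three MOONCAKE_DEVICE settings compared above map to the following environment setup. This is only a sketch: the variable names and values come from the comment, while the surrounding Python is illustrative and must run before the TransferEngine initializes.

```python
import os

os.environ["MOONCAKE_PROTOCOL"] = "rdma"

# Setting 1: leave MOONCAKE_DEVICE unset, so all NICs are registered
# (register_memory ~70.7s, batch_transfer_sync_read ~6.4s).
os.environ.pop("MOONCAKE_DEVICE", None)

# Setting 2: only the 100Gb/s NIC (fast registration ~8.7s, slow reads ~49.0s).
# os.environ["MOONCAKE_DEVICE"] = "mlx5_0"

# Setting 3: the eight 400Gb/s NICs (registration ~60.6s, fast reads ~6.1s).
# os.environ["MOONCAKE_DEVICE"] = ",".join(f"mlx5_{i}" for i in range(1, 9))
```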
Super cool! Just for my understanding, the …
Force-pushed from 6155a38 to 88081c6
Could you please provide more information about how you set up the test, such as whether the SGLang instance runs in a container and which SGLang image is used? Thanks.
@@ -0,0 +1,55 @@
# R-Fork

R-Fork (Tensor Remote Fork) provides a novel weight-loading methodology that leverages an efficient inter-node GPU-to-GPU data transfer path to load tensors from a running SGLang instance into a new instance with zero copy. It can significantly reduce SGLang instance boot-up time, cutting model-weight loading from several minutes to mere seconds.
😂 I find that the docs are almost the same as the blog post. So we can just leave a basic introduction here, link to our blog, say what R-Fork can do with detailed parameter explanations, and give usage instructions.
Force-pushed from f2afd8c to 2fa8561
/rerun-failed-ci

/rerun-failed-ci
…ader

This commit reduces the time spent registering memory regions when using TransferEngine as the backend for loading weights from a remote instance, by merging contiguous memory blocks into one memory region.

Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
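The merging optimization described in this commit could look roughly like the following sketch; the function and variable names are illustrative, not taken from the PR:

```python
def merge_contiguous_blocks(blocks):
    """Coalesce adjacent (address, size) blocks so each merged run needs
    only one RDMA memory-region registration instead of one per tensor."""
    merged = []
    for ptr, size in sorted(blocks):
        if merged and merged[-1][0] + merged[-1][1] == ptr:
            # This block starts exactly where the previous run ends: extend it.
            merged[-1][1] += size
        else:
            merged.append([ptr, size])
    return [tuple(run) for run in merged]

# Example: three tensors at 0x1000 (0x100 bytes), 0x1100 (0x200), 0x2000 (0x80)
# collapse into two regions: (0x1000, 0x300) and (0x2000, 0x80).
assert merge_contiguous_blocks([(0x1000, 0x100), (0x1100, 0x200), (0x2000, 0x80)]) == [
    (0x1000, 0x300),
    (0x2000, 0x80),
]
```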
Force-pushed from 2fa8561 to 0b450ad
/rerun-failed-ci
    pipe_writer.send(
        {
            "status": "ready",
            "max_total_num_tokens": scheduler.max_total_num_tokens,
            "max_req_input_len": scheduler.max_req_input_len,
            "tp_rank": tp_rank,
            "remote_instance_transfer_engine_session_id": remote_instance_transfer_engine_session_id,
            "remote_instance_transfer_engine_weights_info_dict": remote_instance_transfer_engine_weights_info_dict,
        }
    )
else:
    pipe_writer.send(
        {
            "status": "ready",
            "max_total_num_tokens": scheduler.max_total_num_tokens,
            "max_req_input_len": scheduler.max_req_input_len,
        }
    )
Do not duplicate the code here. These lines are duplicated:

    "status": "ready",
    "max_total_num_tokens": scheduler.max_total_num_tokens,
    "max_req_input_len": scheduler.max_req_input_len,

Do something like this:

if ...:
    result_dict.update({
        "remote_instance_transfer_engine_session_id": remote_instance_transfer_engine_session_id,
        "remote_instance_transfer_engine_weights_info_dict": remote_instance_transfer_engine_weights_info_dict,
    })
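Spelled out, that suggestion might look like the following; the gating condition is a placeholder for whatever flag actually selects the transfer-engine path, and the other names come from the diff above:

```python
result_dict = {
    "status": "ready",
    "max_total_num_tokens": scheduler.max_total_num_tokens,
    "max_req_input_len": scheduler.max_req_input_len,
}
if use_remote_instance_transfer_engine:  # placeholder for the actual condition
    result_dict.update(
        {
            "tp_rank": tp_rank,
            "remote_instance_transfer_engine_session_id": remote_instance_transfer_engine_session_id,
            "remote_instance_transfer_engine_weights_info_dict": remote_instance_transfer_engine_weights_info_dict,
        }
    )
pipe_writer.send(result_dict)
```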
    template_manager,
    scheduler_info,
    port_args,
    remote_instance_transfer_engine_info,
This is redundant. All data in remote_instance_transfer_engine_info is already in scheduler_info.
Can you revert the changes here?
i.e., do not change the interface of _launch_subprocesses
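A minimal sketch of what the reviewer is asking for, assuming scheduler_info is the "ready" dict sent through pipe_writer above (the key names mirror that payload, everything else is illustrative):

```python
# Keep the _launch_subprocesses signature unchanged and pull the
# transfer-engine fields back out of scheduler_info, where they already live.
session_id = scheduler_info.get("remote_instance_transfer_engine_session_id")
weights_info = scheduler_info.get(
    "remote_instance_transfer_engine_weights_info_dict"
)
```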
@amysaq2023 It conflicts with some refactors here: #14869. It would be great if you could keep the interface of …
Sure. Working on refactoring this.
This PR broke the B200 CI. We need to be more careful during review and before merging PRs; the main branch has been breaking too frequently lately.
ref #14958
This commit addresses comments in sgl-project#13125 (comment), sgl-project#13125 (comment)

Signed-off-by: Anqi Shen <amy.saq@antgroup.com>
PR addressing the above comments: #14971
cause: …
Motivation
In #8215, SGLang added a new load format, remote_instance, which allows a new instance to load weights from another running instance. This approach can greatly improve weight-loading time during instance initialization. However, since it uses torch.distributed with NCCL as the backend, it disturbs ongoing inference requests: torch.distributed always launches CUDA kernels to transfer the weight tensors.
We propose another backend option, TransferEngine, which does not disturb any GPU workload while still using RDMA to transfer the weights.
Modifications
We initialize one TransferEngine for each ModelRunner and register its weights with the RDMA channel during initialization.
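A hedged sketch of this step; the engine's register_memory signature is an assumption about the Mooncake binding, not its verified API, and the function name is illustrative:

```python
import torch

def register_weights_for_rdma(engine, model: torch.nn.Module) -> dict:
    """Register every weight tensor with the transfer engine so a remote
    instance can RDMA-read it without launching any CUDA kernels."""
    weights_info = {}
    for name, param in model.named_parameters():
        ptr = param.data_ptr()
        nbytes = param.numel() * param.element_size()
        engine.register_memory(ptr, nbytes)  # assumed interface
        weights_info[name] = (ptr, nbytes)
    return weights_info
```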
When initializing a new instance that wants to use the remote_instance load format with the TransferEngine backend: …

How to use:

python -m sglang.launch_server [args] \
  --load-format remote_instance \
  --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
  --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
  --remote-instance-weight-loader-backend "transfer_engine"

Checklist