Nixl async transfer #23967

Merged
ShangmingCai merged 7 commits into sgl-project:main from ovidiusm:nixl-async-transfer
May 7, 2026

Conversation

@ovidiusm
Contributor

@ovidiusm ovidiusm commented Apr 28, 2026

Taken over from #20680

Motivation

This PR improves the performance of NixlKVManager by making KV transfer asynchronous and multi-threaded on the prefill node. Previously, add_transfer_request performed each chunk transfer synchronously and the caller (NixlKVSender) had to track and poll all transfer handles. With many decode instances and chunked transfers, this caused the prefill scheduler to block on transfer completion and limited throughput. This change aligns NIXL with the queue-based, multi-worker transfer design.

Performance

We ran Qwen3-32B PD disaggregation with NIXL and observed a clear improvement in transfer latency via NIXL telemetry:

  • Mean transfer time: 162,225 μs → 41,225 μs (about 4× lower).
  • Distribution: Before, transfer times had high variance with many samples in the 250k–1.2M μs range and a long tail; after the change, the vast majority of samples sit in the 34k–42k μs band with much lower variance and no large outliers.

Async multi-worker transfer removes the synchronous bottleneck on the prefill path: chunks are processed in parallel by worker threads, and decode instances are sharded across queues for better overlap, which explains the lower mean and significantly improved tail (P95/P99) latency.
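
A minimal sketch of the sharding idea, not the PR's code: it assumes rooms are mapped to queues by `bootstrap_room % num_queues`, uses the standard-library `queue.Queue` as a stand-in for `FastQueue`, and runs a single worker thread per queue instead of a thread pool. The names `queue_index` and `run_demo` are hypothetical.

```python
import queue
import threading

NUM_QUEUES = 4  # assumed value; the PR reads the queue count from an env var


def queue_index(bootstrap_room: int, num_queues: int = NUM_QUEUES) -> int:
    """Route every chunk of one room to the same queue (modulo sharding)."""
    return bootstrap_room % num_queues


def run_demo(rooms_and_chunks):
    """Enqueue (room, chunk_id) pairs and let one worker per queue drain them."""
    queues = [queue.Queue() for _ in range(NUM_QUEUES)]
    processed = {}  # room -> chunk ids in the order the worker handled them
    lock = threading.Lock()

    def worker(q):
        while True:
            item = q.get()
            if item is None:  # sentinel: stop this worker
                break
            room, chunk_id = item
            with lock:
                processed.setdefault(room, []).append(chunk_id)

    threads = [threading.Thread(target=worker, args=(q,), daemon=True) for q in queues]
    for t in threads:
        t.start()
    for room, chunk_id in rooms_and_chunks:
        queues[queue_index(room)].put((room, chunk_id))
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return processed
```

In this simplified one-worker-per-queue setup, chunks of the same room are handled in submission order, while rooms that land on different queues can progress concurrently.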

Modifications

  1. Async transfer with queue + worker pool (PREFILL mode)

    • Introduced multiple FastQueue instances (count controlled by SGLANG_DISAGGREGATION_QUEUE_SIZE) and a ThreadPoolExecutor per queue (total worker count from SGLANG_DISAGGREGATION_THREAD_POOL_SIZE).
    • Added a TransferKVChunk dataclass and daemon transfer_worker threads that consume chunks from the queues and execute send_kvcache / send_kvcache_slice, maybe_send_extra, and send_aux in the worker.
    • Default thread pool size is min(max(4, (0.5 * cpu_count) // 8), 12) when the env var is not set; the queue count is taken from its env var (e.g. 4).
  2. Non-blocking add_transfer_request

    • add_transfer_request no longer performs the transfer inline; it enqueues a TransferKVChunk to transfer_queues[bootstrap_room % len(transfer_queues)] and returns None.
    • Workers update request_status (e.g. Transferring, Success, Failed), so the sender no longer needs to hold or poll transfer handles.
  3. NixlKVSender simplifications

    • Removed xfer_handles; poll() now relies on kv_mgr.check_status(bootstrap_room) only.
    • Added clear() to remove bootstrap_room from request_status when appropriate.
    • Last-chunk path no longer deletes request_status in the sender; the worker clears transfer_infos and sets status to Success when the last chunk is done.
  4. Scheduler handling of Bootstrapping

    • In prefill.py, requests in KVPoll.Bootstrapping are now treated as not yet done (together with WaitingForInput and Transferring), so the scheduler does not consider them complete before any transfer has progressed.
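
The enqueue-and-poll flow from the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `Status`, `Chunk`, and `ToyManager` are hypothetical stand-ins for `KVPoll`, `TransferKVChunk`, and `NixlKVManager`, `queue.Queue` replaces `FastQueue`, and the worker only updates status instead of sending data. `default_pool_size` mirrors the formula quoted in item 1.

```python
import queue
import threading
from dataclasses import dataclass
from enum import Enum


class Status(Enum):  # hypothetical stand-in for the KVPoll states
    WAITING = 0
    TRANSFERRING = 1
    SUCCESS = 2


def default_pool_size(cpu_count: int) -> int:
    """Formula quoted in the PR description, clamped to the range [4, 12]."""
    return int(min(max(4, (0.5 * cpu_count) // 8), 12))


@dataclass
class Chunk:  # hypothetical; far smaller than the PR's TransferKVChunk
    room: int
    is_last: bool


class ToyManager:
    def __init__(self, num_queues: int = 2):
        self.request_status = {}
        self.queues = [queue.Queue() for _ in range(num_queues)]
        for q in self.queues:
            threading.Thread(target=self._worker, args=(q,), daemon=True).start()

    def add_transfer_request(self, chunk: Chunk) -> None:
        """Non-blocking: record the room, enqueue the chunk, return None."""
        self.request_status.setdefault(chunk.room, Status.WAITING)
        self.queues[chunk.room % len(self.queues)].put(chunk)

    def check_status(self, room: int) -> Status:
        """What a sender's poll() would read instead of transfer handles."""
        return self.request_status[room]

    def _worker(self, q: queue.Queue) -> None:
        while True:
            chunk = q.get()
            self.request_status[chunk.room] = Status.TRANSFERRING
            # A real worker would send the KV data for this chunk here.
            if chunk.is_last:
                self.request_status[chunk.room] = Status.SUCCESS
            q.task_done()
```

With this shape the caller never holds a transfer handle: it enqueues chunks and later reads `check_status(room)`, which is the simplification item 3 describes for the sender.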

Testing

  • python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --host 127.0.0.1 --port 8000: Accuracy: 0.945 with Qwen/Qwen3-8B
  • TestDisaggregationAccuracy passes with NIXL (score 0.76, throughput 3949 token/s)

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so. (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci)
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@ishandhanani
Collaborator

/tag-and-rerun-ci

@ovidiusm ovidiusm marked this pull request as ready for review April 30, 2026 22:52

@ovidiusm
Contributor Author

/tag-and-rerun-ci

@ovidiusm ovidiusm requested a review from wisclmy0611 as a code owner May 4, 2026 14:36
@github-actions github-actions Bot added the documentation, quant, and lora labels May 4, 2026
@ovidiusm ovidiusm force-pushed the nixl-async-transfer branch from a50ea89 to 616ca55 Compare May 4, 2026 14:42
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
@ovidiusm ovidiusm force-pushed the nixl-async-transfer branch from 616ca55 to 28b6504 Compare May 5, 2026 09:05
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
@ovidiusm
Contributor Author

ovidiusm commented May 5, 2026

@ishandhanani @iyastreb could you please help with review? It's the same PR as #20680 but with conflicts resolved (and fixing the P>D issue from main)

@ovidiusm
Contributor Author

ovidiusm commented May 5, 2026

FYI @usernamehaha2022

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
@ovidiusm ovidiusm force-pushed the nixl-async-transfer branch from bf1059d to 66f674d Compare May 6, 2026 17:17
@ovidiusm
Contributor Author

ovidiusm commented May 6, 2026

/tag-and-rerun-ci

@ishandhanani
Collaborator

/tag-and-rerun-ci

@ishandhanani
Collaborator

/rerun-failed-ci

Comment thread python/sglang/srt/disaggregation/prefill.py Outdated
Comment on lines -1197 to -1216
except _NIXL_TRANSPORT_ERRORS as e:
    logger.warning(
        f"KVSender check_xfer_state failed for room {self.bootstrap_room}: {e}"
    )
    self._send_failed = True
    self._send_error = e
    return KVPoll.Failed  # type: ignore
if all(x == "DONE" for x in states):
    if (
        self._transfer_start_time is not None
        and self._transfer_metric.transfer_latency_s is None
    ):
        self._transfer_metric.transfer_latency_s = (
            time.perf_counter() - self._transfer_start_time
        )
    return KVPoll.Success  # type: ignore
if any(x == "ERR" for x in states):
    self._send_failed = True
    self._send_error = RuntimeError(
        f"NIXL transfer error for room {self.bootstrap_room}"

Collaborator

CC: @cctry

Contributor Author

@ovidiusm ovidiusm May 7, 2026

It's a good point. I have now changed the code to catch exceptions in the worker thread, pass them to the main thread, and raise them from there, so that we can detect _NIXL_TRANSPORT_ERRORS as before. The worker thread still has to catch all exceptions; otherwise it may die on other errors, which could cause hangs.
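
A rough sketch of that pattern, assuming a single worker and a dictionary for handing errors across threads. `TransportError` is a hypothetical stand-in for the exception types in `_NIXL_TRANSPORT_ERRORS`, and `ToySender` with its string return values is illustrative only, not the code in this PR.

```python
import queue
import threading


class TransportError(Exception):
    """Hypothetical stand-in for the errors in _NIXL_TRANSPORT_ERRORS."""


class ToySender:
    def __init__(self):
        self.tasks = queue.Queue()
        self.errors = {}  # room -> exception captured in the worker
        self.done = set()  # rooms whose transfer finished cleanly
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            room, fn = self.tasks.get()
            try:
                fn()
                self.done.add(room)
            except Exception as e:  # catch everything so the worker never dies
                self.errors[room] = e
            finally:
                self.tasks.task_done()

    def submit(self, room, fn):
        self.tasks.put((room, fn))

    def poll(self, room):
        """Runs on the main thread and re-raises what the worker captured."""
        try:
            if room in self.errors:
                raise self.errors[room]
        except TransportError:
            return "Failed"  # transport errors map to a failed poll
        return "Success" if room in self.done else "Transferring"
```

Errors that are not transport errors propagate out of `poll()` on the caller's thread, while the worker keeps draining its queue.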

Collaborator

@ShangmingCai ShangmingCai left a comment

Overall LGTM, but why remove _NIXL_TRANSPORT_ERRORS? I remember this was just added a short while ago.

ovidiusm added 4 commits May 7, 2026 14:51
…after bootstrap)

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Collaborator

@ShangmingCai ShangmingCai left a comment

LGTM

@ShangmingCai ShangmingCai merged commit 811d138 into sgl-project:main May 7, 2026
56 of 64 checks passed
@ovidiusm ovidiusm deleted the nixl-async-transfer branch May 7, 2026 14:06
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

Labels

documentation, lora, quant, run-ci

4 participants