Nixl async transfer #23967
Conversation
@ishandhanani @iyastreb could you please help with review? It's the same PR as #20680, but with conflicts resolved (and fixing the P>D issue from main).
```python
except _NIXL_TRANSPORT_ERRORS as e:
    logger.warning(
        f"KVSender check_xfer_state failed for room {self.bootstrap_room}: {e}"
    )
    self._send_failed = True
    self._send_error = e
    return KVPoll.Failed  # type: ignore
if all(x == "DONE" for x in states):
    if (
        self._transfer_start_time is not None
        and self._transfer_metric.transfer_latency_s is None
    ):
        self._transfer_metric.transfer_latency_s = (
            time.perf_counter() - self._transfer_start_time
        )
    return KVPoll.Success  # type: ignore
if any(x == "ERR" for x in states):
    self._send_failed = True
    self._send_error = RuntimeError(
        f"NIXL transfer error for room {self.bootstrap_room}"
    )
```
ShangmingCai left a comment:

Overall LGTM, but why remove `_NIXL_TRANSPORT_ERRORS`? I remember this was just added a short while ago.

Reply:

It's a good point. I have now changed the code to catch exceptions in the worker thread, pass them to the main thread, and raise from there, so that we can still detect `_NIXL_TRANSPORT_ERRORS` as before. The worker thread still has to catch all exceptions; otherwise it may die on other errors, which may cause hangs.
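A minimal sketch of that pattern, using hypothetical names (`TransferWorker`, a `tasks` queue, an `errors` queue) rather than the PR's actual code:

```python
import queue
import threading

class TransferWorker:
    """Illustrative only: worker captures exceptions, main thread re-raises."""

    def __init__(self):
        self.tasks: queue.Queue = queue.Queue()
        self.errors: queue.Queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Catch everything: an unhandled exception would kill the daemon
        # thread, and callers waiting on its results would then hang.
        while True:
            task = self.tasks.get()
            try:
                task()  # perform the chunk transfer
            except Exception as e:
                self.errors.put(e)  # hand the error to the main thread

    def poll(self):
        # Main thread: re-raise the worker's exception here, so callers can
        # still catch _NIXL_TRANSPORT_ERRORS as before.
        try:
            err = self.errors.get_nowait()
        except queue.Empty:
            return
        raise err
```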
Taken over from #20680
Motivation
This PR improves the performance of `NixlKVManager` by making KV transfer asynchronous and multi-threaded on the prefill node. Previously, `add_transfer_request` performed each chunk transfer synchronously and the caller (`NixlKVSender`) had to track and poll all transfer handles. With many decode instances and chunked transfers, this caused the prefill scheduler to block on transfer completion and limited throughput. This change aligns NIXL with the queue-based, multi-worker transfer design.
Performance
We ran Qwen3-32B PD disaggregation with NIXL and observed a clear improvement in transfer latency via NIXL telemetry.
Async multi-worker transfer removes the synchronous bottleneck on the prefill path: chunks are processed in parallel by worker threads, and decode instances are sharded across queues for better overlap, which explains the lower mean and significantly improved tail (P95/P99) latency.
Modifications
Async transfer with queue + worker pool (PREFILL mode)
- `FastQueue` instances (count controlled by `SGLANG_DISAGGREGATION_QUEUE_SIZE`) and a `ThreadPoolExecutor` per queue (total worker count from `SGLANG_DISAGGREGATION_THREAD_POOL_SIZE`).
- A `TransferKVChunk` dataclass and daemon `transfer_worker` threads that consume chunks from the queues and execute `send_kvcache` / `send_kvcache_slice`, `maybe_send_extra`, and `send_aux` in the worker.
- The worker count defaults to `min(max(4, (0.5 * cpu_count) // 8), 12)` when the env var is not set; the queue count defaults to 4. (A sketch of this wiring follows the list.)
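A minimal sketch of this wiring, assuming simplified stand-ins: a plain `queue.Queue` for `FastQueue`, a trivial `TransferKVChunk`, and an elided send step. The env-var names match the list above; everything else is illustrative.

```python
import os
import queue
import threading
from dataclasses import dataclass

@dataclass
class TransferKVChunk:  # stand-in for the PR's dataclass
    bootstrap_room: int
    is_last: bool

num_queues = int(os.getenv("SGLANG_DISAGGREGATION_QUEUE_SIZE", "4"))
cpu_count = os.cpu_count() or 1
default_workers = int(min(max(4, (0.5 * cpu_count) // 8), 12))
num_workers = int(
    os.getenv("SGLANG_DISAGGREGATION_THREAD_POOL_SIZE", str(default_workers))
)

# Plain queue.Queue stands in for FastQueue here.
transfer_queues = [queue.Queue() for _ in range(num_queues)]

def transfer_worker(q: queue.Queue) -> None:
    # Consume chunks and run the sends (send_kvcache / send_kvcache_slice,
    # maybe_send_extra, send_aux); the PR drives these via a
    # ThreadPoolExecutor per queue.
    while True:
        chunk = q.get()
        ...  # perform the transfer for `chunk`

# Daemon threads, so a stuck worker never blocks interpreter shutdown.
for i in range(num_workers):
    q = transfer_queues[i % num_queues]
    threading.Thread(target=transfer_worker, args=(q,), daemon=True).start()
```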
add_transfer_requestadd_transfer_requestno longer performs transfer inline; it enqueues aTransferKVChunktotransfer_queues[bootstrap_room % len(transfer_queues)]and returnsNone.request_status(e.g.Transferring,Success,Failed), so the sender no longer needs to hold or poll transfer handles.NixlKVSender simplifications
NixlKVSender simplifications

- Removed `xfer_handles`; `poll()` now relies on `kv_mgr.check_status(bootstrap_room)` only.
- Added `clear()` to remove `bootstrap_room` from `request_status` when appropriate.
- The sender no longer updates `request_status` itself; the worker clears `transfer_infos` and sets the status to `Success` when the last chunk is done. (A sketch follows this list.)
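Under those assumptions, the simplified sender reduces to roughly the following (pairs with the manager sketch above; illustrative names only):

```python
class NixlKVSenderSketch:
    """Illustrative only: no xfer_handles, no manual status updates."""

    def __init__(self, kv_mgr, bootstrap_room):
        self.kv_mgr = kv_mgr
        self.bootstrap_room = bootstrap_room

    def poll(self):
        # Delegate entirely to the manager; the worker thread flips the
        # status to Success after the last chunk.
        return self.kv_mgr.check_status(self.bootstrap_room)

    def clear(self):
        # Drop this room's entry from request_status when appropriate.
        self.kv_mgr.request_status.pop(self.bootstrap_room, None)
```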
Scheduler handling of Bootstrapping

- In `prefill.py`, requests in `KVPoll.Bootstrapping` are now treated as undone (together with `WaitingForInput` and `Transferring`), so the scheduler does not consider them complete before transfer progress (see the sketch below).
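The scheduler-side check amounts to a predicate like this (hypothetical helper with string stand-ins for `KVPoll` values; the real logic lives in `prefill.py`):

```python
UNDONE_STATES = ("Bootstrapping", "WaitingForInput", "Transferring")

def is_undone(poll_state: str) -> bool:
    # Bootstrapping now counts as in-flight, so a request is never
    # considered complete before its transfer has made progress.
    return poll_state in UNDONE_STATES
```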
Testing

- `python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --host 127.0.0.1 --port 8000`: Accuracy: 0.945 with Qwen/Qwen3-8B
- `TestDisaggregationAccuracy` passes with NIXL (score 0.76, throughput 3949 token/s)

Checklist
- `test_disaggregation_basic.py`: 7 tests passed.