[Scheduler] Defer prefill input_ids H2D to forward stream, unify resolve via future_map #25945
Merged
…fwd-stream # Conflicts: # python/sglang/srt/managers/overlap_utils.py # python/sglang/srt/managers/scheduler.py
…ces at stash time
…h (shape mismatch)
…lls stale staging (BUG-3)
…may be None pre-worker)
…fwd-stream # Conflicts: # python/sglang/srt/managers/overlap_utils.py # python/sglang/srt/managers/schedule_batch.py
…fwd-stream # Conflicts: # python/sglang/srt/managers/schedule_batch.py # python/sglang/srt/managers/scheduler.py
This was referenced May 30, 2026
arathi-hlab
pushed a commit
to arathi-hlab/sglang
that referenced
this pull request
Jun 2, 2026
…H2D copy - Add --extra-index-url https://pypi.org/simple/ to xpu.Dockerfile pip install - Remove torchao from XPU Docker image (not needed/supported on XPU) - Materialize deferred H2D input_ids copy in bench_one_batch extend() so bench bypassing the scheduler still works after sgl-project#25945
xjpang
pushed a commit
to xjpang/sglang
that referenced
this pull request
Jun 2, 2026
…lve via future_map (sgl-project#25945)
arathi-hlab
added a commit
to arathi-hlab/sglang
that referenced
this pull request
Jun 2, 2026
…H2D copy - Add --extra-index-url https://pypi.org/simple/ to xpu.Dockerfile pip install - Remove torchao from XPU Docker image (not needed/supported on XPU) - Materialize deferred H2D input_ids copy in bench_one_batch extend() so bench bypassing the scheduler still works after sgl-project#25945
mqhc2020
pushed a commit
to mqhc2020/sglang
that referenced
this pull request
Jun 2, 2026
…lve via future_map (sgl-project#25945)
hanming-lu
pushed a commit
that referenced
this pull request
Jun 3, 2026
…lve via future_map (#25945)
alphabetc1
pushed a commit
to alphabetc1/sglang
that referenced
this pull request
Jun 4, 2026
…lve via future_map (sgl-project#25945)
jeynmann
pushed a commit
to jeynmann/sglang
that referenced
this pull request
Jun 4, 2026
…lve via future_map (sgl-project#25945)
edwingao28
pushed a commit
to edwingao28/sglang
that referenced
this pull request
Jun 7, 2026
…lve via future_map (sgl-project#25945)
monkeyLoveding
pushed a commit
to monkeyLoveding/sglang_open
that referenced
this pull request
Jun 9, 2026
…lve via future_map (sgl-project#25945)
Summary
Defer prefill `input_ids` H2D from schedule_stream to forward_stream, and unify `input_ids` materialization across overlap and non-overlap: `input_ids` is left `None` during scheduling and resolved at forward entry through the always-on `FutureMap` relay.

Mechanism
- `prepare_for_extend` stages prefill prompt tokens as pinned CPU (`prefill_input_ids_cpu`) instead of building the GPU tensor.
- `FutureMap.output_tokens_buf` (stash on forward, gather next iter).
- `resolve_forward_inputs` materializes `input_ids` at forward entry on the forward stream: H2D for prefill, gather for decode, `cat` for mixed.
- `forward_stream_ctx` wraps the overlap isolation, with `resolve_forward_inputs` sitting between the stream barrier and isolation.

Coverage (all forward modes routed through the relay)
- `pp_outputs` next_token_ids for next-iter gather
- `resolve_forward_inputs` before forward
- `prefill_input_ids_cpu` after encoder stripping
- `PREBUILT` (non-spec)
- `merge_batch`: tolerate `None` `input_ids` on either side (fall back to relay gather)

Spec / non-overlap
- `req_pool_indices`, so keep the direct `batch.input_ids` assignment and skip the relay stash (dispatch by payload type).

Fixes
- `input_ids=None`: cumulate from `Req.output_ids` instead
- `batch.device` instead of `batch.input_ids.device` (may be `None` pre-worker)

Cleanup
- `forward_stream_ctx` unconditionally (PP non-overlap also uses it)
- `flatten_arrays_to_pinned_cpu` helper; `_gpu` naming; read CI flag via `envs`

CI States
Latest PR Test (Base): 🚫 Run #26680708733
Latest PR Test (Extra): ✅ Run #26680708677
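
To make the dispatch described under Mechanism easier to follow, here is a minimal pure-Python sketch of the idea: `input_ids` stays `None` during scheduling and is filled in at forward entry, by copy for prefill, by gather from a relay buffer for decode, and by concatenation for mixed. This is not the code from this PR. `ToyBatch`, `ToyRelay`, `toy_resolve_forward_inputs`, and the `decode_slots` field are hypothetical names; plain lists stand in for pinned CPU and GPU tensors, so no streams or real H2D copies are involved; and the prefill-then-decode order in the mixed case is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a scheduled batch (not sglang's class)."""

    mode: str  # "prefill", "decode", or "mixed"
    # Prompt tokens staged on the host at scheduling time.
    prefill_input_ids_cpu: Optional[List[int]] = None
    # Relay slots whose last sampled token feeds the next decode step (assumed field).
    decode_slots: List[int] = field(default_factory=list)
    # Deliberately left None by the scheduler; resolved at forward entry.
    input_ids: Optional[List[int]] = None


class ToyRelay:
    """Hypothetical relay: stash tokens after a forward, gather them next iteration."""

    def __init__(self) -> None:
        self.output_tokens_buf: Dict[int, int] = {}

    def stash(self, slots: List[int], next_token_ids: List[int]) -> None:
        for slot, token in zip(slots, next_token_ids):
            self.output_tokens_buf[slot] = token

    def gather(self, slots: List[int]) -> List[int]:
        return [self.output_tokens_buf[slot] for slot in slots]


def toy_resolve_forward_inputs(batch: ToyBatch, relay: ToyRelay) -> List[int]:
    """Materialize batch.input_ids according to the forward mode."""
    # The list copy stands in for the host-to-device copy of staged prompt tokens.
    prefill_part = list(batch.prefill_input_ids_cpu or [])
    decode_part = relay.gather(batch.decode_slots)

    if batch.mode == "prefill":
        batch.input_ids = prefill_part
    elif batch.mode == "decode":
        batch.input_ids = decode_part
    elif batch.mode == "mixed":
        # Assumed ordering: prefill tokens first, then decode tokens.
        batch.input_ids = prefill_part + decode_part
    else:
        raise ValueError(f"unknown mode: {batch.mode}")
    return batch.input_ids


if __name__ == "__main__":
    relay = ToyRelay()
    relay.stash([0, 1], [101, 102])  # tokens sampled by the previous forward
    mixed = ToyBatch(mode="mixed", prefill_input_ids_cpu=[7, 8, 9], decode_slots=[1, 0])
    print(toy_resolve_forward_inputs(mixed, relay))  # [7, 8, 9, 102, 101]
```

In this toy model the scheduler never touches `input_ids`; callers that bypass the scheduler would need to call the resolve step themselves before running a forward pass, which mirrors the `bench_one_batch` follow-up mentioned in the referencing commits.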