
[Scheduler] Defer prefill input_ids H2D to forward stream, unify resolve via future_map #25945

Merged
hnyls2002 merged 27 commits into main from lsyin/prefill-h2d-on-fwd-stream
May 30, 2026

hnyls2002 commented May 21, 2026

Collaborator

Summary

  • Defer the prefill input_ids H2D copy from schedule_stream to forward_stream, and unify input_ids materialization across overlap and non-overlap modes: input_ids is left as None during scheduling and resolved at forward entry through the always-on FutureMap relay.

Mechanism

  • prepare_for_extend stages prefill prompt tokens as pinned CPU (prefill_input_ids_cpu) instead of building the GPU tensor.
  • decode and mixed-chunk batches relay the last sampled token through FutureMap.output_tokens_buf (stash on forward, gather on the next iteration).
  • resolve_forward_inputs materializes input_ids at forward entry on the forward stream: H2D for prefill, gather for decode, cat for mixed.
  • forward_stream_ctx wraps the overlap isolation, with resolve_forward_inputs sitting between the stream barrier and isolation.
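
The three resolve cases above can be sketched with a toy model. This is not the sglang implementation: plain Python lists stand in for tensors, a list copy stands in for the H2D transfer, and `ToyFutureMap`, `ToyBatch`, the `mode` strings, and `decode_req_pool_indices` are hypothetical names introduced for illustration. The prefill-then-decode order in the mixed case is also an assumption.

```python
from typing import List, Optional


class ToyFutureMap:
    """Toy stand-in for the relay; a list plays the role of output_tokens_buf."""

    def __init__(self, size: int) -> None:
        self.output_tokens_buf: List[int] = [0] * size

    def stash(self, req_pool_indices: List[int], next_token_ids: List[int]) -> None:
        # After a forward pass: record each request's last sampled token in its slot.
        for slot, tok in zip(req_pool_indices, next_token_ids):
            self.output_tokens_buf[slot] = tok

    def gather(self, req_pool_indices: List[int]) -> List[int]:
        # On the next iteration: read the tokens back in batch order.
        return [self.output_tokens_buf[slot] for slot in req_pool_indices]


class ToyBatch:
    """Hypothetical batch; input_ids stays None until forward entry."""

    def __init__(
        self,
        mode: str,  # "prefill", "decode", or "mixed" (assumed labels)
        prefill_input_ids_cpu: Optional[List[int]] = None,
        decode_req_pool_indices: Optional[List[int]] = None,
    ) -> None:
        self.mode = mode
        self.prefill_input_ids_cpu = prefill_input_ids_cpu or []
        self.decode_req_pool_indices = decode_req_pool_indices or []
        self.input_ids: Optional[List[int]] = None


def resolve_forward_inputs(batch: ToyBatch, relay: ToyFutureMap) -> None:
    """Materialize input_ids at forward entry (toy version of the three cases)."""
    if batch.mode == "prefill":
        # The real code would issue a host-to-device copy here; a list copy stands in.
        batch.input_ids = list(batch.prefill_input_ids_cpu)
    elif batch.mode == "decode":
        batch.input_ids = relay.gather(batch.decode_req_pool_indices)
    elif batch.mode == "mixed":
        # Assumed order: prefill chunk first, then one relayed token per decode request.
        batch.input_ids = list(batch.prefill_input_ids_cpu) + relay.gather(
            batch.decode_req_pool_indices
        )
    else:
        raise ValueError(f"unknown mode: {batch.mode}")
```

In this sketch the scheduler never touches `input_ids`; it only fills `prefill_input_ids_cpu` or the request slots, which mirrors the "left None during scheduling" behavior described in the summary.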

Coverage (all forward modes routed through the relay)

  • PP rank-0 mixed-chunk: stash pp_outputs next_token_ids for next-iter gather
  • embedding / reward: resolve_forward_inputs before forward
  • encoder-decoder extend: rebuild prefill_input_ids_cpu after encoder stripping
  • hisparse rejoin and disagg PREBUILT (non-spec)
  • merge_batch: tolerate None input_ids on either side (fall back to relay gather)
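
One possible reading of the merge_batch item, as a minimal sketch: if either side is still deferred, the merged value stays unset so that the forward-entry resolve step rebuilds it from the relay. `merge_input_ids` is a hypothetical helper, not a function from the PR, and lists stand in for tensors.

```python
from typing import List, Optional


def merge_input_ids(
    left: Optional[List[int]], right: Optional[List[int]]
) -> Optional[List[int]]:
    """Hypothetical helper: concatenate only when both sides are materialized."""
    if left is None or right is None:
        # At least one side is still deferred. Leave the merged value unset;
        # the resolve step at forward entry is expected to gather it later.
        return None
    return left + right
```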

Spec / non-overlap

  • spec_v1 (non-overlap spec): worker shape doesn't match req_pool_indices, so keep the direct batch.input_ids assignment and skip the relay stash (dispatch by payload type).

Fixes

  • penalty path crashed reading input_ids=None — cumulate from Req.output_ids instead
  • spec_v2 isolation snapshot re-installed already-consumed staging
  • spec: read batch.device instead of batch.input_ids.device (may be None pre-worker)
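
The first and third fixes share one idea: code that runs before forward entry should not dereference `input_ids`. A minimal sketch, assuming toy `ToyReq` and `ToyBatch` classes and hypothetical helper names (none of these come from the PR); token counts stand in for whatever the penalty path accumulates.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyReq:
    output_ids: List[int] = field(default_factory=list)


@dataclass
class ToyBatch:
    reqs: List[ToyReq]
    device: str = "cpu"
    input_ids: Optional[List[int]] = None  # deferred until forward entry


def cumulate_output_token_counts(batch: ToyBatch) -> List[Dict[int, int]]:
    # Count from per-request output history; batch.input_ids is never read,
    # so a deferred (None) value cannot crash this path.
    return [dict(Counter(req.output_ids)) for req in batch.reqs]


def target_device(batch: ToyBatch) -> str:
    # Read the device from the batch itself, not from a tensor that may be None.
    return batch.device
```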

Cleanup

  • init forward_stream_ctx unconditionally (PP non-overlap also uses it)
  • split flatten_arrays_to_pinned_cpu helper; _gpu naming; read CI flag via envs

CI States

Latest PR Test (Base): 🚫 Run #26680708733
Latest PR Test (Extra): ✅ Run #26680708677

hnyls2002 requested a review from Qiaolin-Yu as a code owner May 22, 2026 11:17
…fwd-stream

# Conflicts:
#	python/sglang/srt/managers/schedule_batch.py
#	python/sglang/srt/managers/scheduler.py

hnyls2002 changed the title from "Defer prefill input_ids H2D to resolve_future" to "[Scheduler] Defer prefill input_ids H2D to forward stream, unify resolve via future_map" May 30, 2026
hnyls2002 merged commit 282c461 into main May 30, 2026
71 of 147 checks passed
hnyls2002 deleted the lsyin/prefill-h2d-on-fwd-stream branch May 30, 2026 09:58
arathi-hlab pushed a commit to arathi-hlab/sglang that referenced this pull request Jun 2, 2026
…H2D copy

- Add --extra-index-url https://pypi.org/simple/ to xpu.Dockerfile pip install
- Remove torchao from XPU Docker image (not needed/supported on XPU)
- Materialize deferred H2D input_ids copy in bench_one_batch extend() so
  bench bypassing the scheduler still works after sgl-project#25945
xjpang pushed a commit to xjpang/sglang that referenced this pull request Jun 2, 2026
arathi-hlab added a commit to arathi-hlab/sglang that referenced this pull request Jun 2, 2026
…H2D copy

- Add --extra-index-url https://pypi.org/simple/ to xpu.Dockerfile pip install
- Remove torchao from XPU Docker image (not needed/supported on XPU)
- Materialize deferred H2D input_ids copy in bench_one_batch extend() so
  bench bypassing the scheduler still works after sgl-project#25945
mqhc2020 pushed a commit to mqhc2020/sglang that referenced this pull request Jun 2, 2026
hanming-lu pushed a commit that referenced this pull request Jun 3, 2026
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026
jeynmann pushed a commit to jeynmann/sglang that referenced this pull request Jun 4, 2026
edwingao28 pushed a commit to edwingao28/sglang that referenced this pull request Jun 7, 2026
monkeyLoveding pushed a commit to monkeyLoveding/sglang_open that referenced this pull request Jun 9, 2026