[PD+DP] Allow PrefillDelayer in disaggregated-prefill mode #23588
Merged
ByronHsu merged 1 commit into sgl-project:main on Apr 23, 2026
PrefillDelayer is currently gated by `disaggregation_mode == "null"`, so attempting to launch a disaggregated prefill engine with `--enable-prefill-delayer` aborts at scheduler init, even though the delayer's negotiation logic (per-iteration all-gather over DP ranks, "all/none/mixed" prefillable status) is exactly what disagg-prefill + DP-attention deployments need to fuse near-simultaneous prefills into a single forward pass instead of "1 real + N idle" passes.

This change:
- Drops the `disaggregation_mode == "null"` assert in `PrefillDelayer`.
- In the scheduler, only constructs the delayer when the engine actually schedules prefills (i.e. `null` or `prefill` modes); on a `decode` engine `--enable-prefill-delayer` is now logged and ignored instead of constructed-and-unused.

No behavior change for non-disagg deployments. Existing `prefill_delayer_*` flags work unchanged on the prefill engine of a disaggregated PD setup.

Made-with: Cursor
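The negotiation described above can be illustrated with a toy, single-process model. This is a minimal sketch under stated assumptions, not sglang's implementation: `ToyDelayer`, `negotiate`, and `should_prefill` are hypothetical names, the list of booleans stands in for the result of the all-gather over DP ranks, and the token-usage watermark is not modeled.

```python
from dataclasses import dataclass
from typing import List


def negotiate(prefillable: List[bool]) -> str:
    """Classify the gathered per-rank flags as 'all', 'none', or 'mixed'."""
    if all(prefillable):
        return "all"
    if not any(prefillable):
        return "none"
    return "mixed"


@dataclass
class ToyDelayer:
    """Hypothetical stand-in for the delayer (not sglang's actual class)."""

    max_delay_passes: int
    delayed_passes: int = 0

    def should_prefill(self, prefillable: List[bool]) -> bool:
        # `prefillable` plays the role of the all-gather result:
        # one flag per DP rank, True if that rank has a request to prefill.
        status = negotiate(prefillable)
        if status == "none":
            self.delayed_passes = 0
            return False
        if status == "mixed" and self.delayed_passes < self.max_delay_passes:
            # Some ranks are still empty: wait, but only a bounded number of passes.
            self.delayed_passes += 1
            return False
        # Either every rank is ready, or the delay budget is exhausted.
        self.delayed_passes = 0
        return True


delayer = ToyDelayer(max_delay_passes=3)
arrivals = [
    [True, False, False, False],  # only DP0 has a request so far
    [True, True, True, True],     # the other three requests have arrived
]
decisions = [delayer.should_prefill(flags) for flags in arrivals]
print(decisions)  # [False, True]
```

In this toy run the first pass is held back because only one rank is ready, and the second pass prefills all four ranks together, which is the fusion effect the PR is after.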
ch-wan approved these changes Apr 23, 2026
ByronHsu added a commit that referenced this pull request Apr 23, 2026
Motivation
`PrefillDelayer` is currently gated by `assert disaggregation_mode == "null"` in its constructor, so launching a disaggregated prefill engine with `--enable-prefill-delayer` aborts at scheduler init. However, the delayer's negotiation logic (a per-iteration `all_gather` over DP ranks of an `all`/`none`/`mixed` prefillable status, with bounded `max_delay_passes` waiting) is exactly the mechanism a disaggregated-prefill + `--enable-dp-attention` deployment needs to fuse near-simultaneous prefills into a single forward pass instead of paying for repeated "1 real + N idle" passes through the EP all-to-all.

In a `--tp 4 --dp 4 --ep 4 --enable-dp-attention --disaggregation-mode prefill` setup, four concurrent client requests routed round-robin to four DP ranks currently end up as two scheduler iterations.

With this change, `--enable-prefill-delayer` can be used on the prefill engine to delay iteration 1 by a few passes so all four reqs land in the same forward, halving the number of EP all-to-alls and improving prefill throughput at small/medium prompt lengths where the per-pass collective overhead is non-trivial relative to the actual MoE compute.

Modifications

- `python/sglang/srt/managers/prefill_delayer.py`: drop the `disaggregation_mode == "null"` assertion. The delayer's logic only depends on the per-iteration prefill scheduling path, which exists in both `null` and `prefill` modes.
- `python/sglang/srt/managers/scheduler.py`: when constructing the delayer, skip it (and log) on a `decode` engine. A decode engine has no prefill scheduling path, so `--enable-prefill-delayer` would be silently unused; this makes that explicit and avoids allocating the gather buffer / cpu_group state on decode workers.

No behavior change for non-disaggregated deployments. Existing `--prefill-delayer-max-delay-passes` / `--prefill-delayer-token-usage-low-watermark` flags work unchanged on the prefill engine of a disaggregated PD setup.

Accuracy Tests
N/A — this is a scheduler-side gating change. No model forward / kernel code is touched. Generation outputs are unaffected.
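To make the scope of the gating change concrete, here is a minimal sketch of the decision it describes. It is an illustration only: `maybe_build_delayer` and `ToyPrefillDelayer` are hypothetical names, the log message wording is an assumption, and the real scheduler code takes more arguments than shown here.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class ToyPrefillDelayer:
    """Hypothetical placeholder for the real delayer class."""


def maybe_build_delayer(
    enable_prefill_delayer: bool, disaggregation_mode: str
) -> Optional[ToyPrefillDelayer]:
    """Build a delayer only on engines that actually schedule prefills."""
    if not enable_prefill_delayer:
        return None
    if disaggregation_mode in ("null", "prefill"):
        return ToyPrefillDelayer()
    # A decode engine has no prefill scheduling path: log and skip
    # rather than constructing an object that would never be used.
    logger.warning(
        "--enable-prefill-delayer is ignored in disaggregation mode %r",
        disaggregation_mode,
    )
    return None
```

Only the construction decision changes with the engine mode; nothing in this path touches model forward code, which is consistent with generation outputs being unaffected.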
Speed Tests and Profiling
Tested locally on 1×H200 node with `Qwen/Qwen3-30B-A3B`, launching the prefill engine with and without `--enable-prefill-delayer`.

(Without this PR the `--enable-prefill-delayer` variant aborts at `PrefillDelayer.__init__` with the assertion.)

Client: 4 concurrent requests, 8192-token prompt, 1 output token, round-robin over the 4 DP ranks, no client-side stagger.

Without `--enable-prefill-delayer`: requests split into two batches

Per-DP `Prefill batch` log on the prefill engine: DP0 fires alone first, DPs 1/2/3 fuse in the next iteration.

With `--enable-prefill-delayer`: requests fused into one batch

All four DPs prefill in the same iteration.
End-to-end wall clock for 4 concurrent reqs drops from 3.023 s to 1.957 s (~35% less wall-clock time) because the EP all-to-all is amortized over one forward pass instead of two.
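As a quick check on the reported figure, the two wall-clock timings can be turned into a relative reduction and an equivalent speedup factor. The variable names below are illustrative; the two timings are the ones reported above.

```python
baseline_s = 3.023  # without --enable-prefill-delayer
delayed_s = 1.957   # with --enable-prefill-delayer

reduction = 1 - delayed_s / baseline_s  # fraction of wall-clock time saved
speedup = baseline_s / delayed_s        # throughput-style speedup factor

print(f"{reduction:.1%} less wall-clock time, {speedup:.2f}x speedup")
# 35.3% less wall-clock time, 1.54x speedup
```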