
[PD+DP] Allow PrefillDelayer in disaggregated-prefill mode #23588

Merged
ByronHsu merged 1 commit into sgl-project:main from ByronHsu:enable-prefill-delayer-in-disagg
Apr 23, 2026

Conversation


ByronHsu (Collaborator) commented Apr 23, 2026

Motivation

PrefillDelayer is currently gated by assert disaggregation_mode == "null" in its constructor, so launching a disaggregated prefill engine with --enable-prefill-delayer aborts at scheduler init. However, the delayer's negotiation logic (a per-iteration all_gather over DP ranks of an all/none/mixed prefillable status, with bounded waiting controlled by max_delay_passes) is exactly the mechanism that a disaggregated-prefill + --enable-dp-attention deployment needs to fuse near-simultaneous prefills into a single forward pass, instead of paying for repeated "1 real + N idle" passes through the EP all-to-all.
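For concreteness, a minimal sketch of that negotiation scheme, with hypothetical names (the real logic lives in python/sglang/srt/managers/prefill_delayer.py):

# Sketch only -- illustrative, not the actual sglang implementation.
import torch.distributed as dist

class PrefillDelayerSketch:
    def __init__(self, cpu_group, max_delay_passes):
        self.cpu_group = cpu_group        # CPU process group spanning the DP ranks
        self.max_delay_passes = max_delay_passes
        self.delayed_passes = 0

    def should_delay(self, has_prefillable: bool) -> bool:
        # Each DP rank contributes its local "do I have a prefillable batch?"
        # status; after the all_gather every rank sees the same global picture.
        statuses = [None] * dist.get_world_size(self.cpu_group)
        dist.all_gather_object(statuses, has_prefillable, group=self.cpu_group)

        if all(statuses) or not any(statuses):
            # "all" or "none": the ranks already agree, nothing to wait for.
            self.delayed_passes = 0
            return False

        # "mixed": hold this rank's prefill for a bounded number of passes so
        # stragglers can catch up and the prefills fuse into one forward pass.
        if has_prefillable and self.delayed_passes < self.max_delay_passes:
            self.delayed_passes += 1
            return True
        self.delayed_passes = 0
        return False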

In a --tp 4 --dp 4 --ep 4 --enable-dp-attention --disaggregation-mode prefill setup, four concurrent client requests routed round-robin to four DP ranks currently end up as two scheduler iterations:

  1. Iteration 1: DP0 prefills its req, DP1/2/3 run idle batches (still paying the EP collective)
  2. Iteration 2: DP1/2/3 prefill their reqs

With this change, --enable-prefill-delayer can be used on the prefill engine to delay iteration 1 by a few passes so all four reqs land in the same forward, halving the number of EP all-to-alls and improving prefill throughput at small/medium prompt lengths where the per-pass collective overhead is non-trivial relative to the actual MoE compute.

Modifications

  • python/sglang/srt/managers/prefill_delayer.py: drop the disaggregation_mode == "null" assertion. The delayer's logic only depends on the per-iteration prefill scheduling path, which exists in both null and prefill modes.
  • python/sglang/srt/managers/scheduler.py: when constructing the delayer, skip it (and log) on a decode engine. A decode engine has no prefill scheduling path, so --enable-prefill-delayer would be silently unused; this makes that explicit and avoids allocating the gather buffer / cpu_group state on decode workers. A sketch of the gating follows this list.
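The gating referenced above has roughly the following shape (a sketch assuming the surrounding scheduler code, not the exact diff):

# Sketch of the scheduler-side construction (hypothetical surrounding code).
if server_args.enable_prefill_delayer:
    if self.disaggregation_mode == "decode":
        # A decode engine never walks the prefill scheduling path, so the
        # delayer would be constructed-and-unused: log and skip instead of
        # allocating its gather buffer / cpu_group state.
        logger.info("--enable-prefill-delayer has no effect on a decode engine; skipping")
    else:
        # Both "null" (co-located) and "prefill" engines schedule prefills
        # every iteration, so the delayer applies to both.
        self.prefill_delayer = PrefillDelayer(...)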

No behavior change for non-disaggregated deployments. Existing --prefill-delayer-max-delay-passes / --prefill-delayer-token-usage-low-watermark flags work unchanged on the prefill engine of a disaggregated PD setup.

Accuracy Tests

N/A — this is a scheduler-side gating change. No model forward / kernel code is touched. Generation outputs are unaffected.

Speed Tests and Profiling

Tested locally on 1×H200 node, Qwen/Qwen3-30B-A3B, prefill engine launched as:

python3 -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B \
  --tp 4 --dp 4 --ep 4 --enable-dp-attention \
  --disaggregation-mode prefill --disable-radix-cache \
  --load-balance-method round_robin --chunked-prefill-size 32768 \
  [--enable-prefill-delayer]

(Without this PR the --enable-prefill-delayer variant aborts at PrefillDelayer.__init__ with the assertion.)

Client: 4 concurrent requests, 8192-token prompt, 1 output token, round-robin over the 4 DP ranks, no client-side stagger.
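(client.py itself is not part of this PR; an illustrative stand-in that produces the same load pattern, assuming the default sglang /generate endpoint and port, might look like:)

# Hypothetical stand-in for client.py: 4 concurrent requests, no stagger.
import concurrent.futures, time, requests

URL = "http://localhost:30000/generate"  # assumed default server port
PROMPT = "x" * 30000                     # stand-in for an ~8192-token prompt

def send_one(_):
    resp = requests.post(URL, json={
        "text": PROMPT,
        "sampling_params": {"max_new_tokens": 1},
    })
    return resp.ok

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    oks = list(pool.map(send_one, range(4)))
print(f"sent 4 reqs in {time.time() - start:.3f}s, ok={sum(oks)}/4")

The server-side --load-balance-method round_robin spreads the four concurrent requests over the four DP ranks, so the client needs no routing logic.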

Without --enable-prefill-delayer: requests split into two batches

python3 client.py --batch-size 4 --input-len 8192 --output-len 1 \
  --tokenizer Qwen/Qwen3-30B-A3B --dp-size 4 --flush-cache --stagger-ms 0
sent 4 reqs in 3.023s, ok=4/4

Per-DP Prefill batch log on the prefill engine — DP0 fires alone first, DPs 1/2/3 fuse in the next iteration:

DP0 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP1 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 2
DP2 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 2
DP3 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 2

With --enable-prefill-delayer: requests fused into one batch

python3 client.py --batch-size 4 --input-len 8192 --output-len 1 \
  --tokenizer Qwen/Qwen3-30B-A3B --dp-size 4 --flush-cache --stagger-ms 0
sent 4 reqs in 1.957s, ok=4/4

All four DPs prefill in the same iteration:

DP0 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP1 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP2 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP3 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1

End-to-end wall clock for 4 concurrent reqs drops from 3.023 s to 1.957 s (a ~35% reduction), because the per-pass EP all-to-all is paid once instead of twice.


ByronHsu changed the title from "Allow PrefillDelayer in disaggregated-prefill mode" to "[PD+DP] Allow PrefillDelayer in disaggregated-prefill mode" on Apr 23, 2026
ByronHsu merged commit 1721035 into sgl-project:main on Apr 23, 2026
57 of 65 checks passed