
[PD+DP] Allow PrefillDelayer in disaggregated-prefill mode #23588

Merged
ByronHsu merged 1 commit into sgl-project:main from ByronHsu:enable-prefill-delayer-in-disagg
Apr 23, 2026

Conversation


ByronHsu (Collaborator) commented Apr 23, 2026

Motivation

PrefillDelayer is currently gated by assert disaggregation_mode == "null" in its constructor, so launching a disaggregated prefill engine with --enable-prefill-delayer aborts at scheduler init. However, the delayer's negotiation logic (a per-iteration all_gather over DP ranks of an all/none/mixed prefillable status, with bounded waiting controlled by max_delay_passes) is exactly the mechanism that a disaggregated-prefill + --enable-dp-attention deployment needs to fuse near-simultaneous prefills into a single forward pass, instead of paying for repeated "1 real + N idle" passes through the EP all-to-all.
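For concreteness, a minimal sketch of that negotiation scheme, with hypothetical names (the real logic lives in python/sglang/srt/managers/prefill_delayer.py):

# Sketch only -- illustrative, not the actual sglang implementation.
import torch.distributed as dist

class PrefillDelayerSketch:
    def __init__(self, cpu_group, max_delay_passes):
        self.cpu_group = cpu_group        # CPU process group spanning the DP ranks
        self.max_delay_passes = max_delay_passes
        self.delayed_passes = 0

    def should_delay(self, has_prefillable: bool) -> bool:
        # Each DP rank contributes its local "do I have a prefillable batch?"
        # status; after the all_gather every rank sees the same global picture.
        statuses = [None] * dist.get_world_size(self.cpu_group)
        dist.all_gather_object(statuses, has_prefillable, group=self.cpu_group)

        if all(statuses) or not any(statuses):
            # "all" or "none": the ranks already agree, nothing to wait for.
            self.delayed_passes = 0
            return False

        # "mixed": hold this rank's prefill for a bounded number of passes so
        # stragglers can catch up and the prefills fuse into one forward pass.
        if has_prefillable and self.delayed_passes < self.max_delay_passes:
            self.delayed_passes += 1
            return True
        self.delayed_passes = 0
        return False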

In a --tp 4 --dp 4 --ep 4 --enable-dp-attention --disaggregation-mode prefill setup, four concurrent client requests routed round-robin to four DP ranks currently end up as two scheduler iterations:

  1. Iteration 1: DP0 prefills its req, DP1/2/3 run idle batches (still paying the EP collective)
  2. Iteration 2: DP1/2/3 prefill their reqs

With this change, --enable-prefill-delayer can be used on the prefill engine to delay iteration 1 by a few passes so all four reqs land in the same forward, halving the number of EP all-to-alls and improving prefill throughput at small/medium prompt lengths where the per-pass collective overhead is non-trivial relative to the actual MoE compute.

Modifications

  • python/sglang/srt/managers/prefill_delayer.py: drop the disaggregation_mode == "null" assertion. The delayer's logic only depends on the per-iteration prefill scheduling path, which exists in both null and prefill modes.
  • python/sglang/srt/managers/scheduler.py: when constructing the delayer, skip it (and log) on a decode engine. A decode engine has no prefill scheduling path, so --enable-prefill-delayer would be silently unused; this makes that explicit and avoids allocating the gather buffer / cpu_group state on decode workers. A sketch of the gating follows this list.
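The gating referenced above has roughly the following shape (a sketch assuming the surrounding scheduler code, not the exact diff):

# Sketch of the scheduler-side construction (hypothetical surrounding code).
if server_args.enable_prefill_delayer:
    if self.disaggregation_mode == "decode":
        # A decode engine never walks the prefill scheduling path, so the
        # delayer would be constructed-and-unused: log and skip instead of
        # allocating its gather buffer / cpu_group state.
        logger.info("--enable-prefill-delayer has no effect on a decode engine; skipping")
    else:
        # Both "null" (co-located) and "prefill" engines schedule prefills
        # every iteration, so the delayer applies to both.
        self.prefill_delayer = PrefillDelayer(...)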

No behavior change for non-disaggregated deployments. Existing --prefill-delayer-max-delay-passes / --prefill-delayer-token-usage-low-watermark flags work unchanged on the prefill engine of a disaggregated PD setup.

Accuracy Tests

N/A — this is a scheduler-side gating change. No model forward / kernel code is touched. Generation outputs are unaffected.

Speed Tests and Profiling

Tested locally on 1×H200 node, Qwen/Qwen3-30B-A3B, prefill engine launched as:

python3 -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B \
  --tp 4 --dp 4 --ep 4 --enable-dp-attention \
  --disaggregation-mode prefill --disable-radix-cache \
  --load-balance-method round_robin --chunked-prefill-size 32768 \
  [--enable-prefill-delayer]

(Without this PR the --enable-prefill-delayer variant aborts at PrefillDelayer.__init__ with the assertion.)

Client: 4 concurrent requests, 8192-token prompt, 1 output token, round-robin over the 4 DP ranks, no client-side stagger.
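(client.py itself is not part of this PR; an illustrative stand-in that produces the same load pattern, assuming the default sglang /generate endpoint and port, might look like:)

# Hypothetical stand-in for client.py: 4 concurrent requests, no stagger.
import concurrent.futures, time, requests

URL = "http://localhost:30000/generate"  # assumed default server port
PROMPT = "x" * 30000                     # stand-in for an ~8192-token prompt

def send_one(_):
    resp = requests.post(URL, json={
        "text": PROMPT,
        "sampling_params": {"max_new_tokens": 1},
    })
    return resp.ok

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    oks = list(pool.map(send_one, range(4)))
print(f"sent 4 reqs in {time.time() - start:.3f}s, ok={sum(oks)}/4")

The server-side --load-balance-method round_robin spreads the four concurrent requests over the four DP ranks, so the client needs no routing logic.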

Without --enable-prefill-delayer: requests split into two batches

python3 client.py --batch-size 4 --input-len 8192 --output-len 1 \
  --tokenizer Qwen/Qwen3-30B-A3B --dp-size 4 --flush-cache --stagger-ms 0
sent 4 reqs in 3.023s, ok=4/4

Per-DP Prefill batch log on the prefill engine — DP0 fires alone first, DPs 1/2/3 fuse in the next iteration:

DP0 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP1 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 2
DP2 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 2
DP3 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 2

With --enable-prefill-delayer: requests fused into one batch

python3 client.py --batch-size 4 --input-len 8192 --output-len 1 \
  --tokenizer Qwen/Qwen3-30B-A3B --dp-size 4 --flush-cache --stagger-ms 0
sent 4 reqs in 1.957s, ok=4/4

All four DPs prefill in the same iteration:

DP0 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP1 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP2 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1
DP3 ... Prefill batch, #new-seq: 1, #new-token: 8192   <- iter 1

End-to-end wall clock for 4 concurrent reqs drops from 3.023 s to 1.957 s (a ~35% reduction), because the per-pass EP all-to-all is paid once instead of twice.


ByronHsu changed the title from "Allow PrefillDelayer in disaggregated-prefill mode" to "[PD+DP] Allow PrefillDelayer in disaggregated-prefill mode" on Apr 23, 2026
ByronHsu merged commit 1721035 into sgl-project:main on Apr 23, 2026
57 of 65 checks passed