[DLLM] Basic dLLM scheduling strategy and implementation #17484
ispobock merged 2 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a significant architectural refactor for Diffusion LLM (dLLM) scheduling within the system. The primary goal is to establish a clear and robust separation between prefill and decode batch processing for dLLM requests. This is achieved by introducing dedicated classes and methods that manage the lifecycle and scheduling of dLLM requests, from their initial incoming state through staging and execution. The changes enhance the system's ability to handle dLLM workloads efficiently and lay the groundwork for future optimizations specific to these distinct processing phases.
Code Review
This pull request introduces a significant refactoring of the dLLM scheduling architecture to separate prefill and decode batches. The changes include replacing DllmStagingReqs with a new DllmManager and a SchedulerDllmMixin to better encapsulate the dLLM-specific logic. This is a good architectural improvement that enhances modularity.
My review has identified one critical issue in the new DllmManager that could lead to requests being dropped during scheduling. I have also included a medium-severity suggestion to improve code conciseness. Please address the critical issue to ensure the correctness of the new scheduling logic.
```python
def init_next_round(self) -> None:
    """Initialize staging requests for next round and clear staging queue."""
    for req in self.staging_queue:
        req.init_next_round_input()
    self.staging_queue = []
```
In init_next_round, the staging_queue is cleared after processing, but the requests within it are not re-queued. This will cause unfinished chunked dLLM requests to be dropped from scheduling, leading to requests hanging. The requests from staging_queue should be moved to waiting_queue to be considered for the next scheduling cycle.
Suggested change:

```python
def init_next_round(self) -> None:
    """Initialize staging requests for next round and move them to the waiting queue."""
    for req in self.staging_queue:
        req.init_next_round_input()
    self.waiting_queue.extend(self.staging_queue)
    self.staging_queue = []
```
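For context, here is a minimal, self-contained sketch of how a manager with these two queues behaves once the suggestion is applied. The `DllmReq` stand-in and its no-op `init_next_round_input` are illustrative assumptions, not the actual sglang classes:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DllmReq:
    """Illustrative stand-in for a dLLM request; not the real sglang Req class."""
    rid: str

    def init_next_round_input(self) -> None:
        # Placeholder: the real request would rebuild its next-round inputs here.
        pass


@dataclass
class DllmManager:
    """Minimal sketch: unfinished staging requests flow back to the waiting queue."""
    waiting_queue: List[DllmReq] = field(default_factory=list)
    staging_queue: List[DllmReq] = field(default_factory=list)

    def init_next_round(self) -> None:
        for req in self.staging_queue:
            req.init_next_round_input()
        # Re-queue instead of dropping, as proposed in the review suggestion.
        self.waiting_queue.extend(self.staging_queue)
        self.staging_queue = []


manager = DllmManager(staging_queue=[DllmReq("req-0"), DllmReq("req-1")])
manager.init_next_round()
assert [r.rid for r in manager.waiting_queue] == ["req-0", "req-1"]
assert manager.staging_queue == []
```

With the re-queue step, chunked dLLM requests that have not finished prefill remain visible to the next scheduling cycle instead of silently disappearing.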
Force-pushed from 5dbfb88 to 48ad32d
Force-pushed from 48ad32d to cb32fbc
/tag-and-rerun-ci
Why, with this PR, in the test with the setting 4*H20 / TP4 / BS4 / LLaDA2.0-mini / CUDA Graph bs [1,2,3,4] / gsm8k,
Here is a summary of the key changes in this PR; I will add detailed reviews after Jan 29. @ClawSeven @zhaochenyang20
Signed-off-by: Zehuan Li <lizehuan.lzh@antgroup.com>
Force-pushed from cb32fbc to 902cf28
/rerun-failed-ci
Motivation
The previous dLLM scheduler relied on a chunked-prefill mechanism, which limited the implementation of efficient scheduling strategies. This PR introduces a new scheduling architecture.
Modifications
This PR focuses on refactoring the dLLM scheduling implementation. Previously, the scheduler would dynamically batch all blocks together for computation. Now, I've separated prefill and decode batches to eliminate redundant calculations that occurred when prefill and decode blocks were processed together. This lays the groundwork for implementing early exit and overlap scheduling in future iterations.
To maintain clean separation, the changes are consolidated in a new scheduler_dllm_mixin.py file. This keeps the dLLM request scheduling logic contained and prevents interference with the main AR branch execution flow.
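As a rough illustration of what separating prefill and decode batches inside such a mixin can look like, here is a hedged sketch. The `is_prefill_done` flag, the `ScheduleBatch` stand-in, and the method names are assumptions made for illustration; they do not mirror the actual code in scheduler_dllm_mixin.py:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DllmRequest:
    """Illustrative request; `is_prefill_done` is an assumed flag, not sglang's field."""
    rid: str
    is_prefill_done: bool = False


@dataclass
class ScheduleBatch:
    """Illustrative batch container standing in for the scheduler's batch type."""
    reqs: List[DllmRequest] = field(default_factory=list)
    is_prefill: bool = False


class SchedulerDllmMixin:
    """Sketch of keeping dLLM scheduling out of the main AR code path.

    Only the prefill/decode split reflects the PR description; everything
    else here is illustrative.
    """

    waiting_queue: List[DllmRequest]

    def get_next_dllm_batches(
        self, max_batch_size: int
    ) -> Tuple[Optional[ScheduleBatch], Optional[ScheduleBatch]]:
        # Partition requests instead of mixing prefill and decode blocks in one
        # batch, which previously caused redundant computation.
        prefill_reqs = [r for r in self.waiting_queue if not r.is_prefill_done]
        decode_reqs = [r for r in self.waiting_queue if r.is_prefill_done]

        prefill_batch = (
            ScheduleBatch(reqs=prefill_reqs[:max_batch_size], is_prefill=True)
            if prefill_reqs
            else None
        )
        decode_batch = (
            ScheduleBatch(reqs=decode_reqs[:max_batch_size], is_prefill=False)
            if decode_reqs
            else None
        )
        return prefill_batch, decode_batch


# Toy usage under the assumptions above: one request still in prefill, one decoding.
class ToyScheduler(SchedulerDllmMixin):
    def __init__(self, reqs: List[DllmRequest]) -> None:
        self.waiting_queue = reqs


sched = ToyScheduler([DllmRequest("a"), DllmRequest("b", is_prefill_done=True)])
prefill, decode = sched.get_next_dllm_batches(max_batch_size=4)
assert prefill is not None and [r.rid for r in prefill.reqs] == ["a"]
assert decode is not None and [r.rid for r in decode.reqs] == ["b"]
```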
Accuracy Tests, Benchmarking and Profiling
4*H20 / TP4 / BS4 / LLaDA2.0-mini / CUDA Graph bs [1,2,3,4] / gsm8k: 87.66 token/s -> 484.5 token/s
4*H20 / TP1 / BS4 / LLaDA2.0-mini / CUDA Graph bs [1,2,3,4] / gsm8k: 94.98 token/s -> 288.14 token/s
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci