[DLLM] Implement initial dynamic batching for diffusion LLM #14883
hnyls2002 merged 6 commits into sgl-project:main
Conversation
Summary of Changes

Hello @ClawSeven, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces significant enhancements to the SGLang framework by enabling dynamic batching for diffusion LLMs (DLLM). The changes allow the system to efficiently process multiple DLLM requests concurrently, leading to improved resource utilization and overall performance. This involved a fundamental shift in how chunked requests are managed, moving from a single-request model to a collection-based approach, alongside corresponding updates to the DLLM algorithm for batched processing and adaptations within the scheduler and output-handling mechanisms.

Highlights
Code Review
This pull request introduces support for dynamic batching in the diffusion language model (DLLM) components. The changes primarily involve modifying the scheduler to handle batches of chunked requests and updating the low-confidence algorithm to process these batches. I've identified a critical issue in the scheduler logic that could lead to incorrect behavior, as well as a significant performance issue in the memory pool's double-free check. Additionally, there are some opportunities for code improvement in terms of readability and efficiency, and an important FIXME comment that should be addressed.
```python
start = len(forward_batch.input_ids) - torch.sum(mask_index).item()

# Fast path: if there is no mask token, forward and save kv cache
if torch.sum(mask_index).item() == 0:
```
Using torch.sum(mask_index).item() == 0 to check for the absence of True values in a boolean tensor is less idiomatic and potentially less efficient than using not mask_index.any(). This pattern is repeated throughout the file (lines 49, 54, 67, 87). I recommend replacing all occurrences for better readability and performance.
```diff
- if torch.sum(mask_index).item() == 0:
+ if not mask_index.any():
```
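To illustrate the suggested change, here is a small standalone check (not part of this PR) showing that the two conditions agree for a boolean mask:

```python
import torch

# Hypothetical boolean mask with no masked (True) positions.
mask_index = torch.zeros(16, dtype=torch.bool)

# Original check: sums the mask and copies the count back to Python.
old_check = torch.sum(mask_index).item() == 0

# Suggested check: asks directly whether any element is True.
new_check = not mask_index.any()

assert old_check == new_check  # both report "no mask tokens present"
```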
/tag-and-rerun-ci

/rerun-failed-ci

1 similar comment

/rerun-failed-ci
```python
)


class DllmReqs:
```
I noticed that DllmReqs is not passed to the algorithm interface as part of the ForwardBatch yet. I wonder whether we will pass it to the algorithm interface in the future to enable the token shift for algorithms like fast-dllm-v2. @ClawSeven It looks like passing DllmReqs to the algorithm interface could enable more flexibility on the algorithm side.
I believe we will provide more request information to the DLLM algorithm, but I may not pass all of the reqs through.
We can discuss this issue in the other algorithm-support PRs, such as fast-dllm-v2. Keeping this as-is for now.
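For readers following the thread, here is a minimal sketch of what a collection-based chunked-request container could look like; the class name, fields, and helpers below are illustrative assumptions, not the actual DllmReqs definition from this PR:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    # Illustrative request stub; the real scheduler Req carries much more state.
    rid: str
    input_ids: List[int]


@dataclass
class DllmReqsSketch:
    """Hypothetical collection of chunked DLLM requests handled as one batch."""

    reqs: List[Req] = field(default_factory=list)

    def add(self, req: Req) -> None:
        self.reqs.append(req)

    def __len__(self) -> int:
        return len(self.reqs)

    def is_empty(self) -> bool:
        return not self.reqs
```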
```python
)

max_running_requests = (
    1
```
Shall we set it to a more reasonable default value than 1 in the future? In speculative decoding, for example, it is set to 48 (see sglang/python/sglang/srt/server_args.py, line 2146 at d0fb24e). Even if we want to show a latency-sensitive case, maybe 2 would be better?
The current dLLM implementation does not separate prefill and decode batching, so this PR is still an initial version that only supports dynamic batching. Performance optimizations will be added later. The default batch size will be set to 8 in the future.
I see... That makes sense
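The defaulting pattern under discussion can be sketched as a small helper; this is an illustrative reconstruction based on the comments above (the function name and the comparison to speculative decoding are assumptions), not the exact server_args code:

```python
from typing import Optional


def resolve_max_running_requests(user_value: Optional[int]) -> int:
    """Sketch of the defaulting logic: DLLM currently falls back to 1 when
    --max-running-requests is not given (speculative decoding uses the same
    kind of pattern with a larger default of 48)."""
    dllm_default = 1  # conservative initial default discussed above
    return dllm_default if user_value is None else user_value


# An explicit --max-running-requests enables larger dynamic batches.
assert resolve_max_running_requests(None) == 1
assert resolve_max_running_requests(8) == 8
```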
```python
    curr_block_start:curr_block_end,
]

x = torch.argmax(curr_logits, dim=-1)
```
In the future, we may need to replace the argmax with sampling. See #16615.
Yes, we have a PR for implementing more sampling algorithms.
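As a rough illustration of that direction (not code from this PR or the linked one), the greedy argmax over the block logits could later be replaced by temperature sampling along these lines; `curr_logits` and the temperature value are placeholders:

```python
import torch

# Placeholder logits for a block of masked positions: [num_tokens, vocab_size].
curr_logits = torch.randn(4, 32000)

temperature = 0.7  # illustrative value

if temperature == 0.0:
    # Greedy decoding, equivalent to the current argmax path.
    x = torch.argmax(curr_logits, dim=-1)
else:
    # Temperature sampling: scale logits, softmax, then draw one token per position.
    probs = torch.softmax(curr_logits / temperature, dim=-1)
    x = torch.multinomial(probs, num_samples=1).squeeze(-1)
```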
hnyls2002 left a comment:
Add a unit test for that.
I don't think [...]. Maybe rename it to something like: [...]

For the [...]
@hnyls2002 Hi, [...]
/tag-and-rerun-ci
Motivation
The current dLLM implementation lacks batching capabilities, with the batch size set to 1 by default. Additionally, it is tightly coupled with chunked prefill execution, limiting flexibility and extensibility.
Modifications
This PR introduces two key improvements: the scheduler now manages chunked DLLM requests as a collection (rather than a single request), and the DLLM algorithm has been updated to process these requests as a batch.

No new arguments are required; batching relies solely on the existing max-running-requests configuration. Note that this is an initial implementation of DLLM batching and is not yet optimized: currently, prefill and decoding requests are batched together, which may lead to redundant computation. We will address these optimizations in subsequent PRs.
Currently, we still recommend using batch size 1 for running dLLM until the batching optimization PRs are merged.
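As a hedged usage sketch, enabling larger dynamic batches only involves raising the existing max-running-requests setting; the model path below is just an example diffusion LLM checkpoint, and DLLM serving may require additional flags not shown here:

```python
import sglang as sgl

# Illustrative launch via SGLang's offline Engine API; the exact keyword
# plumbing for diffusion-LLM checkpoints may differ from this assumption.
engine = sgl.Engine(
    model_path="GSAI-ML/LLaDA-8B-Instruct",  # example diffusion LLM checkpoint
    max_running_requests=8,                  # allow up to 8 requests per dynamic batch
)

outputs = engine.generate(
    ["What is dynamic batching?", "Explain diffusion LLM decoding."],
    {"max_new_tokens": 64},
)
print(outputs)
```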
Accuracy Tests
Tested on H20
Here I added a dLLM batching‑accuracy test instead of changing the max‑running‑requests argument in the previous LLaDA unit test, since the current batching performance is suboptimal. Once the optimization PRs are merged, I’ll remove this test.
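For reference, one assumed shape of such a batching-accuracy check is to compare completions issued concurrently (so the scheduler batches them) against completions issued one at a time; the endpoint, prompts, and the batch-invariance assertion below are illustrative assumptions, not the actual test added in this PR:

```python
import concurrent.futures

import requests

BASE_URL = "http://127.0.0.1:30000"  # assumed local SGLang server


def complete(prompt: str) -> str:
    # Query the OpenAI-compatible completions endpoint exposed by the server.
    resp = requests.post(
        f"{BASE_URL}/v1/completions",
        json={"model": "default", "prompt": prompt, "max_tokens": 32, "temperature": 0},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]


def test_batched_matches_single():
    prompts = ["1+1=", "The capital of France is"]
    # Reference: requests issued one at a time.
    single = [complete(p) for p in prompts]
    # Batched: the same prompts issued concurrently so the scheduler batches them.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        batched = list(pool.map(complete, prompts))
    # Assumes greedy decoding is batch-invariant, which may not hold exactly.
    assert batched == single
```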
Benchmarking and Profiling
Checklist