[DLLM] Implement initial dynamic batching for diffusion LLM #14883

Merged: hnyls2002 merged 6 commits into sgl-project:main from ClawSeven:dllm-batching on Jan 17, 2026

Conversation

@ClawSeven (Collaborator) commented Dec 11, 2025

Motivation

The current dLLM implementation lacks batching support, with the batch size fixed at 1 by default. Additionally, it is tightly coupled with chunked-prefill execution, limiting flexibility and extensibility.

Modifications

This PR introduces two key improvements:

  • Basic Batching Support: Enables the dLLM to batch incoming requests when the number of running requests has not reached the configured limit. New requests will be grouped with existing batches to improve resource utilization.
  • Decoupling of dLLM and Chunked Prefill: Separates the dLLM logic from the chunked prefill execution, laying the foundation for future enhancements such as radix cache support and overlap scheduling.

No new arguments are required; this relies solely on the existing max-running-requests configuration. Note that this is an initial implementation of dLLM batching and is not yet optimized. Currently, prefill and decoding requests are batched together, which may lead to redundant computation. We will address these optimizations in subsequent PRs.

Currently, we still recommend using batch size 1 for running dLLM until the batching optimization PRs are merged.
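
To make the admission behavior concrete, here is a minimal, hypothetical sketch of grouping new requests into the running batch until the limit is reached. The names (Req, StagingReqs, try_admit) are illustrative only and do not mirror the actual sglang scheduler code:

# Hypothetical sketch of the admission policy described above; the names
# here are illustrative, not the real sglang API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    rid: int


@dataclass
class StagingReqs:
    max_running_requests: int
    running: List[Req] = field(default_factory=list)

    def try_admit(self, waiting: List[Req]) -> List[Req]:
        """Group new requests into the running batch until the limit is hit."""
        admitted: List[Req] = []
        while waiting and len(self.running) < self.max_running_requests:
            admitted.append(waiting.pop(0))
        self.running.extend(admitted)
        return admitted

With max_running_requests=8 and three waiting requests, all three would join the running batch in a single scheduling step.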

Accuracy Tests

Tested on H20

Accuracy: 0.900
Invalid: 0.000
Latency: 263.453 s
Output throughput: 97.638 token/s
metrics={'accuracy': 0.9, 'invalid': 0.0, 'latency': 263.45285430923104, 'output_throughput': 97.63796284327712}

Here I added a dLLM batching-accuracy test instead of changing the max-running-requests argument in the previous LLaDA unit test, since the current batching performance is suboptimal. Once the optimization PRs are merged, I'll remove this test.

Benchmarking and Profiling

Checklist

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @ClawSeven, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant enhancements to the SGLang framework by enabling dynamic batching for Diffusion LLMs (DLLM). The changes allow the system to efficiently process multiple DLLM requests concurrently, leading to improved resource utilization and overall performance. This involved a fundamental shift in how chunked requests are managed, moving from a single request model to a collection-based approach, alongside corresponding updates to the DLLM algorithm for batched processing and adaptations within the scheduler and output handling mechanisms.

Highlights

  • Dynamic Batching for DLLM: The core change introduces dynamic batching capabilities for Diffusion LLMs (DLLM), allowing the system to process multiple DLLM requests concurrently within a single batch, enhancing throughput and efficiency.
  • New ChunkedReqs Class: A new ChunkedReqs class has been implemented to manage a collection of chunked requests, replacing the previous single chunked_req approach. This class handles adding, flushing, and updating the status of multiple requests (a minimal sketch follows this list).
  • Configurable Concurrent Requests: The DllmConfig now includes a max_running_requests parameter, enabling users to configure the maximum number of concurrent DLLM requests the system can handle.
  • Batched Processing in Low Confidence Algorithm: The LowConfidence algorithm (low_confidence.py) has been updated to process input IDs and logits in a batched manner, iterating over multiple blocks and returning a list of next token IDs, reflecting the new batched approach.
  • Scheduler and Output Processor Refactoring: The scheduler and output processing logic have been refactored to seamlessly integrate with the new ChunkedReqs object and correctly handle the batched output generated by the DLLM algorithm.
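
Based only on the add/flush/update behavior described in the highlights above, a ChunkedReqs-style container might look roughly like the following. This is an illustrative sketch; the real class in the PR may differ substantially:

# Illustrative-only sketch of a ChunkedReqs-style container, inferred from
# the described behavior, not taken from the PR's actual implementation.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Req:
    rid: int
    finished: bool = False


class ChunkedReqs:
    def __init__(self) -> None:
        self.reqs: Dict[int, Req] = {}  # in-flight requests keyed by id

    def add(self, req: Req) -> None:
        # Track a new in-flight chunked request.
        self.reqs[req.rid] = req

    def flush_finished(self) -> List[Req]:
        # Remove and return requests whose final chunk has completed.
        done = [r for r in self.reqs.values() if r.finished]
        for r in done:
            del self.reqs[r.rid]
        return done

    def non_empty(self) -> bool:
        return bool(self.reqs)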

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for dynamic batching in the Diffusion Language Model (DLLM) components. The changes primarily involve modifying the scheduler to handle batches of chunked requests and updating the low-confidence algorithm to process these batches. I've identified a critical issue in the scheduler logic that could lead to incorrect behavior, as well as a significant performance issue in the memory pool's double-free check. Additionally, there are some opportunities for code improvement in terms of readability and efficiency, and an important FIXME that should be addressed.

Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment thread python/sglang/srt/managers/schedule_policy.py Outdated
Comment thread python/sglang/srt/mem_cache/memory_pool.py
start = len(forward_batch.input_ids) - torch.sum(mask_index).item()

# Fast path: if there is no mask token, forward and save kv cache
if torch.sum(mask_index).item() == 0:
@gemini-code-assist (Bot) commented (severity: medium)

Using torch.sum(mask_index).item() == 0 to check for the absence of True values in a boolean tensor is less idiomatic and potentially less efficient than using not mask_index.any(). This pattern is repeated throughout the file (lines 49, 54, 67, 87). I recommend replacing all occurrences for better readability and performance.

Suggested change
if torch.sum(mask_index).item() == 0:
if not mask_index.any():
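
As a quick sanity check that the suggested rewrite is behavior-preserving, the two expressions agree for boolean tensors (a tiny self-contained example; mask_index is just the name reused from the diff above):

import torch

# The count-based check and the any()-based check agree for bool masks.
for mask_index in (torch.tensor([False, False]), torch.tensor([False, True])):
    assert (torch.sum(mask_index).item() == 0) == (not mask_index.any())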

@ClawSeven force-pushed the dllm-batching branch 2 times, most recently from 82a13b7 to 5558186 on December 25, 2025 12:35
@ClawSeven ClawSeven marked this pull request as ready for review December 25, 2025 12:36
@ClawSeven (Collaborator, Author) commented:

/tag-and-rerun-ci

@sgl-project deleted a comment from ClawSeven Dec 27, 2025
@zhaochenyang20 (Collaborator) commented:

/rerun-failed-ci

1 similar comment
@ClawSeven (Collaborator, Author) commented:

/rerun-failed-ci

)


class DllmReqs:
A contributor commented:

I noticed that DllmReqs is not yet passed (e.g., as part of the ForwardBatch) to the algorithm interface. I wonder whether we will pass it there in the future to enable the token shift for algorithms like fast-dllm-v2. @ClawSeven It looks like passing the DllmReqs to the algorithm interface could enable more flexibility on the algorithm side.

@ClawSeven (Collaborator, Author) replied:

I believe we will expose more request information to the DLLM algorithm, but I may not pass all of the reqs.

The contributor replied:

We can discuss this in the other algorithm-support PR (e.g., fast-dllm-v2). Keeping this as-is for now.

)

max_running_requests = (
1
@Monstertail (Contributor) commented Jan 3, 2026:

Shall we set this to a more reasonable default than 1 in the future? In speculative decoding, for example, it is set to 48:

self.max_running_requests = 48

Even if we want to show a latency-sensitive case, maybe 2 would be better?

@ClawSeven (Collaborator, Author) replied:

The current dLLM implementation does not separate prefill and decode batching; this PR is an initial version that only supports dynamic batching. Performance optimization will be added later, and the default batch size will then be set to 8.

The contributor replied:

I see... That makes sense.

curr_block_start:curr_block_end,
]

x = torch.argmax(curr_logits, dim=-1)
@Monstertail (Contributor) commented Jan 7, 2026:

In the future, we may need to replace the argmax to support sampling. See #16615.

@ClawSeven (Collaborator, Author) replied:

Yes, we have a PR for implementing more sampling algorithms.
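
For context, here is a hedged sketch of what swapping the greedy argmax for temperature sampling could look like. The name curr_logits is reused from the diff above; none of this is taken from #16615 or the actual sampling PR:

import torch


def select_tokens(curr_logits: torch.Tensor, temperature: float = 0.0) -> torch.Tensor:
    """Pick one token id per position; assumes curr_logits is [positions, vocab]."""
    if temperature == 0.0:
        # Greedy path, equivalent to the current torch.argmax call.
        return torch.argmax(curr_logits, dim=-1)
    probs = torch.softmax(curr_logits / temperature, dim=-1)
    # torch.multinomial accepts at most 2-D input; one sample per row here.
    return torch.multinomial(probs, num_samples=1).squeeze(-1)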

@hnyls2002 (Collaborator) left a comment:

Add a unit-test for that.

Comment thread python/sglang/srt/managers/utils.py Outdated
@hnyls2002 (Collaborator) commented:

@ClawSeven

I don’t think dllm_reqs is a good name here.
This variable represents an intermediate/in-flight state (similar to chunked_req), not a stable set of requests. Using dllm_reqs is confusing and makes it unclear what stage these requests are in.

Maybe rename it to something like:

  • dllm_running_requests
  • dllm_ongoing_requests

For the has_running_req logic, you can just rely on the container itself (e.g. size() / empty()).

Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment thread python/sglang/srt/managers/schedule_batch.py Outdated
Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment thread python/sglang/srt/managers/scheduler.py Outdated
Comment thread python/sglang/srt/managers/scheduler.py Outdated
@ClawSeven changed the title from "[DLLM] Support dynamic batching in dllm" to "[DLLM] Implement initial dynamic batching for diffusion LLM" on Jan 15, 2026
@ClawSeven (Collaborator, Author) replied, quoting @hnyls2002's naming suggestion above:

@hnyls2002 Hi,
I'm currently using dllm_staging_reqs as a replacement for dllm_reqs, and I'm using non_empty() as the function name instead of has_running_req. Do you have any suggestions or feedback?
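
For illustration, the container-based emptiness check might reduce to something as small as this (a sketch under the assumption that the staging container wraps a plain list; not the merged code):

class DllmStagingReqs:
    def __init__(self) -> None:
        self.reqs = []

    def non_empty(self) -> bool:
        # Replaces a separate has_running_req flag, per the review discussion.
        return len(self.reqs) > 0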

@hnyls2002 (Collaborator) commented:

/tag-and-rerun-ci
