[DLLM] Implement initial dynamic batching for diffusion LLM #14883
hnyls2002 merged 6 commits into sgl-project:main
Conversation
Summary of Changes

Hello @ClawSeven, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces significant enhancements to the SGLang framework by enabling dynamic batching for diffusion LLMs (DLLM). The changes allow the system to efficiently process multiple DLLM requests concurrently, leading to improved resource utilization and overall performance. This involved a fundamental shift in how chunked requests are managed, moving from a single-request model to a collection-based approach, alongside corresponding updates to the DLLM algorithm for batched processing and adaptations within the scheduler and output-handling mechanisms.

Highlights
Code Review
This pull request introduces support for dynamic batching in the diffusion language model (DLLM) components. The changes primarily involve modifying the scheduler to handle batches of chunked requests and updating the low-confidence algorithm to process these batches. I've identified a critical issue in the scheduler logic that could lead to incorrect behavior, as well as a significant performance issue in the memory pool's double-free check. Additionally, there are some opportunities for code improvement in terms of readability and efficiency, and an important FIXME comment that should be addressed.
```python
start = len(forward_batch.input_ids) - torch.sum(mask_index).item()

# Fast path: if there is no mask token, forward and save kv cache
if torch.sum(mask_index).item() == 0:
```
Using torch.sum(mask_index).item() == 0 to check for the absence of True values in a boolean tensor is less idiomatic and potentially less efficient than using not mask_index.any(). This pattern is repeated throughout the file (lines 49, 54, 67, 87). I recommend replacing all occurrences for better readability and performance.
```diff
- if torch.sum(mask_index).item() == 0:
+ if not mask_index.any():
```
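To illustrate the suggested change, here is a small standalone check (not part of this PR) showing that the two conditions agree for a boolean mask:

```python
import torch

# Hypothetical boolean mask with no masked (True) positions.
mask_index = torch.zeros(16, dtype=torch.bool)

# Original check: sums the mask and copies the count back to Python.
old_check = torch.sum(mask_index).item() == 0

# Suggested check: asks directly whether any element is True.
new_check = not mask_index.any()

assert old_check == new_check  # both report "no mask tokens present"
```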
/tag-and-rerun-ci

/rerun-failed-ci

1 similar comment

/rerun-failed-ci
```python
)


class DllmReqs:
```
I noticed that DllmReqs is not passed to the algorithm interface as part of the ForwardBatch yet. I wonder whether we will pass it to the algorithm interface in the future to enable the token shift for algorithms like fast-dllm-v2. @ClawSeven It looks like passing DllmReqs to the algorithm interface could enable more flexibility on the algorithm side.
I believe we will provide more request information to the DLLM algorithm, but I may not pass all of the reqs through.
We can discuss this issue in the other algorithm-support PRs, such as fast-dllm-v2. Keeping this as-is for now.
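For readers following the thread, here is a minimal sketch of what a collection-based chunked-request container could look like; the class name, fields, and helpers below are illustrative assumptions, not the actual DllmReqs definition from this PR:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Req:
    # Illustrative request stub; the real scheduler Req carries much more state.
    rid: str
    input_ids: List[int]


@dataclass
class DllmReqsSketch:
    """Hypothetical collection of chunked DLLM requests handled as one batch."""

    reqs: List[Req] = field(default_factory=list)

    def add(self, req: Req) -> None:
        self.reqs.append(req)

    def __len__(self) -> int:
        return len(self.reqs)

    def is_empty(self) -> bool:
        return not self.reqs
```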
```python
)

max_running_requests = (
    1
```
Shall we set it to a more reasonable default value than 1 in the future? In speculative decoding, for example, it is set to 48 (see sglang/python/sglang/srt/server_args.py, line 2146 at d0fb24e). Even if we want to show a latency-sensitive case, maybe 2 would be better?
The current dLLM implementation does not separate prefill and decode batching, so this PR is still an initial version that only supports dynamic batching. Performance optimizations will be added later. The default batch size will be set to 8 in the future.
I see... That makes sense
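The defaulting pattern under discussion can be sketched as a small helper; this is an illustrative reconstruction based on the comments above (the function name and the comparison to speculative decoding are assumptions), not the exact server_args code:

```python
from typing import Optional


def resolve_max_running_requests(user_value: Optional[int]) -> int:
    """Sketch of the defaulting logic: DLLM currently falls back to 1 when
    --max-running-requests is not given (speculative decoding uses the same
    kind of pattern with a larger default of 48)."""
    dllm_default = 1  # conservative initial default discussed above
    return dllm_default if user_value is None else user_value


# An explicit --max-running-requests enables larger dynamic batches.
assert resolve_max_running_requests(None) == 1
assert resolve_max_running_requests(8) == 8
```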
```python
    curr_block_start:curr_block_end,
]

x = torch.argmax(curr_logits, dim=-1)
```
In the future, we may need to replace the argmax with sampling. See #16615.
Yes, we have a PR for implementing more sampling algorithms.
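As a rough illustration of that direction (not code from this PR or the linked one), the greedy argmax over the block logits could later be replaced by temperature sampling along these lines; `curr_logits` and the temperature value are placeholders:

```python
import torch

# Placeholder logits for a block of masked positions: [num_tokens, vocab_size].
curr_logits = torch.randn(4, 32000)

temperature = 0.7  # illustrative value

if temperature == 0.0:
    # Greedy decoding, equivalent to the current argmax path.
    x = torch.argmax(curr_logits, dim=-1)
else:
    # Temperature sampling: scale logits, softmax, then draw one token per position.
    probs = torch.softmax(curr_logits / temperature, dim=-1)
    x = torch.multinomial(probs, num_samples=1).squeeze(-1)
```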
hnyls2002 left a comment:
Add a unit test for that.
I don't think [...]. Maybe rename it to something like: [...]

For the [...]
@hnyls2002 Hi, [...]
/tag-and-rerun-ci
Motivation
The current dLLM implementation lacks batching capabilities, with the batch size set to 1 by default. Additionally, it is tightly coupled with chunked prefill execution, limiting flexibility and extensibility.
Modifications
This PR introduces two key improvements: the scheduler now manages chunked DLLM requests as a collection (rather than a single request), and the DLLM algorithm has been updated to process these requests as a batch.

No new arguments are required; batching relies solely on the existing max-running-requests configuration. Note that this is an initial implementation of DLLM batching and is not yet optimized: currently, prefill and decoding requests are batched together, which may lead to redundant computation. We will address these optimizations in subsequent PRs.
Currently, we still recommend using batch size 1 for running dLLM until the batching optimization PRs are merged.
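As a hedged usage sketch, enabling larger dynamic batches only involves raising the existing max-running-requests setting; the model path below is just an example diffusion LLM checkpoint, and DLLM serving may require additional flags not shown here:

```python
import sglang as sgl

# Illustrative launch via SGLang's offline Engine API; the exact keyword
# plumbing for diffusion-LLM checkpoints may differ from this assumption.
engine = sgl.Engine(
    model_path="GSAI-ML/LLaDA-8B-Instruct",  # example diffusion LLM checkpoint
    max_running_requests=8,                  # allow up to 8 requests per dynamic batch
)

outputs = engine.generate(
    ["What is dynamic batching?", "Explain diffusion LLM decoding."],
    {"max_new_tokens": 64},
)
print(outputs)
```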
Accuracy Tests
Tested on H20
Here I added a dLLM batching‑accuracy test instead of changing the max‑running‑requests argument in the previous LLaDA unit test, since the current batching performance is suboptimal. Once the optimization PRs are merged, I’ll remove this test.
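For reference, one assumed shape of such a batching-accuracy check is to compare completions issued concurrently (so the scheduler batches them) against completions issued one at a time; the endpoint, prompts, and the batch-invariance assertion below are illustrative assumptions, not the actual test added in this PR:

```python
import concurrent.futures

import requests

BASE_URL = "http://127.0.0.1:30000"  # assumed local SGLang server


def complete(prompt: str) -> str:
    # Query the OpenAI-compatible completions endpoint exposed by the server.
    resp = requests.post(
        f"{BASE_URL}/v1/completions",
        json={"model": "default", "prompt": prompt, "max_tokens": 32, "temperature": 0},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]


def test_batched_matches_single():
    prompts = ["1+1=", "The capital of France is"]
    # Reference: requests issued one at a time.
    single = [complete(p) for p in prompts]
    # Batched: the same prompts issued concurrently so the scheduler batches them.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        batched = list(pool.map(complete, prompts))
    # Assumes greedy decoding is batch-invariant, which may not hold exactly.
    assert batched == single
```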
Benchmarking and Profiling
Checklist