[DLLM] Basic dLLM scheduling strategy and implementation #17484
ispobock merged 2 commits into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a significant architectural refactor for Diffusion LLM (dLLM) scheduling within the system. The primary goal is to establish a clear and robust separation between prefill and decode batch processing for dLLM requests. This is achieved by introducing dedicated classes and methods that manage the lifecycle and scheduling of dLLM requests, from their initial incoming state through staging and execution. The changes enhance the system's ability to handle dLLM workloads efficiently and lay the groundwork for future optimizations specific to these distinct processing phases.
Code Review
This pull request introduces a significant refactoring of the dLLM scheduling architecture to separate prefill and decode batches. The changes include replacing DllmStagingReqs with a new DllmManager and a SchedulerDllmMixin to better encapsulate the dLLM-specific logic. This is a good architectural improvement that enhances modularity.
My review has identified one critical issue in the new DllmManager that could lead to requests being dropped during scheduling. I have also included a medium-severity suggestion to improve code conciseness. Please address the critical issue to ensure the correctness of the new scheduling logic.
```python
def init_next_round(self) -> None:
    """Initialize staging requests for next round and clear staging queue."""
    for req in self.staging_queue:
        req.init_next_round_input()
    self.staging_queue = []
```
In init_next_round, the staging_queue is cleared after processing, but the requests within it are not re-queued. This will cause unfinished chunked dLLM requests to be dropped from scheduling, leading to requests hanging. The requests from staging_queue should be moved to waiting_queue to be considered for the next scheduling cycle.
Suggested change:

```python
def init_next_round(self) -> None:
    """Initialize staging requests for next round and move them to the waiting queue."""
    for req in self.staging_queue:
        req.init_next_round_input()
    self.waiting_queue.extend(self.staging_queue)
    self.staging_queue = []
```
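For context, here is a minimal, self-contained sketch of how a manager with these two queues behaves once the suggestion is applied. The `DllmReq` stand-in and its no-op `init_next_round_input` are illustrative assumptions, not the actual sglang classes:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DllmReq:
    """Illustrative stand-in for a dLLM request; not the real sglang Req class."""
    rid: str

    def init_next_round_input(self) -> None:
        # Placeholder: the real request would rebuild its next-round inputs here.
        pass


@dataclass
class DllmManager:
    """Minimal sketch: unfinished staging requests flow back to the waiting queue."""
    waiting_queue: List[DllmReq] = field(default_factory=list)
    staging_queue: List[DllmReq] = field(default_factory=list)

    def init_next_round(self) -> None:
        for req in self.staging_queue:
            req.init_next_round_input()
        # Re-queue instead of dropping, as proposed in the review suggestion.
        self.waiting_queue.extend(self.staging_queue)
        self.staging_queue = []


manager = DllmManager(staging_queue=[DllmReq("req-0"), DllmReq("req-1")])
manager.init_next_round()
assert [r.rid for r in manager.waiting_queue] == ["req-0", "req-1"]
assert manager.staging_queue == []
```

With the re-queue step, chunked dLLM requests that have not finished prefill remain visible to the next scheduling cycle instead of silently disappearing.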
Force-pushed from 5dbfb88 to 48ad32d
Force-pushed from 48ad32d to cb32fbc
/tag-and-rerun-ci
Why, with this PR, in the test with the setting 4*H20 / TP4 / BS4 / LLaDA2.0-mini / CUDA Graph bs [1,2,3,4] / gsm8k,
Here is a summary of the key changes in this PR; I will add detailed reviews after Jan 29. @ClawSeven @zhaochenyang20
Signed-off-by: Zehuan Li <lizehuan.lzh@antgroup.com>
Force-pushed from cb32fbc to 902cf28
/rerun-failed-ci
Motivation
The previous dLLM scheduler relied on a chunked-prefill mechanism, which limited the implementation of efficient scheduling strategies. This PR introduces a new scheduling architecture.
Modifications
This PR focuses on refactoring the dLLM scheduling implementation. Previously, the scheduler would dynamically batch all blocks together for computation. Now, I've separated prefill and decode batches to eliminate redundant calculations that occurred when prefill and decode blocks were processed together. This lays the groundwork for implementing early exit and overlap scheduling in future iterations.
To maintain clean separation, the changes are consolidated in a new scheduler_dllm_mixin.py file. This keeps the dLLM request scheduling logic contained and prevents interference with the main AR branch execution flow.
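As a rough illustration of what separating prefill and decode batches inside such a mixin can look like, here is a hedged sketch. The `is_prefill_done` flag, the `ScheduleBatch` stand-in, and the method names are assumptions made for illustration; they do not mirror the actual code in scheduler_dllm_mixin.py:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DllmRequest:
    """Illustrative request; `is_prefill_done` is an assumed flag, not sglang's field."""
    rid: str
    is_prefill_done: bool = False


@dataclass
class ScheduleBatch:
    """Illustrative batch container standing in for the scheduler's batch type."""
    reqs: List[DllmRequest] = field(default_factory=list)
    is_prefill: bool = False


class SchedulerDllmMixin:
    """Sketch of keeping dLLM scheduling out of the main AR code path.

    Only the prefill/decode split reflects the PR description; everything
    else here is illustrative.
    """

    waiting_queue: List[DllmRequest]

    def get_next_dllm_batches(
        self, max_batch_size: int
    ) -> Tuple[Optional[ScheduleBatch], Optional[ScheduleBatch]]:
        # Partition requests instead of mixing prefill and decode blocks in one
        # batch, which previously caused redundant computation.
        prefill_reqs = [r for r in self.waiting_queue if not r.is_prefill_done]
        decode_reqs = [r for r in self.waiting_queue if r.is_prefill_done]

        prefill_batch = (
            ScheduleBatch(reqs=prefill_reqs[:max_batch_size], is_prefill=True)
            if prefill_reqs
            else None
        )
        decode_batch = (
            ScheduleBatch(reqs=decode_reqs[:max_batch_size], is_prefill=False)
            if decode_reqs
            else None
        )
        return prefill_batch, decode_batch


# Toy usage under the assumptions above: one request still in prefill, one decoding.
class ToyScheduler(SchedulerDllmMixin):
    def __init__(self, reqs: List[DllmRequest]) -> None:
        self.waiting_queue = reqs


sched = ToyScheduler([DllmRequest("a"), DllmRequest("b", is_prefill_done=True)])
prefill, decode = sched.get_next_dllm_batches(max_batch_size=4)
assert prefill is not None and [r.rid for r in prefill.reqs] == ["a"]
assert decode is not None and [r.rid for r in decode.reqs] == ["b"]
```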
Accuracy Tests, Benchmarking and Profiling
4*H20 / TP4 / BS4 / LLaDA2.0-mini / CUDA Graph bs [1,2,3,4] / gsm8k: 87.66 token/s -> 484.5 token/s
4*H20 / TP1 / BS4 / LLaDA2.0-mini / CUDA Graph bs [1,2,3,4] / gsm8k: 94.98 token/s -> 288.14 token/s
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci