Fix AssertionError crash in disagg prefill inflight queue with PP#20686
Conversation
In PD disaggregation + pipeline parallelism (PP) mode, the previous
PP rank may reach a terminal poll state (Success/Failed) and propagate
the rid via consensus before the current rank's local poll catches up.
The local poll can still be in a transient state (Transferring,
WaitingForInput, or even Bootstrapping) due to clock skew or
propagation delay.
The hard assert at line 572 crashed the scheduler when the local poll
was not Success or Failed:
assert poll == KVPoll.Success or poll == KVPoll.Failed
Replace both asserts in process_disagg_prefill_inflight_queue with
defensive handling that logs a warning and treats the request as
undone. The request stays in the inflight queue and will be retried
in the next iteration, giving the local transfer time to complete.
Fixes sgl-project#20485
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical stability issue in the pipeline-parallel (PP) disaggregated prefill mechanism. Previously, transient states caused by clock skew between ranks could lead to AssertionError crashes in the scheduler. The changes introduce more robust error handling by gracefully re-queuing requests in unexpected or transient states and logging warnings, significantly improving the system's resilience and preventing service disruptions. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses an AssertionError crash within the disaggregated prefill inflight queue when operating with pipeline parallelism. The changes replace hard assertions on poll states with more robust error handling. Now, transient or unexpected states are treated as "undone" and are re-queued, accompanied by a warning log. This prevents the scheduler from crashing due to transient conditions like cross-rank clock skew. Additionally, a new test file with comprehensive unit tests has been introduced to validate this new, more resilient logic. The implementation is correct and effectively resolves the reported issue.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| KVPoll.WaitingForInput, | ||
| KVPoll.Transferring, |
There was a problem hiding this comment.
Reqs with these two states might not be in the rids_to_check.
There was a problem hiding this comment.
Right, rids with WaitingForInput/Transferring states wouldn't be in rids_to_check since the previous stage only sends terminal-state rids. Simplified the check to just (KVPoll.Success, KVPoll.Failed). Thanks!
| logger.warning( | ||
| "PP rank %d: unexpected poll state %s for rid %s " | ||
| "from consensus; treating as undone", | ||
| self.tp_rank, |
There was a problem hiding this comment.
Why tp_rank here? I assume we want to log about pp rank?
There was a problem hiding this comment.
Good catch — fixed to self.pp_rank.
There was a problem hiding this comment.
No need to add a test. It is just a simple status bugfix for pp consensus.
- Remove WaitingForInput/Transferring from consensus poll check since rids in those states won't appear in rids_to_check - Use pp_rank instead of tp_rank in the warning log message - Remove test file per reviewer feedback
| logger.warning( | ||
| "PP rank %d: unexpected poll state %s for rid %s " | ||
| "from consensus; treating as undone", | ||
| self.pp_rank, | ||
| poll, | ||
| req.rid, | ||
| ) |
There was a problem hiding this comment.
Could we use f-string here? Make sure the style is consistent with the other logging in this file, it can improve readability.
There was a problem hiding this comment.
Done — switched to f-strings in both places.
| logger.warning( | ||
| "Unexpected polling state %s for rid %s in inflight queue; " | ||
| "treating as undone", | ||
| poll, | ||
| req.rid, | ||
| ) |
|
/rerun-stage stage-c-test-8-gpu-h20 |
|
✅ Triggered |

Motivation
Fixes #20485
In pipeline-parallel (PP) disaggregated prefill, the inflight queue poll can encounter transient states due to cross-rank clock skew. The hard
assertandassert Falsecrash the scheduler instead of handling these gracefully.Changes
assert poll == Success or poll == Failedwith a check that treats unexpected/transient states as undone (re-queued for next poll cycle)assert Falsefallback with a warning log + re-queue, preventing scheduler crashesTesting
test_disagg_prefill_inflight_poll.pywith unit tests covering transient state handling and the unexpected-state fallback path