Skip to content

Fix AssertionError crash in disagg prefill inflight queue with PP#20686

Merged
ShangmingCai merged 4 commits intosgl-project:mainfrom
dubin555:oss-scout/verify-20485-pop-bootstrapped-assert
Mar 16, 2026
Merged

Fix AssertionError crash in disagg prefill inflight queue with PP#20686
ShangmingCai merged 4 commits intosgl-project:mainfrom
dubin555:oss-scout/verify-20485-pop-bootstrapped-assert

Conversation

@dubin555
Copy link
Copy Markdown
Contributor

Motivation

Fixes #20485

In pipeline-parallel (PP) disaggregated prefill, the inflight queue poll can encounter transient states due to cross-rank clock skew. The hard assert and assert False crash the scheduler instead of handling these gracefully.

Changes

  • Replace the assert poll == Success or poll == Failed with a check that treats unexpected/transient states as undone (re-queued for next poll cycle)
  • Replace assert False fallback with a warning log + re-queue, preventing scheduler crashes
  • Both changes add warning logs for observability

Testing

  • Added test_disagg_prefill_inflight_poll.py with unit tests covering transient state handling and the unexpected-state fallback path

In PD disaggregation + pipeline parallelism (PP) mode, the previous
PP rank may reach a terminal poll state (Success/Failed) and propagate
the rid via consensus before the current rank's local poll catches up.
The local poll can still be in a transient state (Transferring,
WaitingForInput, or even Bootstrapping) due to clock skew or
propagation delay.

The hard assert at line 572 crashed the scheduler when the local poll
was not Success or Failed:

    assert poll == KVPoll.Success or poll == KVPoll.Failed

Replace both asserts in process_disagg_prefill_inflight_queue with
defensive handling that logs a warning and treats the request as
undone. The request stays in the inflight queue and will be retried
in the next iteration, giving the local transfer time to complete.

Fixes sgl-project#20485
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical stability issue in the pipeline-parallel (PP) disaggregated prefill mechanism. Previously, transient states caused by clock skew between ranks could lead to AssertionError crashes in the scheduler. The changes introduce more robust error handling by gracefully re-queuing requests in unexpected or transient states and logging warnings, significantly improving the system's resilience and preventing service disruptions.

Highlights

  • Robust Error Handling: Replaced a hard assert in the inflight queue poll logic with a check that treats unexpected/transient states (due to cross-rank clock skew in PP mode) as undone, re-queuing them for the next poll cycle instead of crashing.
  • Graceful Fallback: Replaced a assert False fallback with a warning log and re-queue mechanism, preventing scheduler crashes when encountering unexpected polling states.
  • Improved Observability: Added warning logs to both modified sections to provide better visibility into transient and unexpected states, aiding in debugging and monitoring.
  • New Unit Tests: Introduced test_disagg_prefill_inflight_poll.py with unit tests specifically designed to cover the new defensive handling of transient states and the unexpected-state fallback path.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • python/sglang/srt/disaggregation/prefill.py
    • Modified the process_disagg_prefill_inflight_queue function to replace a strict assertion with a check that allows for transient KVPoll states (like WaitingForInput, Transferring, Bootstrapping) in pipeline-parallel mode, treating them as undone and re-queuing them.
    • Replaced a assert False statement with a warning log and re-queue for any other unexpected polling states, ensuring the scheduler does not crash.
  • test/srt/test_disagg_prefill_inflight_poll.py
    • Added a new test file containing unittest classes to simulate and verify the corrected inflight_queue logic, ensuring that both old (crashing) and new (graceful) behaviors are correctly tested for various transient and terminal KVPoll states.
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses an AssertionError crash within the disaggregated prefill inflight queue when operating with pipeline parallelism. The changes replace hard assertions on poll states with more robust error handling. Now, transient or unexpected states are treated as "undone" and are re-queued, accompanied by a warning log. This prevents the scheduler from crashing due to transient conditions like cross-rank clock skew. Additionally, a new test file with comprehensive unit tests has been introduced to validate this new, more resilient logic. The implementation is correct and effectively resolves the reported issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +579 to +580
KVPoll.WaitingForInput,
KVPoll.Transferring,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reqs with these two states might not be in the rids_to_check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, rids with WaitingForInput/Transferring states wouldn't be in rids_to_check since the previous stage only sends terminal-state rids. Simplified the check to just (KVPoll.Success, KVPoll.Failed). Thanks!

logger.warning(
"PP rank %d: unexpected poll state %s for rid %s "
"from consensus; treating as undone",
self.tp_rank,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why tp_rank here? I assume we want to log about pp rank?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed to self.pp_rank.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to add a test. It is just a simple status bugfix for pp consensus.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, removed.

- Remove WaitingForInput/Transferring from consensus poll check since
  rids in those states won't appear in rids_to_check
- Use pp_rank instead of tp_rank in the warning log message
- Remove test file per reviewer feedback
Comment on lines +580 to +586
logger.warning(
"PP rank %d: unexpected poll state %s for rid %s "
"from consensus; treating as undone",
self.pp_rank,
poll,
req.rid,
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use f-string here? Make sure the style is consistent with the other logging in this file, it can improve readability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — switched to f-strings in both places.

Comment on lines +616 to +621
logger.warning(
"Unexpected polling state %s for rid %s in inflight queue; "
"treating as undone",
poll,
req.rid,
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed as well.

Copy link
Copy Markdown
Collaborator

@ShangmingCai ShangmingCai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Others LGTM.

@ShangmingCai
Copy link
Copy Markdown
Collaborator

/rerun-stage stage-c-test-8-gpu-h20

@github-actions
Copy link
Copy Markdown
Contributor

✅ Triggered stage-c-test-8-gpu-h20 to run independently (skipping dependencies).

@github-actions
Copy link
Copy Markdown
Contributor

🔗 View workflow run

@ShangmingCai
Copy link
Copy Markdown
Collaborator

image CI has passed.

@ShangmingCai ShangmingCai merged commit d3c0f43 into sgl-project:main Mar 16, 2026
58 of 66 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Mooncake Disaggregation + PP: Prefill Bootstrap Timeout causes AssertionError crash in pop_bootstrapped due to Decode KV Cache Saturation

2 participants