[Disagg] Fix health check false-positive in disagg is_fully_idle #20756

Merged
ShangmingCai merged 1 commit into main from fix/health-check-disagg-idle
Mar 17, 2026

hnyls2002 (Collaborator) commented Mar 17, 2026

  • Move disagg queue checks (bootstrap_queue, prealloc_queue, transfer_queue) from the health-check idle path to the true-idle-only path
  • These queues may have items without any request actively running on GPU (stuck handshake, full KV cache, stalled transfer), so they cannot piggyback health check results through process_output
  • running_batch + waiting_queue remain as the health-check idle criteria — they guarantee active GPU processing
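A minimal sketch of the two idle paths described in the bullets above, assuming a simplified scheduler whose queues are plain Python lists. The `ToyScheduler` class is hypothetical and exists only for illustration. The attribute names, `is_fully_idle`, and the `for_health_check` flag are taken from this PR, but the real method in `python/sglang/srt/managers/scheduler.py` may differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the real scheduler; queues are plain lists."""

    running_batch: list = field(default_factory=list)
    waiting_queue: list = field(default_factory=list)
    disagg_prefill_bootstrap_queue: list = field(default_factory=list)
    disagg_decode_prealloc_queue: list = field(default_factory=list)
    disagg_decode_transfer_queue: list = field(default_factory=list)

    def is_fully_idle(self, for_health_check: bool = False) -> bool:
        # Health-check idle criteria: per the PR description, work in these
        # two places guarantees active GPU processing.
        idle = not self.running_batch and not self.waiting_queue
        if for_health_check:
            return idle
        # True-idle-only criteria: disagg queues may hold items while nothing
        # runs on the GPU (stuck handshake, full KV cache, stalled transfer).
        disagg_busy = bool(
            self.disagg_prefill_bootstrap_queue
            or self.disagg_decode_prealloc_queue
            or self.disagg_decode_transfer_queue
        )
        return idle and not disagg_busy


# A stalled transfer with nothing running on the GPU.
stuck = ToyScheduler(disagg_decode_transfer_queue=["req-0"])
print(stuck.is_fully_idle(for_health_check=True))  # True
print(stuck.is_fully_idle())                       # False
```

In this sketch a stalled transfer counts as idle for health-check purposes but still keeps the scheduler from reporting itself as truly idle.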

Fixes #20252. Related: #20296, #20191

Move disagg queue checks (bootstrap/prealloc/transfer) from the
health-check idle path to the true-idle-only path. These queues
may have items without any request running on GPU, so they cannot
piggyback health check results.

Related: #20296, #20191
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the system's health check could report a false-positive 'busy' status in disaggregated setups. By carefully separating the criteria for a true idle state from the criteria used for health checks, the change ensures that the system's health is accurately reflected, preventing unnecessary resource allocation or service disruptions when the GPU is not actively processing requests.

Highlights

  • Refined is_fully_idle Logic: The logic within the is_fully_idle function has been updated to prevent false-positives in health checks for disaggregated environments.
  • Disaggregation Queue Checks Moved: Checks for disaggregation-specific queues (bootstrap, preallocation, and transfer queues) have been moved from the general health-check idle path to only apply when determining if the system is truly idle, not just for health checks.
  • Health Check Criteria Clarified: The health check idle criteria now exclusively rely on running_batch and waiting_queue to guarantee active GPU processing, as other disaggregation queues might contain items without active GPU work.

Changelog
  • python/sglang/srt/managers/scheduler.py
    • Added comments to the is_fully_idle method explaining the rationale behind the health check logic and the distinction between true idle and health-check idle states.
    • Modified the is_fully_idle method to conditionally include disaggregation queue checks (disagg_prefill_bootstrap_queue, disagg_decode_prealloc_queue, disagg_decode_transfer_queue) only when for_health_check is False.
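To make the changelog entry concrete, the sketch below contrasts the health-check idle decision before and after this change. It assumes the pre-fix logic counted the disagg queues on the health-check path as well, which is what "moved" implies. `QueueState`, `health_idle_before`, and `health_idle_after` are hypothetical names used only for illustration; they are not functions in sglang.

```python
from dataclasses import dataclass, field


@dataclass
class QueueState:
    """Hypothetical snapshot of the scheduler queues (plain lists)."""

    running_batch: list = field(default_factory=list)
    waiting_queue: list = field(default_factory=list)
    disagg_prefill_bootstrap_queue: list = field(default_factory=list)
    disagg_decode_prealloc_queue: list = field(default_factory=list)
    disagg_decode_transfer_queue: list = field(default_factory=list)


def health_idle_before(s: QueueState) -> bool:
    # Assumed pre-fix behaviour: any queued disagg item made the scheduler
    # look busy on the health-check path.
    return not (
        s.running_batch
        or s.waiting_queue
        or s.disagg_prefill_bootstrap_queue
        or s.disagg_decode_prealloc_queue
        or s.disagg_decode_transfer_queue
    )


def health_idle_after(s: QueueState) -> bool:
    # Post-fix behaviour per this PR: only running_batch and waiting_queue count.
    return not (s.running_batch or s.waiting_queue)


# A bootstrap handshake that never completes, with nothing on the GPU.
stalled = QueueState(disagg_prefill_bootstrap_queue=["handshake-pending"])
print(health_idle_before(stalled))  # False
print(health_idle_after(stalled))   # True
```

Under the assumed old logic the stalled handshake makes the scheduler look busy, so a health check would wait on output that never arrives. With the new criteria the scheduler is treated as idle for health-check purposes.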

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request fixes a false-positive health check in disaggregated mode by refining the is_fully_idle logic. The change correctly moves the checks for disaggregation-specific queues (bootstrap_queue, prealloc_queue, transfer_queue) to be executed only when it's not a health check, as these queues can contain items without active GPU processing. The implementation is clear, and the added comment provides good context for the change. The pull request looks good and addresses the described issue effectively.

@ShangmingCai (Collaborator)

/tag-and-rerun-ci

ShangmingCai (Collaborator) left a comment

Logic LGTM. Real-world robustness tests also passed.

ShangmingCai merged commit 5270a06 into main on Mar 17, 2026
112 of 155 checks passed
ShangmingCai deleted the fix/health-check-disagg-idle branch on March 17, 2026 09:18
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
hnyls2002 mentioned this pull request Apr 28, 2026

Development

Successfully merging this pull request may close these issues.

[Bug] Large scale PD Disagression bug : cascading failure in Decode/prefill servers when corresponding Prefill/decode servers go offline