[smg][ci] fix model pool GPU cleanup and add startup reliability improvements by slin1237 · Pull Request #16745 · sgl-project/sglang

slin1237 · 2026-01-08T16:40:21Z

Fix GPU deadlock: properly release GPUs when workers fail to start
Add 5s stagger delay between worker launches to reduce resource contention
Add 30s grace period before health checks to allow model loading
Reduce health check interval from 5s to 2s for faster detection
Add stderr logging when workers fail for better debugging
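The first two items can be illustrated with a minimal sketch. This is not the code from `model_pool.py`; `GpuPool`, `launch_all`, and `start_worker` are hypothetical names chosen for illustration, and the sketch only assumes that GPUs are reserved before a worker starts and must be returned if the start fails.

```python
import time
from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Hypothetical allocator that tracks which GPU ids are free."""

    free: set = field(default_factory=set)

    def acquire(self, count):
        if count > len(self.free):
            raise RuntimeError("not enough free GPUs")
        taken = sorted(self.free)[:count]
        self.free.difference_update(taken)
        return taken

    def release(self, gpu_ids):
        self.free.update(gpu_ids)


def launch_all(pool, specs, start_worker, stagger_s=5.0, sleep=time.sleep):
    """Launch workers one at a time and return the names that started.

    specs is a list of (name, gpu_count) pairs; start_worker is a
    caller-supplied callable that raises if the worker cannot start.
    """
    started = []
    for index, (name, gpu_count) in enumerate(specs):
        if index > 0:
            sleep(stagger_s)  # stagger launches to reduce resource contention
        gpu_ids = pool.acquire(gpu_count)
        try:
            start_worker(name, gpu_ids)
        except Exception:
            # Without this release, a dead worker keeps its GPUs reserved
            # and later workers can never acquire them.
            pool.release(gpu_ids)
            continue
        started.append(name)
    return started
```

With four GPUs and three two-GPU workers, a failed first worker would otherwise starve the third one; releasing in the `except` branch lets both remaining workers start.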

Checklist

Format your code according to the "Format code with pre-commit" guide.
Add unit tests according to the "Run and add unit tests" guide.
Update documentation according to the "Write documentations" guide.
Provide accuracy and speed benchmark results according to the "Test the accuracy" and "Benchmark the speed" guides.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-08T16:40:39Z

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and reliability of the model pool's startup process and GPU resource management. It addresses potential GPU deadlocks by ensuring resources are properly released upon worker failure and improves startup stability through staggered worker launches and an initial grace period for model loading. Additionally, it refines health check mechanisms for faster issue detection and provides better debugging insights by logging worker stderr.

Highlights

GPU Cleanup: Ensures proper release of GPUs when model workers fail to start, preventing deadlocks and resource leaks.
Worker Launch Reliability: Implements a 5-second stagger delay between worker launches to mitigate resource contention during startup, improving stability.
Health Check Grace Period: Introduces a 30-second initial grace period before health checks begin, allowing sufficient time for models to load without premature failure.
Faster Health Checks: Reduces the health check interval from 5 seconds to 2 seconds for quicker detection of worker issues.
Improved Debugging: Adds comprehensive logging of stderr output when workers fail, providing better diagnostic information for troubleshooting.
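The grace period and the shorter polling interval can be sketched as a single wait loop. This is an illustrative sketch only, not the PR's implementation: `wait_until_healthy`, `is_healthy`, and `has_exited` are hypothetical names, and the 300-second overall timeout is an assumed value that does not come from the PR. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


def wait_until_healthy(
    is_healthy,
    has_exited,
    timeout_s=300.0,   # assumed overall budget, not taken from the PR
    grace_s=30.0,      # initial grace period for model loading
    interval_s=2.0,    # health check interval after the grace period
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Return True once the worker reports healthy, False on exit or timeout."""
    deadline = clock() + timeout_s
    sleep(grace_s)  # do not probe while the model is still loading
    while clock() < deadline:
        if has_exited():
            return False  # worker process died; stop waiting
        if is_healthy():
            return True
        sleep(interval_s)
    return False
```

One trade-off of this shape is that a worker which crashes during the grace period is only noticed once the grace period ends.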

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	`@gemini-code-assist`	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces several valuable improvements to the model pool's startup and cleanup logic. The fix for the GPU deadlock by ensuring GPUs are released when workers fail is critical and correctly implemented by using _evict_instance. The additions of a stagger delay, an initial grace period, and enhanced stderr logging for failed workers significantly improve the reliability and debuggability of the test infrastructure. The code is well-structured and the changes are clear. I have one suggestion to further improve the robustness of the new stderr reading logic to prevent potential hangs in the test framework.
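The suggested change itself is not shown in this conversation. As a hedged illustration of the general concern, one common way to read a failed worker's stderr without risking a hang is `Popen.communicate` with a timeout, killing the process if it has not exited. The helper name `collect_stderr` and its defaults are hypothetical, and the sketch assumes the worker was started with `stderr=subprocess.PIPE` and `text=True`.

```python
import subprocess


def collect_stderr(proc, timeout_s=5.0, max_chars=4000):
    """Return the tail of a worker's stderr without blocking indefinitely.

    Assumes proc was created with stderr=subprocess.PIPE and text=True.
    """
    try:
        _, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The process is still running, so a plain read could block forever.
        # Kill it, then drain whatever it wrote before exiting.
        proc.kill()
        _, err = proc.communicate()
    # Keep only the tail so a noisy worker does not flood the test log.
    return (err or "")[-max_chars:]
```

A bare `proc.stderr.read()` blocks until the pipe closes, which never happens if the worker is stuck rather than dead; bounding the wait avoids stalling the test run.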

slin1237 requested review from CatherineSue and key4ng as code owners January 8, 2026 16:40

github-actions Bot added the model-gateway label Jan 8, 2026

slin1237 added the run-ci label Jan 8, 2026

gemini-code-assist Bot reviewed Jan 8, 2026

Comment thread sgl-model-gateway/e2e_test/infra/model_pool.py

[smg][ci] remove unused e2e_test/utils.py

0191534

slin1237 merged commit 8a45a9c into main Jan 8, 2026
61 of 63 checks passed

slin1237 deleted the smg-ci-n/28 branch January 8, 2026 16:55

hnyls2002 pushed a commit that referenced this pull request Jan 8, 2026

[smg][ci] fix model pool GPU cleanup and add startup reliability impr…

a196471

…ovements (#16745)

slin1237 merged 2 commits into main from smg-ci-n/28