[smg][ci] fix model pool GPU cleanup and add startup reliability improvements #16745
…ovements

- Fix GPU deadlock: properly release GPUs when workers fail to start
- Add 5s stagger delay between worker launches to reduce resource contention
- Add 30s grace period before health checks to allow model loading
- Reduce health check interval from 5s to 2s for faster detection
- Add stderr logging when workers fail for better debugging
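The startup behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `GpuPool`, `start_workers`, and the `launch` / `is_healthy` callbacks are hypothetical names, and running the grace period and health polling per worker in sequence is a simplifying assumption. Only the 5s stagger, 30s grace period, and 2s health check interval come from the commit message; the overall timeout is invented for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class GpuPool:
    """Hypothetical tracker for which GPU ids are currently free."""
    free: Set[int]

    def acquire(self, count: int) -> List[int]:
        if len(self.free) < count:
            raise RuntimeError("not enough free GPUs")
        ids = sorted(self.free)[:count]
        self.free -= set(ids)
        return ids

    def release(self, ids: List[int]) -> None:
        self.free |= set(ids)


def start_workers(
    specs: List[Tuple[str, int]],            # (worker name, GPUs needed)
    pool: GpuPool,
    launch: Callable[[str, List[int]], None],
    is_healthy: Callable[[str], bool],
    *,
    stagger_s: float = 5.0,    # delay between launches (from the commit message)
    grace_s: float = 30.0,     # wait before the first health check (from the commit message)
    interval_s: float = 2.0,   # health check interval (from the commit message)
    timeout_s: float = 300.0,  # assumed overall limit, not stated in the PR
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, List[int]]:
    """Start workers one by one and return the GPUs held by each healthy worker."""
    started: Dict[str, List[int]] = {}
    for index, (name, gpu_count) in enumerate(specs):
        if index > 0:
            sleep(stagger_s)  # stagger launches to reduce resource contention
        gpus = pool.acquire(gpu_count)
        try:
            launch(name, gpus)
            sleep(grace_s)  # give the model time to load before polling
            waited = grace_s
            while not is_healthy(name):
                if waited >= timeout_s:
                    raise TimeoutError(f"{name} did not become healthy")
                sleep(interval_s)
                waited += interval_s
        except Exception:
            # The key point of the fix: a failed worker must hand its GPUs
            # back, otherwise later workers can never acquire them.
            pool.release(gpus)
            continue
        started[name] = gpus
    return started
```

Passing `sleep` as a parameter is only a convenience so the delays can be skipped in tests; the essential part is the `except` branch that returns the GPUs to the pool.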
Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and reliability of the model pool's startup process and GPU resource management. It addresses potential GPU deadlocks by ensuring resources are properly released upon worker failure and improves startup stability through staggered worker launches and an initial grace period for model loading. Additionally, it refines health check mechanisms for faster issue detection and provides better debugging insights by logging worker
Code Review
This pull request introduces several valuable improvements to the model pool's startup and cleanup logic. The fix for the GPU deadlock by ensuring GPUs are released when workers fail is critical and correctly implemented by using _evict_instance. The additions of a stagger delay, an initial grace period, and enhanced stderr logging for failed workers significantly improve the reliability and debuggability of the test infrastructure. The code is well-structured and the changes are clear. I have one suggestion to further improve the robustness of the new stderr reading logic to prevent potential hangs in the test framework.
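The reviewer's concrete suggestion is not included in this excerpt. As one common way to keep stderr collection from hanging, the sketch below bounds the read with a timeout. It assumes the worker was started with `subprocess.Popen(..., stderr=subprocess.PIPE, text=True)`; the helper name `read_stderr_tail` and its default limits are hypothetical.

```python
import subprocess
import sys


def read_stderr_tail(
    proc: subprocess.Popen,
    timeout_s: float = 5.0,
    max_chars: int = 2000,
) -> str:
    """Return the tail of a worker's stderr without blocking indefinitely."""
    try:
        # communicate() drains the pipe and waits for exit, bounded by the timeout.
        _, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The process is still running (or still writing): stop it, then retry once.
        proc.kill()
        try:
            _, err = proc.communicate(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Something else still holds the pipe open; give up rather than hang.
            err = ""
    return (err or "")[-max_chars:]
```

A call such as `read_stderr_tail(proc)` after a failed health check would then return at most the last 2000 characters of the worker's error output, which is usually enough to see the final traceback.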