Integrate CHS to GKE and Slurm A3U and A4 Daily Tests#5335
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces the Cluster Health Scanner (CHS) into the existing daily integration test framework. The primary goal is to enhance the robustness of GKE cluster validations by adding automated checks for GPU diagnostics and network performance. This involves a new Ansible playbook for CHS setup and execution, along with necessary updates to Cloud Build configurations to orchestrate these new health checks, ensuring the stability and performance of the tested environments. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces integration tests for the Cluster Health Scanner (CHS). The changes involve adding a new Ansible playbook to set up and run CHS, and updating several Cloud Build configurations to trigger these tests. My review focuses on improving the new test playbook for better security, reproducibility, and maintainability by addressing hardcoded values and insecure practices. I've also identified some inconsistencies in the Cloud Build configurations and a potentially problematic commented-out step that should be addressed. Several comments were enhanced with references to existing repository rules to ensure consistency and best practices.
6717e2e to
ceca704
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request integrates the Cluster Health Scanner (CHS) into the daily integration tests for GKE and Slurm clusters, which is a valuable addition for automated diagnostics. The implementation introduces a new Ansible playbook for CHS and updates the Cloud Build configurations accordingly.
My review has identified a few issues that need attention:
- A critical merge conflict in
tools/cloud-build/daily-tests/tests/ml-a3-ultragpu-onspot-slurm.ymlthat must be resolved. - Several build configuration files contain a hardcoded project ID, which should be replaced with the
${PROJECT_ID}variable for better portability. - The new Ansible playbook
test-chs.ymlhas some areas for improvement regarding security (insecure directory permissions), maintainability (brittle git refspec), and correctness (a mismatch in the NCCL performance threshold compared to the PR description).
Please address these points to ensure the stability, security, and maintainability of the new testing infrastructure.
…ed in following commit)
sarthakag
left a comment
There was a problem hiding this comment.
Please squash and merge. Thanks!
a7b8c8d
into
GoogleCloudPlatform:develop
This pull request integrates CHS to the existing daily integration test framework for GKE A3 Ultra and A4. The goal is to provide automated GPU diagnostics and network performance validation as part of the post-deployment test suite.
Key Features & Changes:
test-chs.yml): Implements a full setup and test cycle, including:CHS_REPOrepository URL via Secret Manager and pass it to the integration test playbooks.