Clean up custom spot VM variables during standard fallback #5697
…aily test provisioning
Summary of Changes
This pull request improves the reliability of the integration test suite by ensuring that when the system falls back from Spot to on-demand VM provisioning, all relevant configuration variables are properly sanitized. By removing these stale flags, the deployment process avoids requesting unavailable Spot capacity, preventing resource exhaustion errors during standard fallback scenarios.
Code Review
This pull request updates several daily test build configurations to ensure machine-specific spot VM variables are removed from the variables file when spot provisioning is disabled. The reviewer identified that the sed commands used for this cleanup are fragile because they rely on exact string matching. It is recommended to use a more flexible regular expression to handle whitespace variations and a generic pattern to cover all *_enable_spot_vm variables, which would enhance the maintainability and robustness of the scripts.
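The reviewer's suggestion could look roughly like the sketch below. It is an illustration only, not the pattern used in this PR: the function name `strip_spot_vars` and the sample file contents are hypothetical, and the regex assumes each variable sits on its own `key: value` line.

```shell
# Hypothetical helper (not from the PR): delete every
# "<prefix>_enable_spot_vm: <value>" line from a variables file,
# tolerating leading indentation and spaces around the colon.
strip_spot_vars() {
  vars_file="$1"
  # Write to a temp file and move it back; this sidesteps the
  # differences between GNU and BSD `sed -i`.
  sed -E '/^[[:space:]]*[A-Za-z0-9_]+_enable_spot_vm[[:space:]]*:/d' \
    "$vars_file" > "${vars_file}.tmp" && mv "${vars_file}.tmp" "$vars_file"
}

# Example with made-up file contents:
sample="$(mktemp)"
printf '%s\n' \
  'cli_deployment_vars:' \
  '  a3_enable_spot_vm: true' \
  '  a4_enable_spot_vm : true' \
  '  region: us-central1' > "$sample"
strip_spot_vars "$sample"
cat "$sample"   # the two *_enable_spot_vm lines are gone; other lines are kept
```

Matching on the `_enable_spot_vm` suffix rather than on exact strings means a newly added machine-specific flag is covered without touching the script again.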
TPU v6e and TPU 7x also support the standard provisioning model. Please check whether we need to add fallback logic to the tests below:
…election and standard provisioning models
gke-tpu-v6e: The fallback logic is already implemented. It successfully detects a standard fallback, sets ENABLE_SPOT="false", and cleans up the spot-related fields in gke-tpu-v6e.yml.
gke-tpu-7x: Added the standard provisioning fallback and zone lookup to match v6e exactly. The script now:
…c zone selection and standard provisioning models" This reverts commit c07d0fe.
Reverted the change for TPU 7x, because for TPU 7x only GKE quota exists.
Merged commit 5f83ce2 into GoogleCloudPlatform:develop
This CL addresses a bug in the integration tests where the Standard fallback mechanism (which runs standard on-demand VMs if Spot capacity is unavailable) would fail.
Specific spot-related variables passed under cli_deployment_vars (like a3_enable_spot_vm: true) were left untouched in the variables files during standard fallback. This caused the compute nodes to still request Spot instances in zones where Spot capacity was exhausted, resulting in Resource exhausted deployment failures.
We now clean up or update all custom Spot VM variables when falling back to standard VM provisioning.
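As a rough sketch of the "update" variant of this cleanup, the snippet below flips the flags to `false` instead of deleting them. It assumes the fallback step exposes an `ENABLE_SPOT` shell variable, as mentioned in the review discussion for gke-tpu-v6e; the function name `disable_spot_vars` is hypothetical and the actual scripts in this PR may differ.

```shell
# Hypothetical helper (not from the PR): when Spot is disabled, rewrite every
# "<prefix>_enable_spot_vm: true" entry to "false" and leave other lines untouched.
disable_spot_vars() {
  vars_file="$1"
  # Assumption: ENABLE_SPOT is set to "false" by the fallback step; if it is
  # unset, Spot is treated as still enabled and the file is not modified.
  if [ "${ENABLE_SPOT:-true}" = "false" ]; then
    sed -E 's/^([[:space:]]*[A-Za-z0-9_]+_enable_spot_vm[[:space:]]*:[[:space:]]*)true([[:space:]]*)$/\1false\2/' \
      "$vars_file" > "${vars_file}.tmp" && mv "${vars_file}.tmp" "$vars_file"
  fi
}
```

Gating the rewrite on `ENABLE_SPOT` keeps the variables file unchanged on the normal Spot path, so only the standard fallback path sees the sanitized values.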