
feat: full_quota_impl #5140

Merged
kvenkatachala333 merged 14 commits into GoogleCloudPlatform:develop from kvenkatachala333:quota-clean
Mar 12, 2026

@kvenkatachala333 kvenkatachala333 commented Jan 28, 2026

Member

This PR introduces a comprehensive Quota Availability Validator to the Cluster Toolkit as an explicit, opt-in feature. While the validator enables a "fail fast" mechanism by verifying resource quotas before deployment, it is not enabled by default. Users must explicitly configure it in their blueprint to ensure the target Google Cloud project has sufficient capacity, thereby preventing mid-deployment failures due to quota exhaustion.

Explicit Validation (Opt-in Approach)

Based on feedback and to ensure the tool remains flexible, we have implemented this validator as an explicit opt-in feature. It is disabled by default.

By requiring explicit configuration, we avoid:

  1. Unexpected Latency: Deployment only incurs the 1–3 second validation time when requested.
  2. Permission Issues: Users who do not have the required IAM permissions (like compute.projects.get) for quota checks will not face errors unless they choose to use the validator.

How to Enable in Blueprints

To use the quota validator, you must explicitly add the following block to your blueprint YAML under the validators section:

Example: Adding Explicit Quota Validation to your Blueprint

validators:
  - validator: test_quota_availability
    inputs:
      project_id: $(vars.project_id)
      region: $(vars.region)

Resource Coverage

The validator calculates the sum of requirements across all modules in the blueprint and compares them against real-time data from the Compute Engine Quota API.
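The sum-then-compare step described above can be sketched roughly as follows. This is a minimal illustration, not the code from pkg/validators/quota.go: the `Requirement` and `Quota` types and both function names are hypothetical, and the sketch assumes that available quota is simply limit minus current usage.

```go
package main

import (
	"fmt"
	"sort"
)

// Requirement is a hypothetical record of how much of one quota metric a module needs.
type Requirement struct {
	Metric string // quota metric name, e.g. "C3_CPUS"
	Amount float64
}

// Quota holds a limit/usage pair for one metric (assumed shape for this sketch).
type Quota struct {
	Limit float64
	Usage float64
}

// sumRequirements totals the requested amount per metric across all modules.
func sumRequirements(modules [][]Requirement) map[string]float64 {
	total := make(map[string]float64)
	for _, reqs := range modules {
		for _, r := range reqs {
			total[r.Metric] += r.Amount
		}
	}
	return total
}

// findShortfalls returns one sorted message per metric whose total request
// exceeds the remaining quota (limit minus usage).
func findShortfalls(required map[string]float64, quotas map[string]Quota) []string {
	var out []string
	for metric, need := range required {
		q, ok := quotas[metric]
		if !ok {
			continue // metric not reported; this sketch skips it rather than failing
		}
		if avail := q.Limit - q.Usage; need > avail {
			out = append(out, fmt.Sprintf("%s: need %.0f, available %.0f", metric, need, avail))
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Two hypothetical modules that both request C3 vCPUs.
	modules := [][]Requirement{
		{{Metric: "C3_CPUS", Amount: 88}, {Metric: "SSD_TOTAL_GB", Amount: 500}},
		{{Metric: "C3_CPUS", Amount: 176}},
	}
	quotas := map[string]Quota{
		"C3_CPUS":      {Limit: 200, Usage: 16},
		"SSD_TOTAL_GB": {Limit: 4096, Usage: 1024},
	}
	for _, msg := range findShortfalls(sumRequirements(modules), quotas) {
		fmt.Println(msg)
	}
}
```

Summing before comparing matters because two modules can each fit within the remaining quota on their own while their combined request does not.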

  • Compute (CPUs): Supports family-specific metrics (e.g., C3_CPUS, C4_CPUS, H100_CPUS) and handles preemptible/spot resource prefixes.
  • GPUs: Maps accelerator types (A100, H100, L4, etc.) to specific regional and global metrics (including GPUS_ALL_REGIONS).
  • Storage: Validates capacity for PD-Standard, SSD, Balanced, Extreme, and Hyperdisk Balanced (including IOPS and Throughput).
  • Specialty Services: Includes Filestore capacity and TPU (v2, v3) core requirements.
  • Network: Validates NETWORKS and SUBNETWORKS global quotas.
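As a rough illustration of the family-specific CPU mapping listed above, the sketch below derives a metric name and a vCPU count from a machine type string. Everything here is an assumption made for illustration: the function names are hypothetical, the rule "upper-cased family plus `_CPUS`" is inferred only from the example metric names in this description, the `PREEMPTIBLE_` prefix placement is a guess at what "preemptible/spot resource prefixes" means, and reading the vCPU count from the trailing number only works for machine types that end in one (such as `c3-standard-88`).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuMetric derives a CPU quota metric name from a machine type.
// Hypothetical rule: family prefix, upper-cased, plus "_CPUS";
// spot/preemptible requests get an assumed "PREEMPTIBLE_" prefix.
func cpuMetric(machineType string, spot bool) string {
	family, _, _ := strings.Cut(machineType, "-")
	metric := strings.ToUpper(family) + "_CPUS"
	if spot {
		metric = "PREEMPTIBLE_" + metric
	}
	return metric
}

// vcpuCount reads the trailing number of a machine type as its vCPU count.
// It returns an error for names without a numeric suffix (e.g. "e2-micro").
func vcpuCount(machineType string) (int, error) {
	idx := strings.LastIndex(machineType, "-")
	if idx < 0 {
		return 0, fmt.Errorf("unexpected machine type %q", machineType)
	}
	n, err := strconv.Atoi(machineType[idx+1:])
	if err != nil {
		return 0, fmt.Errorf("machine type %q has no numeric vCPU suffix: %w", machineType, err)
	}
	return n, nil
}

func main() {
	for _, mt := range []string{"c3-standard-88", "c4-standard-8", "e2-micro"} {
		n, err := vcpuCount(mt)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Printf("%s needs %d of %s\n", mt, n, cpuMetric(mt, false))
	}
}
```

A real implementation would more likely use a lookup table or the machine-type API than string parsing, since not every machine type encodes its vCPU count in its name.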

Resiliency Features

  1. Exponential Backoff: Retries API calls (up to 5 times) in case of rate limits (429) or transient 5xx errors.
  2. Caching: Caches project and region information to minimize redundant API calls across multiple modules.
  3. Comprehensive Testing: Includes a full test suite (pkg/validators/quota_test.go) utilizing a mock GCP client.
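The retry behaviour in item 1 could look roughly like the sketch below. It is not the PR's own retry code: the `apiError` type, the helper names, and the doubling delay are assumptions, and it reads "up to 5 times" as five attempts in total.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// apiError is a hypothetical error type carrying an HTTP status code.
type apiError struct{ Code int }

func (e *apiError) Error() string { return fmt.Sprintf("API error: HTTP %d", e.Code) }

// isRetryable reports whether err is a rate limit (429) or a 5xx server error.
func isRetryable(err error) bool {
	var ae *apiError
	if !errors.As(err, &ae) {
		return false
	}
	return ae.Code == 429 || (ae.Code >= 500 && ae.Code <= 599)
}

// retryWithBackoff runs call up to maxAttempts times, doubling the delay
// after each retryable failure. Non-retryable errors are returned at once.
func retryWithBackoff(maxAttempts int, base time.Duration, call func() error) error {
	var err error
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = call(); err == nil || !isRetryable(err) {
			return err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated API call: rate limited twice, then succeeds.
	err := retryWithBackoff(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return &apiError{Code: 429}
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Returning immediately on non-retryable errors such as 403 keeps a missing permission from being hidden behind several seconds of pointless retries.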

@gemini-code-assist

Contributor

Summary of Changes

Hello @kvenkatachala333, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical pre-deployment validation step to the HPC Toolkit. It implements a robust system for checking Google Cloud Platform resource quotas, ensuring that the resources requested by a blueprint are available in the target project and region before any actual infrastructure provisioning begins. This significantly improves the reliability of deployments by catching quota-related issues early.

Highlights

  • New Quota Validator Implementation: Introduced a comprehensive GCP quota validator (pkg/validators/quota.go) that proactively checks resource requirements against available project and regional quotas before deployment. This helps prevent failures due to insufficient resources.
  • Integration into Deployment Workflow: The new quota validator is integrated into the doDeploy function in cmd/deploy.go, ensuring that quota checks are performed early in the deployment process to 'fail fast' if issues are detected.
  • Resource Requirement Collection: The validator intelligently collects resource requirements by walking through blueprint modules, accounting for various settings like machine types (CPUs, GPUs, including family-specific and preemptible types), disk sizes and types (e.g., Hyperdisk, SSD, standard), network resources, and TPU configurations.
  • Robust GCP API Interaction: Includes a retryCall mechanism for GCP API interactions to handle transient errors and rate limits, along with enhanced error handling in handleClientError to provide more specific feedback for common GCP API issues like permission denied (403) and rate limit exceeded (429).
  • Comprehensive Unit Testing: A new test file (pkg/validators/quota_test.go) has been added with a mock GCP client to thoroughly test the collectRequirements function across a wide range of module configurations and resource types.
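To make the "more specific feedback" point concrete, here is a small sketch of how HTTP status codes might be turned into actionable messages. It is an illustration only: `explainStatus` is a hypothetical helper, not the PR's handleClientError, and the message wording is invented. The compute.projects.get permission is the one named in the PR description.

```go
package main

import "fmt"

// explainStatus maps an HTTP status code from a quota lookup to a hint
// for the user. Hypothetical helper; the wording is illustrative.
func explainStatus(code int, project string) string {
	switch {
	case code == 403:
		return fmt.Sprintf("permission denied reading quotas for project %q; check that the caller holds compute.projects.get", project)
	case code == 429:
		return fmt.Sprintf("rate limit exceeded while reading quotas for project %q; retry later", project)
	case code >= 500 && code <= 599:
		return fmt.Sprintf("transient server error (HTTP %d) reading quotas for project %q", code, project)
	default:
		return fmt.Sprintf("unexpected API error (HTTP %d) for project %q", code, project)
	}
}

func main() {
	for _, code := range []int{403, 429, 503, 404} {
		fmt.Println(explainStatus(code, "my-project"))
	}
}
```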

@kvenkatachala333 kvenkatachala333 added the release-chore label ("To not include into release notes") Jan 28, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive quota validator, which is a great feature for preventing deployment failures due to insufficient quota. The implementation is thorough, covering a wide range of Google Cloud resources. I appreciate the robust design, including the use of API call retries with exponential backoff and caching to improve performance.

I've identified a few areas for improvement, mainly related to code robustness and maintainability. My comments focus on handling potential errors more consistently, improving efficiency by avoiding repeated computations, and increasing code clarity by removing magic numbers and redundant logic. Overall, this is a solid contribution.

Comment thread pkg/validators/quota.go Outdated
Comment thread pkg/validators/quota.go Outdated
Comment thread pkg/validators/quota.go Outdated
Comment thread pkg/validators/quota.go Outdated
@kvenkatachala333

Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a comprehensive quota validator, which is a great feature for preventing deployment failures due to insufficient quotas. The implementation is thorough, covering a wide range of Google Cloud resources. The code is well-structured, and the inclusion of unit tests is excellent. I've identified a few areas for improvement, mainly around code clarity, robustness, and removing unused code. These are detailed in the specific comments. Overall, this is a solid contribution.

Comment thread pkg/validators/quota.go
Comment thread pkg/validators/quota.go
Comment thread pkg/validators/quota.go
Comment thread pkg/validators/quota.go
Comment thread pkg/validators/quota.go Outdated
Comment thread pkg/validators/quota_test.go
LAVEEN previously approved these changes Mar 2, 2026

@LAVEEN LAVEEN left a comment

Collaborator

LGTM

LAVEEN previously approved these changes Mar 11, 2026
Comment thread pkg/validators/quota.go Outdated
@kvenkatachala333 kvenkatachala333 marked this pull request as ready for review March 12, 2026 05:18
@kvenkatachala333 kvenkatachala333 requested review from a team and samskillman as code owners March 12, 2026 05:18
@kvenkatachala333 kvenkatachala333 merged commit 38fc190 into GoogleCloudPlatform:develop Mar 12, 2026
16 of 79 checks passed
Neelabh94 added a commit that referenced this pull request Mar 12, 2026
scaliby pushed a commit to scaliby/cluster-toolkit that referenced this pull request Mar 17, 2026

3 participants