feat: Add explicit opt-in Quota Availability Validator #5422

Merged
kvenkatachala333 merged 2 commits into GoogleCloudPlatform:develop from kvenkatachala333:quota_imp on Apr 23, 2026

@kvenkatachala333

Member

This PR introduces a Quota Availability Validator to the Cluster Toolkit as an explicit, opt-in feature. It enables a "fail fast" mechanism by verifying resource capacity before deployment, without causing regressions or unexpected latency for existing blueprints.

Key Features

  • Explicit Opt-in Design: Disabled by default. Only executes when explicitly listed in your blueprint YAML.
  • Zero Regression for Existing Blueprints: No unexpected latency (1–3s) or permission blockers (e.g., compute.projects.get) for users who don't need real-time quota checks.
  • Real-Time API Capability: Integrates with the Compute Engine Quotas API to aggregate and check module requirements.
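The "aggregate and check" step can be pictured with a minimal Go sketch. This is not the code from pkg/validators/quota.go: the `Requirement` and `Quota` types and the `aggregate` and `checkQuotas` helpers are hypothetical names, and the limits and usage figures are made up. It only illustrates summing per-metric requirements across modules and comparing each total with what remains under the quota.

```go
package main

import (
	"fmt"
	"sort"
)

// Requirement is a hypothetical record of how much of one quota metric
// a single blueprint module needs.
type Requirement struct {
	Metric string // quota metric name, e.g. "C3_CPUS"
	Amount float64
}

// Quota is a hypothetical view of one quota: its limit and current usage.
type Quota struct {
	Limit float64
	Usage float64
}

// aggregate sums the requirements of all modules per metric.
func aggregate(reqs []Requirement) map[string]float64 {
	total := make(map[string]float64)
	for _, r := range reqs {
		total[r.Metric] += r.Amount
	}
	return total
}

// checkQuotas returns one message for every metric whose aggregated
// requirement does not fit into the remaining quota (limit minus usage).
func checkQuotas(needed map[string]float64, quotas map[string]Quota) []string {
	var problems []string
	for metric, want := range needed {
		q, ok := quotas[metric]
		if !ok {
			problems = append(problems, fmt.Sprintf("%s: no quota information found", metric))
			continue
		}
		if avail := q.Limit - q.Usage; want > avail {
			problems = append(problems,
				fmt.Sprintf("%s: need %.0f, only %.0f available", metric, want, avail))
		}
	}
	sort.Strings(problems) // map iteration order is random; sort for stable output
	return problems
}

func main() {
	// Two modules both request C3 CPUs; one requests SSD capacity.
	reqs := []Requirement{
		{"C3_CPUS", 176},
		{"C3_CPUS", 88},
		{"SSD_TOTAL_GB", 500},
	}
	// Illustrative numbers only, not real project data.
	quotas := map[string]Quota{
		"C3_CPUS":      {Limit: 200, Usage: 24},
		"SSD_TOTAL_GB": {Limit: 4096, Usage: 100},
	}
	for _, p := range checkQuotas(aggregate(reqs), quotas) {
		fmt.Println(p)
	}
}
```

Aggregating before checking matters because two modules can each fit under a quota individually while their combined request does not.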

How to Enable in Blueprint

validators:
- validator: test_quota_availability
  inputs:
    project_id: $(vars.project_id)
    region: $(vars.region) 

Resource Coverage

  • Compute (CPUs): Family-specific metrics (e.g., C3_CPUS, H100_CPUS).
  • GPUs: Maps accelerator types (A100, H100, L4, etc.) to regional/global metrics (including GPUS_ALL_REGIONS).
  • Storage: PD-Standard, SSD, Balanced, Extreme, and Hyperdisk Balanced (including IOPS/Throughput).
  • Specialty services: Filestore capacity and TPU core requirements.
  • Networks: NETWORKS and SUBNETWORKS global quotas.

Resiliency & Performance

  • Exponential Backoff: Up to 5 retries for rate limits (429) or transient errors.
  • Regional Caching: In-memory caching for projects and regions to minimize duplicate API calls.
  • Unit Tests: Dedicated tests in pkg/validators/quota_test.go with mock GCP clients.
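For readers unfamiliar with the pattern, here is a rough Go sketch of retrying with exponential backoff combined with a per-project, per-region in-memory cache. It is an illustration under assumptions, not the PR's GCPQuotaClient: `retryWithBackoff`, `regionCache`, and `errRateLimited` are hypothetical names, the base delay is arbitrary, and the fetch function fakes the API call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errRateLimited stands in for an HTTP 429 or another transient API error.
var errRateLimited = errors.New("rate limited (429)")

// retryWithBackoff calls fn until it succeeds, fails with a non-transient
// error, or maxRetries retries are used up. The delay doubles after each
// failed attempt.
func retryWithBackoff(maxRetries int, base time.Duration, fn func() error) error {
	delay := base
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if !errors.Is(err, errRateLimited) {
			return err // not transient: retrying will not help
		}
		if attempt < maxRetries {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

// regionCache remembers quota limits per project and region so that repeated
// lookups do not trigger another API call.
type regionCache struct {
	mu   sync.Mutex
	data map[string]map[string]float64
}

// get returns the cached limits for project/region, calling fetch on a miss.
func (c *regionCache) get(project, region string,
	fetch func() (map[string]float64, error)) (map[string]float64, error) {
	key := project + "/" + region
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.data[key]; ok {
		return v, nil
	}
	v, err := fetch()
	if err != nil {
		return nil, err
	}
	if c.data == nil {
		c.data = make(map[string]map[string]float64)
	}
	c.data[key] = v
	return v, nil
}

func main() {
	calls := 0
	cache := &regionCache{}
	// fetch fakes an API that is rate limited on its first two calls.
	fetch := func() (map[string]float64, error) {
		var limits map[string]float64
		err := retryWithBackoff(5, time.Millisecond, func() error {
			calls++
			if calls < 3 {
				return errRateLimited
			}
			limits = map[string]float64{"C3_CPUS": 200}
			return nil
		})
		return limits, err
	}
	for i := 0; i < 2; i++ {
		limits, err := cache.get("my-project", "us-central1", fetch)
		fmt.Println(limits, err)
	}
	fmt.Println("API calls:", calls) // the second lookup is served from the cache
}
```

Holding the mutex during the fetch keeps the sketch short and prevents duplicate concurrent fetches for the same key, at the cost of serializing lookups for different regions.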

@kvenkatachala333 kvenkatachala333 added the release-chore label ("To not include into release notes") Mar 29, 2026
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a new, optional validator to the Cluster Toolkit designed to proactively check resource availability against GCP quotas. By integrating directly with the Compute Engine Quotas API, it provides a 'fail-fast' mechanism that helps users identify potential deployment blockers early. The implementation is designed to be non-intrusive, ensuring that existing workflows remain unaffected unless the validator is explicitly enabled in the blueprint.

Highlights

  • New Quota Availability Validator: Introduced an opt-in 'test_quota_availability' validator that checks resource capacity against GCP quotas before deployment.
  • Resilient API Integration: Implemented a GCP client with exponential backoff for handling rate limits (429) and transient errors, along with in-memory caching for performance.
  • Comprehensive Resource Coverage: Added support for validating quotas across Compute (CPUs/GPUs), Storage (PD/Hyperdisk), Filestore, TPUs, and Network resources.
  • Testing and Tooling: Added unit tests with mock GCP clients and updated validation scripts to exclude this new validator from existing golden copy tests.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new quota availability validator for GCP resources, including support for CPUs, GPUs, Disks, Filestore, and TPUs. The implementation includes a new GCPQuotaClient with retry logic and caching, along with comprehensive unit tests. Feedback is provided regarding the optimization of the accelerator metric map by moving it to a package-level variable and improving error handling when evaluating TPU preemption settings to avoid silent failures on unknown values.
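The two review points can be illustrated with a short Go sketch. The actual code in pkg/validators/quota.go is not shown in this conversation, so everything here is an assumption: `acceleratorMetrics`, `metricForAccelerator`, and `parsePreemptible` are hypothetical names, and the map entries are a small illustrative subset. The idea is that the lookup table is built once at package level rather than on every call, and that an unrecognized preemption value produces an error instead of silently being treated as false.

```go
package main

import (
	"fmt"
	"strings"
)

// acceleratorMetrics lives at package level so it is allocated once instead
// of being rebuilt on every lookup. Entries are an illustrative subset.
var acceleratorMetrics = map[string]string{
	"nvidia-tesla-a100": "NVIDIA_A100_GPUS",
	"nvidia-h100-80gb":  "NVIDIA_H100_GPUS",
	"nvidia-l4":         "NVIDIA_L4_GPUS",
}

// metricForAccelerator maps an accelerator type to its quota metric and
// reports unknown types explicitly.
func metricForAccelerator(acceleratorType string) (string, error) {
	metric, ok := acceleratorMetrics[acceleratorType]
	if !ok {
		return "", fmt.Errorf("no quota metric known for accelerator type %q", acceleratorType)
	}
	return metric, nil
}

// parsePreemptible interprets a (hypothetical) blueprint setting that may be
// unset, a bool, or a string. Anything else is an error rather than a
// silent "false".
func parsePreemptible(v any) (bool, error) {
	switch t := v.(type) {
	case nil:
		return false, nil // unset: not preemptible
	case bool:
		return t, nil
	case string:
		switch strings.ToLower(strings.TrimSpace(t)) {
		case "true":
			return true, nil
		case "false", "":
			return false, nil
		}
	}
	return false, fmt.Errorf("unsupported preemptible value %v (%T)", v, v)
}

func main() {
	fmt.Println(metricForAccelerator("nvidia-l4"))
	fmt.Println(metricForAccelerator("made-up-gpu"))
	fmt.Println(parsePreemptible("true"))
	fmt.Println(parsePreemptible(42))
}
```

Returning an error for unknown values lets the validator report a misconfigured setting instead of checking the wrong quota.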

Comment thread pkg/validators/quota.go Outdated
Comment thread pkg/validators/quota.go Outdated
@kvenkatachala333

Member Author

gke-a3-highgpu, ml-a3-highgpu-slurm, gke-h4d: skipped because the reservations have been removed from the dev project.
slurm-gcp-v6-rocky8, slurm6-tpu, gke-a4x: skipped because no zone could be found to deploy in; the daily tests have been disabled for the same reason.

These 6 tests were skipped; all the others passed.

Comment thread pkg/validators/quota.go
Comment thread pkg/validators/quota.go

@LAVEEN LAVEEN left a comment

Collaborator

LGTM.

@kvenkatachala333 kvenkatachala333 marked this pull request as ready for review April 23, 2026 10:18
@kvenkatachala333 kvenkatachala333 requested a review from a team as a code owner April 23, 2026 10:18
@kvenkatachala333 kvenkatachala333 merged commit ee50893 into GoogleCloudPlatform:develop Apr 23, 2026
80 of 87 checks passed
@kvenkatachala333 kvenkatachala333 deleted the quota_imp branch April 23, 2026 10:21
Labels

release-chore To not include into release notes

4 participants