feat: Implement dynamic machine configurations via Compute Engine API #5426

Merged: SwarnaBharathiMantena merged 32 commits into GoogleCloudPlatform:develop from SwarnaBharathiMantena:swarnabm/update_machine_info_map on Apr 14, 2026

Conversation

@SwarnaBharathiMantena SwarnaBharathiMantena commented Mar 30, 2026

Contributor

Summary

This PR modernizes machine configuration and accelerator discovery within the Cluster Toolkit by replacing hardcoded configuration maps with dynamic lookups against the Google Cloud Compute Engine API. In addition, all relevant Terraform module interfaces and Go structures have been updated to accurately reflect general machine specifications (CPUs, memory, GPUs, and TPUs).

Key Changes

Dynamic Machine Configurations via Go SDK:

  • Implemented secure, direct HTTP/REST lookups using the official Google Cloud Compute Engine Go SDK (compute.Service).
  • Replaced static files (such as accelerators.json) with dynamic lookups, so the latest GKE machine and accelerator offerings are supported natively.

High-Performance In-Memory Caching:

  • Introduced a thread-safe caching layer (sync.Map) in pkg/config/machine_configs.go to avoid redundant API calls and speed up blueprint expansion.
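
As a rough illustration of the caching idea described above, here is a minimal sketch of a sync.Map-backed lookup cache. It is not the code in pkg/config/machine_configs.go: the MachineInfo type, the fetchFunc stand-in for the Compute Engine call, the machineCache type, and the key format are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// MachineInfo is a hypothetical summary of a machine type; the structure
// used in the actual PR may differ.
type MachineInfo struct {
	GuestCPUs int64
	MemoryMB  int64
}

// fetchFunc stands in for the real Compute Engine API request (assumption).
type fetchFunc func(project, zone, machineType string) (MachineInfo, error)

// machineCache memoizes successful lookups so repeated requests for the
// same machine type do not trigger another API call.
type machineCache struct {
	entries sync.Map // key: "project/zone/machineType", value: MachineInfo
	fetch   fetchFunc
}

// Get returns the cached entry if present, otherwise fetches and stores it.
func (c *machineCache) Get(project, zone, machineType string) (MachineInfo, error) {
	key := project + "/" + zone + "/" + machineType
	if v, ok := c.entries.Load(key); ok {
		return v.(MachineInfo), nil
	}
	info, err := c.fetch(project, zone, machineType)
	if err != nil {
		// Errors are not cached, so a later retry can still succeed.
		return MachineInfo{}, err
	}
	// LoadOrStore keeps the first stored value if another goroutine got there first.
	actual, _ := c.entries.LoadOrStore(key, info)
	return actual.(MachineInfo), nil
}

func main() {
	calls := 0
	cache := &machineCache{fetch: func(project, zone, machineType string) (MachineInfo, error) {
		calls++ // counts simulated API requests
		return MachineInfo{GuestCPUs: 8, MemoryMB: 32768}, nil // placeholder values
	}}
	for i := 0; i < 3; i++ {
		if _, err := cache.Get("example-project", "us-central1-a", "example-machine-type"); err != nil {
			fmt.Println("lookup failed:", err)
			return
		}
	}
	fmt.Println("simulated API calls:", calls)
}
```

In this sketch, two goroutines that miss the cache at the same moment may both call fetch; only one result is kept. If duplicate requests matter, a per-key lock or a single-flight mechanism would be needed on top of sync.Map.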

@SwarnaBharathiMantena SwarnaBharathiMantena added the release-module-improvements label (Added to release notes under the "Module Improvements" heading.) Mar 30, 2026
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request centralizes the management of GCE machine type mappings for GPUs and TPUs. By moving these definitions into a shared JSON file, the configuration becomes easier to maintain and can be reused across different parts of the codebase, including Go logic, thereby reducing duplication and potential for configuration drift.

Highlights

  • Centralized Configuration: Introduced a new pkg/config/accelerators.json file to serve as the single source of truth for GPU and TPU machine type mappings.
  • Terraform Refactoring: Updated gpu-definition and tpu-definition Terraform modules to dynamically load accelerator configurations from the new JSON file instead of using hardcoded local maps.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Comment thread pkg/config/accelerators.json Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request centralizes GPU and TPU machine type definitions by migrating hardcoded HCL maps from the gpu-definition and tpu-definition modules into a shared JSON configuration file at pkg/config/accelerators.json. A critical issue was identified where several g4-standard machine types (6, 12, and 24) were omitted during the migration, which would result in a breaking change for users of those machine types.
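
An omission like the one flagged in this review can be caught mechanically by comparing the keys of the old map against the migrated data. The following is a minimal sketch of that check, not code from the PR: the missingKeys helper is hypothetical, the map values are placeholders rather than real accelerator counts, and the g4-standard-48 entry is included only to give the migrated map something to contain.

```go
package main

import (
	"fmt"
	"sort"
)

// missingKeys returns, in sorted order, every key of legacy that is
// absent from migrated. (Hypothetical helper for illustration.)
func missingKeys(legacy, migrated map[string]int) []string {
	var missing []string
	for k := range legacy {
		if _, ok := migrated[k]; !ok {
			missing = append(missing, k)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Illustrative data only: the values are placeholders, not real specifications.
	legacy := map[string]int{
		"g4-standard-6":  1,
		"g4-standard-12": 1,
		"g4-standard-24": 1,
		"g4-standard-48": 1,
	}
	migrated := map[string]int{
		"g4-standard-48": 1,
	}
	for _, name := range missingKeys(legacy, migrated) {
		fmt.Println("missing after migration:", name)
	}
}
```

A check of this kind fits naturally in a unit test that fails whenever the list of missing keys is non-empty.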

Comment thread pkg/config/accelerators.json Outdated
@SwarnaBharathiMantena SwarnaBharathiMantena marked this pull request as draft March 30, 2026 09:17
@SwarnaBharathiMantena

Contributor Author

/gcbrun

@SwarnaBharathiMantena SwarnaBharathiMantena changed the title from "Introduce accelerators.json as Single Source of Truth for GCE machine types" to "feat: Use dynamic gcloud commands to fetch machine info for Terraform modules" Apr 2, 2026
@SwarnaBharathiMantena

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request refactors GPU and TPU definitions by replacing hardcoded Terraform maps with a dynamic injection system that fetches machine configurations via gcloud during blueprint expansion. Feedback focuses on improving the robustness of this new mechanism, including handling gcloud errors to preserve offline functionality, using the encoding/json package for safer JSON construction, and relying on API data for TPU counts instead of fragile string parsing. Additionally, suggestions were made to fix an unused import, ensure consistent JSON schemas, and improve the reliability of the command caching logic.

Comment thread pkg/config/machine_configs.go Outdated
Comment thread pkg/config/expand.go Outdated
Comment thread pkg/config/machine_configs.go Outdated
Comment thread pkg/config/machine_configs.go Outdated
Comment thread pkg/config/machine_configs.go Outdated
Comment thread pkg/gcloud/gcloud.go Outdated
@SwarnaBharathiMantena

Contributor Author

/gcbrun

@SwarnaBharathiMantena SwarnaBharathiMantena marked this pull request as ready for review April 3, 2026 03:00
@SwarnaBharathiMantena

Contributor Author

/gcbrun

@SwarnaBharathiMantena SwarnaBharathiMantena marked this pull request as draft April 3, 2026 11:31
@SwarnaBharathiMantena SwarnaBharathiMantena marked this pull request as ready for review April 3, 2026 12:55
@SwarnaBharathiMantena SwarnaBharathiMantena force-pushed the swarnabm/update_machine_info_map branch 2 times, most recently from 9d34f09 to 8237199 April 6, 2026 10:03
@SwarnaBharathiMantena SwarnaBharathiMantena changed the title from "feat: Use dynamic gcloud commands to fetch machine info for Terraform modules" to "refactor: Implement dynamic machine configurations via Compute Engine Go SDK" Apr 7, 2026
@SwarnaBharathiMantena SwarnaBharathiMantena changed the title from "refactor: Implement dynamic machine configurations via Compute Engine Go SDK" to "feat: Implement dynamic machine configurations via Compute Engine Go SDK" Apr 7, 2026
@SwarnaBharathiMantena SwarnaBharathiMantena changed the title from "feat: Implement dynamic machine configurations via Compute Engine Go SDK" to "feat: Implement dynamic machine configurations via Compute Engine API" Apr 8, 2026
Comment thread modules/internal/gpu-definition/main.tf Outdated
Comment thread pkg/config/machine_configs.go Outdated
Comment thread pkg/config/machine_configs.go Outdated
Comment thread modules/internal/tpu-definition/main.tf Outdated
Comment thread pkg/config/machine_configs.go
Comment thread pkg/config/machine_configs.go
kadupoornima previously approved these changes Apr 10, 2026
cboneti previously approved these changes Apr 10, 2026
@SwarnaBharathiMantena SwarnaBharathiMantena force-pushed the swarnabm/update_machine_info_map branch from 1cf461f to d9e90b0 April 13, 2026 04:47
@SwarnaBharathiMantena

Contributor Author

SUCCESS PR-test-gke go/ghpc-cb/c25a922d-e52e-408e-8f93-77c08cdbe7b2
SUCCESS PR-test-gke-a2-highgpu-kueue-onspot go/ghpc-cb/b5b42a75-8775-4b61-b94b-120a32fbd151
SUCCESS PR-test-gke-a3-highgpu-onspot go/ghpc-cb/4726f2ac-17d7-4a5a-a7ad-bd6e9c8097e0
SUCCESS PR-test-gke-a3-ultragpu-onspot go/ghpc-cb/999b6199-eccd-46ba-89d2-9dae7b69014d
SUCCESS PR-test-gke-a4-onspot go/ghpc-cb/b5fc9256-ed81-4602-82a9-cfacd9a0ca6d
SUCCESS PR-test-gke-a4x go/ghpc-cb/22de52f7-9a35-4c1d-bbeb-f97768e3d513
SUCCESS PR-test-gke-g4 go/ghpc-cb/41a18691-d728-4607-b0eb-aa90f2e5663c
SUCCESS PR-test-gke-h4d-onspot go/ghpc-cb/b9c27e66-1210-44de-95cd-89e2da43f8f6
SUCCESS PR-test-gke-inactive-reservation go/ghpc-cb/c8d3275b-6ab4-419f-acda-a413ce71bf0f
SUCCESS PR-test-gke-managed-lustre go/ghpc-cb/06fd9e27-360d-45c3-9f99-d635b1522fc7
SUCCESS PR-test-gke-storage go/ghpc-cb/6411ccae-656e-49f2-ad23-38b72c9d49a6
SUCCESS PR-test-gke-tpu-7x go/ghpc-cb/ea0a6a6d-5014-47f0-8a3c-d2a08a6203a7
SUCCESS PR-test-gke-tpu-v6e go/ghpc-cb/201276b9-d48a-464e-9d88-6313827cc57d
SUCCESS PR-test-ml-gke go/ghpc-cb/07bf272c-7c77-4ff5-b0d6-c5ae44ac90d4
SUCCESS PR-test-ml-gke-e2e go/ghpc-cb/5e22eb1f-c290-40ea-b20f-d76c4910b90c
SUCCESS PR-test-slurm-gke go/ghpc-cb/74df824f-8b99-4a20-9d05-718eb6ae50c2
FAILURE[2] PR-test-gke-a3-megagpu-onspot go/ghpc-cb/bb7359b1-43cc-49f1-87c2-d7b18a8c9dba
FAILURE[2] PR-test-gke-managed-hyperdisk go/ghpc-cb/dfe70a64-bdc3-4b6c-a707-13614429fab3
------- TOTAL:18 | SUCCESS: 16 | FAILURE: 2

Comment thread pkg/config/machine_configs.go
@SwarnaBharathiMantena SwarnaBharathiMantena merged commit 8b0b811 into GoogleCloudPlatform:develop Apr 14, 2026
32 of 72 checks passed
Labels

release-module-improvements Added to release notes under the "Module Improvements" heading.

5 participants