TPU support with GKE nodepool module and TPU v4 2x2x2 example blueprint #3817

Merged
SwarnaBharathiMantena merged 1 commit into GoogleCloudPlatform:develop from SwarnaBharathiMantena:swarnabm/tpu_v4
Apr 4, 2025

Conversation

SwarnaBharathiMantena commented Mar 19, 2025

Contributor

Updates

  • Support TPU in Cluster Toolkit GKE by adding num_slices and tpu_topology
  • TPU v4 2x2x2 example blueprint with Cluster Toolkit GKE
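
As a rough illustration of how the two settings relate, the sketch below works out the chip count implied by a topology string. It assumes that tpu_topology is a string of dimensions joined by "x" (as in the 2x2x2 example) and that num_slices is the number of identical slices to create. The helper names are hypothetical and are not part of the module.

```python
from math import prod


def chips_in_topology(tpu_topology: str) -> int:
    """Return the chip count implied by a topology string such as "2x2x2"."""
    # Assumption: dimensions are positive integers separated by "x".
    dims = [int(d) for d in tpu_topology.lower().split("x")]
    if any(d <= 0 for d in dims):
        raise ValueError(f"invalid topology: {tpu_topology!r}")
    return prod(dims)


def total_chips(tpu_topology: str, num_slices: int = 1) -> int:
    """Return the chip count across all slices.

    Assumption: every slice uses the same topology.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    return chips_in_topology(tpu_topology) * num_slices


print(chips_in_topology("2x2x2"))          # 8
print(total_chips("2x2x2", num_slices=2))  # 16
```

Under these assumptions, the 2x2x2 topology in the example blueprint corresponds to 8 chips per slice.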

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

Comment thread modules/compute/gke-node-pool/main.tf
Comment thread modules/compute/gke-node-pool/outputs.tf
Comment thread community/examples/gke-tpu-v4/README.md Outdated
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4-deployment.yaml Outdated
Comment thread examples/hypercompute_clusters/a3u-gke-gcs/a3u-gke-gcs.yaml Outdated
SwarnaBharathiMantena added labels on Mar 19, 2025:
  • release-key-new-features: Added to release notes under the "Key New Features" heading.
  • release-module-improvements: Added to release notes under the "Module Improvements" heading.
SwarnaBharathiMantena commented Mar 20, 2025

Contributor Author

Leaving this comment as a note to update the Kueue config and blueprints for some examples (a3u, a4h).

These changes are part of a separate PR: #3826

Comment thread examples/hypercompute_clusters/a3u-gke-gcs/a3u-gke-gcs.yaml Outdated
Comment thread modules/compute/gke-node-pool/main.tf
Comment thread modules/compute/gke-node-pool/outputs.tf Outdated
Comment thread modules/compute/gke-node-pool/outputs.tf Outdated
Comment thread modules/compute/gke-node-pool/variables.tf
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4-deployment.yaml Outdated
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4-deployment.yaml Outdated
Comment thread community/examples/gke-tpu-v4/README.md Outdated
Comment thread modules/compute/gke-node-pool/outputs.tf Outdated
Comment thread modules/compute/gke-node-pool/outputs.tf Outdated
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4.yaml Outdated
Comment thread modules/compute/gke-job-template/main.tf
SwarnaBharathiMantena changed the title from "TPU v4 example blueprint, and associated changes to modules and examples" to "GKE TPU v4 example blueprint, and associated changes to modules and examples" Mar 20, 2025
SwarnaBharathiMantena changed the title from "GKE TPU v4 example blueprint, and associated changes to modules and examples" to "Support creation of multiple GKE nodepools, and add TPU v4 example" Mar 20, 2025
Comment thread modules/compute/gke-node-pool/variables.tf
SwarnaBharathiMantena changed the title from "Support creation of multiple GKE nodepools, and add TPU v4 example" to "TPU support and TPU v4 example with Cluster Toolkit GKE" Mar 24, 2025
SwarnaBharathiMantena force-pushed the swarnabm/tpu_v4 branch 2 times, most recently from 701f03e to b9a8214 March 26, 2025 09:59
Comment thread modules/compute/gke-node-pool/variables.tf
Comment thread modules/compute/gke-node-pool/reservation_definitions.tf Outdated
Comment thread modules/compute/gke-node-pool/outputs.tf Outdated
Comment thread modules/compute/gke-node-pool/main.tf
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4.yaml Outdated

ighosh98 left a comment

Contributor

Would you work on documentation changes in a follow-up PR?

Comment thread community/examples/gke-tpu-v4/gke-tpu-v4-deployment.yaml Outdated
ighosh98 previously approved these changes Apr 4, 2025
Comment thread modules/compute/gke-node-pool/guest_cpus.tf
Comment thread modules/compute/gke-node-pool/main.tf Outdated
Comment thread modules/compute/gke-node-pool/reservation_definitions.tf Outdated
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4.yaml Outdated

cboneti left a comment

Member

I left a couple of comments, thanks!

Comment thread modules/compute/gke-node-pool/variables.tf Outdated
Comment thread modules/compute/gke-node-pool/main.tf Outdated
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4-deployment.yaml
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4.yaml Outdated
Comment thread community/examples/gke-tpu-v4/gke-tpu-v4.yaml
SwarnaBharathiMantena force-pushed the swarnabm/tpu_v4 branch 3 times, most recently from 7b3a221 to 615e4fa April 4, 2025 20:22
SwarnaBharathiMantena commented Apr 4, 2025

Contributor Author

Follow-up tasks / PRs related to TPUs will be:

  1. Update the blueprint to run the workload with gke-job-template.
  2. Collect info from GKE teams on the maintenance exclusions and update as required.

SwarnaBharathiMantena changed the title from "TPU support and TPU v4 example with Cluster Toolkit GKE" to "TPU support with GKE nodepool module and TPU v4 2x2x2 example blueprint" Apr 4, 2025
SwarnaBharathiMantena merged commit d50eb30 into GoogleCloudPlatform:develop Apr 4, 2025
SwarnaBharathiMantena deleted the swarnabm/tpu_v4 branch April 8, 2025 11:36
3 participants