Release candidate: v1.68.0 #4746

Merged
ankitkumar-quad merged 126 commits into main from release-candidate on Oct 10, 2025

@ankitkumar-quad

Contributor

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

nick-stroud and others added 30 commits September 17, 2025 03:31
Update nvidia DRA driver version to v25.3.0
Update the blueprints using managed_lustre vars to 36T/500MB Tier
Refactoring in gke persistent module
Updated A3-mega and A4-high Slurm blueprints to adopt nvidia add repository script.
update terraform-provider version to 7.3.0
Remove superfluous addition chs logs to cloud ops config
downloading libnccl2 and libnccl-dev for a3u and a4h
Add nvidia-imex-* to list of held packages
When changing startup script content or changing partitions, the following
error may occur:
╷
│ Error: Provider produced inconsistent final plan
│
│ When expanding the plan for module.slurm_controller.module.slurm_files.google_storage_bucket_object.nodeset_config["flexnodeset"] to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/google" produced an invalid new value for .md5hash: was known, but now unknown.
│
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

With this change, the error no longer occurs.
Fixing output for gke-storage module
Add mufaqam-gcl to cluster-toolkit-writers.json
bytetwin and others added 11 commits October 8, 2025 13:16
Updating google provider version upper bound to 7.6.0 - latest version
Update GKE cluster and firewall rules module versions to use recent google provider
…add-filestore-pvc

Add Filestore, PV, and sample job template snippets to the GKE H4D blueprint
Add NUMA-aware scheduling in GKE clusters (enabled for G4)
Migrate Kueue installation to use Helm chart
@ankitkumar-quad ankitkumar-quad requested review from a team and samskillman as code owners October 9, 2025 10:13
@ankitkumar-quad ankitkumar-quad added the release-chore To not include into release notes label Oct 9, 2025
arpit974 and others added 6 commits October 9, 2025 10:38
Updated the instance_image.family in a3ultra-vm.yaml to use ubuntu-accelerator-2204-amd64-with-nvidia-570 instead of nvidia-550 for improved compatibility and performance.
This reverts commit 4ff648b, reversing
changes made to 51bb92e.
This reverts commit 721983b, reversing
changes made to 4f8b592.
This pull request makes several updates to the GKE cluster and network modules, primarily focused on removing NUMA-aware scheduling support and aligning Terraform module and provider versions for improved compatibility. The changes simplify the configuration and ensure consistent dependency management across modules.

**Removal of NUMA-aware scheduling support:**
* Removed the `enable_numa_aware_scheduling` variable and all related configuration from the GKE cluster module, including the `kubelet_config` block and references in documentation and example files. [[1]](diffhunk://#diff-7939cd594b53ae6e59dae4629a32d7558e7c23123919d7b6e469ac18a57adddcL244-L259) [[2]](diffhunk://#diff-e54397224c9be21ab0ad72546e3d818fd2a4921bf593b2b6d7e881e6fc1d56e6L528-L533) [[3]](diffhunk://#diff-35b044e2245368feb59f14b7a63621200c0df5f4245b426552a09b8329705507L158) [[4]](diffhunk://#diff-e6090e2163c0286245ffc70056c158ea25acdeab329b5d21352fb007f80f4c73L125)

**Module and provider version alignment:**
* Updated the required versions for the `google` and `google-beta` Terraform providers from `>= 7.2` to `>= 6.16` in both the `versions.tf` and documentation files to standardize provider requirements. [[1]](diffhunk://#diff-b8e991c0f592027d61744d232494249832632ecc529153eac609f9e70444b471L21-R25) [[2]](diffhunk://#diff-35b044e2245368feb59f14b7a63621200c0df5f4245b426552a09b8329705507L106-R122)
* Changed the version constraints for the `workload_identity` module from `>= 40.0` to `~> 34.0` for compatibility, reflected in both code and documentation. [[1]](diffhunk://#diff-7939cd594b53ae6e59dae4629a32d7558e7c23123919d7b6e469ac18a57adddcL412-R396) [[2]](diffhunk://#diff-35b044e2245368feb59f14b7a63621200c0df5f4245b426552a09b8329705507L106-R122)
* Updated the version constraint for the `firewall_rule` module from `~> 12.0` to `~> 9.0` in both code and documentation for consistency. [[1]](diffhunk://#diff-bd07c7386bc0355d11578ce911bbc9a34a40f078b6f41fc0a8230d9b74eec28fL54-R54) [[2]](diffhunk://#diff-04a94d2869736107d8d67616c00f4e89cea5605aaec349c3d040b61be9cd1d0dL86-R86)
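
The provider alignment described above can be sketched as a `versions.tf` fragment. This is a minimal illustration, not the module's actual file: only the `>= 6.16` constraint for `google` and `google-beta` comes from the description, while the `source` addresses are assumed to be the standard HashiCorp registry ones.

```hcl
# Sketch of a versions.tf matching the constraints named in the description.
# The source addresses are assumptions (standard registry paths), not copied
# from the PR diff.
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 6.16" # previously ">= 7.2" per the description
    }
    google-beta = {
      source  = "hashicorp/google-beta"
      version = ">= 6.16" # previously ">= 7.2" per the description
    }
  }
}
```

Note the difference between the two constraint styles used in this change: `>= 6.16` sets only a lower bound, whereas a pessimistic constraint such as `~> 34.0` (used for `workload_identity`) or `~> 9.0` (used for `firewall_rule`) also caps the major version, allowing 34.x or 9.x releases respectively.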
Updated the instance_image.family in a3ultra-vm.yaml to use ubuntu-accelerator-2204-amd64-with-nvidia-570 instead of nvidia-550 for improved compatibility and performance.
@ankitkumar-quad ankitkumar-quad merged commit 285bbc5 into main Oct 10, 2025
61 of 72 checks passed
@ankitkumar-quad ankitkumar-quad deleted the release-candidate branch October 10, 2025 15:13