Releases: GoogleCloudPlatform/cluster-toolkit
v1.92.0
What's Changed
Key New Features 🎉
- feat: add ML Diagnostics module and integration for GKE TPU blueprints by @AdarshK15 in #5350
- NAP support on GKE Clusters (gke-cluster module) by @SwarnaBharathiMantena in #5420
- feat: optional infra setup for inference gateway by @jessicaochen in #5453
- feat(slurm): support compact placement with DWS Flex-Start for H4D, A3Ultra and A4 by @parulbajaj01 in #5579
Breaking Changes 🚨
- Transitioning to Slurm Native Auth with resilient workbench keys distribution by @arpit974 in #5695
- default to sauth for newer deployments in h4d and a3mega-gcsfuse blueprints by @arpit974 in #5707
New Modules 🧱
- adding new dns-managed-zone module. by @arpit974 in #5485
- adding new global static ip module. by @arpit974 in #5559
- adding new module for kubernetes namespace. by @arpit974 in #5562
- adding new iap-policy module. by @arpit974 in #5564
- Adding new cloud run module. by @arpit974 in #5567
- adding new redis module. by @arpit974 in #5569
- adding new kubernetes-secret module. by @arpit974 in #5572
- adding new workload_identity_binding module. by @arpit974 in #5574
- adding new scripting module gke-backend-fetcher under community folder. by @arpit974 in #5593
- adding a new helm-upgrade module under community folder. by @arpit974 in #5595
- adding new spanner-migrations runner module under community folder. by @arpit974 in #5597
Module Improvements 🔨
- Adding native K8s annotations and GKE cluster enhancements by @arpit974 in #5610
- Default Kueue config for Pathways by @scaliby in #5628
Improvements 🛠
- [Telemetry] Get blueprint even from deployment directory by @kadupoornima in #5656
- [Telemetry] Capture exit code upon fatal command failures by @kadupoornima in #5658
- (gke) Remove additional network settings from A3U blueprint by @agrawalkhushi18 in #5652
- (gke) Remove additional networks from A4 and A4X family blueprints by @agrawalkhushi18 in #5682
- (gke) Remove additional network settings from TPU v6e, 7x and G4 by @agrawalkhushi18 in #5692
- [Telemetry] Add support to merge vars from deployment files and CLI --vars by @kadupoornima in #5694
- [Telemetry] Add support for collection of CPU machines and Default machines when unset in module by @kadupoornima in #5696
- Make Managed Lustre default in A3U and A3M series Slurm blueprints by @saara-tyagi27 in #5396
- [Telemetry] Add a retry mechanism to get the GCP Project information to eliminate transient issues by @kadupoornima in #5702
- [Telemetry] Add an atomic flag to ensure telemetry event is not recurrently called by @kadupoornima in #5705
- Pin DCGM to version 4.5.3 by @shubpal07 in #5721
- feat(gke): expose monitoring components as a parameter by @cboneti in #5722
- feat(job submission): Dynamic topology routing for gke jobs by @Neelabh94 in #5664
Deprecations 💤
- Remove hpc-slurm-static blueprint by @kadupoornima in #5672
Version Updates ⏫
- Fix A3 HighGPU test by pinning GKE version to 1.33 to resolve COS incompatibility by @kadupoornima in #5673
- Update minimum required Packer version to 1.15.3 by @AdarshK15 in #5701
Bug fixes 🐞
- fix: Add tpu_topology conditional logic for TPU flex start by @agrawalkhushi18 in #5655
- fix: Update the vpc module output name for additional network by @agrawalkhushi18 in #5690
- fix(slurm): correct vNUMA socket and SMT thread calculations in util.py by @kadupoornima in #5683
- Multi NIC support & cluster ID fix for Slurm controller by @rahimkhan19 in #5563
- [Telemetry] Collect the correct exit code when user intentionally stops deployment (0 instead of 1) by @kadupoornima in #5704
- Clean up custom spot VM variables during standard fallback by @rahimkhan19 in #5697
- fix: accelerator label auto resolution by @Neelabh94 in #5717
Full Changelog: v1.91.0...v1.92.0
v1.91.0
What's Changed
Key New Features 🎉
- Allow parallel containers for TPU7x by @Neelabh94 in #5612
- [Telemetry] Start collecting Telemetry data by adding a new "telemetry" command to GCluster CLI! by @kadupoornima in #5602
New Modules 🧱
- adding new module direct-helm-install in community folder. by @arpit974 in #5578
- adding new module spanner in cluster toolkit. by @arpit974 in #5592
Module Improvements 🔨
- Ensure fully qualified URLs for reservation subblocks by @scaliby in #5452
- Introduce Kueue and Jobset controller resources overrides inputs by @jamOne- in #5581
Improvements 🛠
- [Telemetry] Use GitHub API and local caching for metadata retrieval by @kadupoornima in #5589
- [Telemetry] Add support to collect the Blueprint name by @kadupoornima in #5547
- [Telemetry] Add support to collect the Deployment File name by @kadupoornima in #5539
- [Telemetry] Implement local caching to persistently store user config. Remove Firestore dependency completely by @kadupoornima in #5594
- fix: Update hardware.go for tpu_topology extraction through workload_policy by @agrawalkhushi18 in #5600
- feat: implement lean deployment modules by selective copying by @cboneti in #5482
- Add PriorityClasses to example Kueue configs by @scaliby in #5614
Bug fixes 🐞
- Correctly evaluate Docker credentials prerequisite state by @scaliby in #5607
- fix(slurm): respect visible_core_count in cloud.conf generation by @saara-tyagi27 in #5529
- fix(gke): Missing Pathways Quotas in Kueue by @Neelabh94 in #5645
Full Changelog: v1.90.0...v1.91.0
v1.90.0
What's Changed
Key New Features 🎉
- feat: Add job submission capability by introducing "gcluster job" command by @Neelabh94 in #5431
- Integrate storage profile in GCSFuse by @parulbajaj01 in #5476
Improvements 🛠
- [Telemetry] Refactor getModules method to use cached standard modules from firestore by @kadupoornima in #5570
- [Telemetry] Add support to collect Toolkit installation mode by @kadupoornima in #5598
Bug fixes 🐞
- Fix Slurm 25.05 topology.yaml parsing error by @AdarshK15 in #5558
Full Changelog: v1.89.0...v1.90.0
v1.89.0
What's Changed
Key New Features 🎉
- Add DRANET support in Cluster Toolkit by @FIoannides in #5418
Module Improvements 🔨
- (Slurm) Implement dynamic machine configurations via API by @AdarshK15 in #5514
- Add precondition and information on is_reservation_active input variable by @SwarnaBharathiMantena in #5543
Improvements 🛠
- Revamp GKE A3 Mega blueprint and align integration tests by @agrawalkhushi18 in #5483
- [Telemetry] Add support to collect machine type information by @kadupoornima in #5494
- [Telemetry] Add support to collect OS name and OS version information by @kadupoornima in #5502
- [Telemetry] Add support to collect the Terraform version metric by @kadupoornima in #5518
- [Telemetry] Add support to collect the orchestrator type (GKE, Slurm, VM Instance usage) by @kadupoornima in #5523
- Add Fractional G4 Slurm Blueprint by @LAVEEN in #5535
- [Telemetry] Add support to collect GCP Project Number by @kadupoornima in #5528
- [Telemetry] Add support to collect the Billing Account ID metric by @kadupoornima in #5519
- [Telemetry] Add support to collect modules used in blueprint by @kadupoornima in #5534
- [Telemetry] Add support to identify if user is internal or external by @kadupoornima in #5503
- [Telemetry] Set up a GitHub workflow to cache release metadata by @kadupoornima in #5553
Deprecations 💤
- Deprecate kubectl submodule by @agrawalkhushi18 in #5537
Version Updates ⏫
- Update the kueue version to latest v0.17.1 by @agrawalkhushi18 in #5520
Other changes
- Use release branch of CHS in daily tests by @sarthakag in #5522
New Contributors
- @rahimkhan19 made their first contribution in #5546
Full Changelog: v1.88.0...v1.89.0
v1.88.0
What's Changed
Key New Features 🎉
- feat: plumb optional auto monitoring scope by @jessicaochen in #5331
Breaking Changes 🚨
- Reapply "Modify the kubectl-apply manifest helm_release_naming (#5438)" by @agrawalkhushi18 in #5473
Module Improvements 🔨
- feat: Implement dynamic machine configurations via Compute Engine API by @SwarnaBharathiMantena in #5426
Improvements 🛠
- Integrate CHS to GKE and Slurm A3U and A4 Daily Tests by @simrankaurb in #5335
- Modify the kubectl-apply manifest helm_release_naming by @agrawalkhushi18 in #5438
- feat: Automatically add ghpc_creator label to expanded blueprint by @cboneti in #5468
- [Telemetry] Set up a base skeleton framework - Resubmit by @kadupoornima in #5475
- [Telemetry] Add Viper-based User config backed by Firestore DB by @kadupoornima in #5478
- [Telemetry] Add metric implementation to collect flags by @kadupoornima in #5486
- [Telemetry] Add region and zone metrics implementation by @kadupoornima in #5489
New Contributors
- @jessicaochen made their first contribution in #5331
Full Changelog: v1.87.0...v1.88.0
v1.87.0
What's Changed
Key New Features 🎉
- Adding kueue support for GKE A4X-Max by @vikramvs-gg in #5389
- Add Customer Managed Encryption Keys (CMEK) support in Managed Lustre by @parulbajaj01 in #5449
Breaking Changes 🚨
- Migrating install_asapd_lite module to helm by @agrawalkhushi18 in #5410
Module Improvements 🔨
- refactor: Fix pre-commit error in kubectl-apply by @jamOne- in #5427
- feat: Add resource-policy accelerator_topology_mode by @jamOne- in #5393
Improvements 🛠
- Update Filestore tier default by @saara-tyagi27 in #5379
- Revamp GKE A3 High blueprint and align integration tests by @shubpal07 in #5246
Deprecations 💤
- Marking parallelstore deprecated for gcluster warnings by @vikramvs-gg in #5325
Full Changelog: v1.86.0...v1.87.0
v1.86.0
What's Changed
Key New Features 🎉
- feat: Implement and configure GKE Image Streaming (GCFS) at the cluster level. by @raushan2016 in #5387
- Support vGPU (fractional GPU) for G4 GKE by @kadupoornima in #5399
- Support Customer-Managed Encryption Keys (CMEK) in Slurm GCP deployments by @saara-tyagi27 in #5407
Breaking Changes 🚨
- Enable JobSet and Nvidia Data Center monitoring by default by @SikaGrr in #5384
- Migrate kubectl_apply_manifest module to helm by @agrawalkhushi18 in #5282
Module Improvements 🔨
- feat: Add Enable GKE Slice Controller by @jamOne- in #5375
- Pathways cluster config by @FIoannides in #5370
Improvements 🛠
- Upgrade the DCGMI Version to 4.5.2 by @LAVEEN in #5408
- Feat: Automatically derive TPU node counts based on topology and machine type by @SwarnaBharathiMantena in #5386
- Upgrade Debian version in test runner image by @parulbajaj01 in #5400
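
The TPU node-count derivation mentioned in #5386 can be pictured with simple arithmetic. The sketch below is illustrative only and is not the toolkit's implementation: it assumes the node count is the product of the topology dimensions (total chips) divided by the chips attached to each node, and the `tpu_node_count` function and `CHIPS_PER_NODE` table are hypothetical names introduced here.

```python
from math import prod

# Hypothetical lookup of TPU chips per node for a machine type.
# The entries are illustrative assumptions, not read from the toolkit.
CHIPS_PER_NODE = {
    "ct6e-standard-4t": 4,
    "ct6e-standard-8t": 8,
}


def tpu_node_count(topology: str, machine_type: str) -> int:
    """Derive a node count from a topology string such as "4x4".

    Assumption: nodes = (product of topology dimensions) / chips per node.
    """
    dims = [int(d) for d in topology.lower().split("x")]
    total_chips = prod(dims)
    per_node = CHIPS_PER_NODE[machine_type]
    if total_chips % per_node != 0:
        raise ValueError(
            f"topology {topology} ({total_chips} chips) is not divisible "
            f"by {per_node} chips per node"
        )
    return total_chips // per_node


if __name__ == "__main__":
    # A 4x4 topology has 16 chips; at 4 chips per node that is 4 nodes.
    print(tpu_node_count("4x4", "ct6e-standard-4t"))
```

Under these assumptions a user would set only the topology and machine type, and a mismatched pair fails early instead of producing a fractional node count.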
Version Updates ⏫
- Update Slurm images to 6-12 by @AdarshK15 in #5273
- Bump slurm-gcp tag to 6.12.1 (Slurm 25.11.4) by @AdarshK15 in #5269
Bug fixes 🐞
- Update the TFLint Google ruleset version to 0.30.0 by @SwarnaBharathiMantena in #5401
- enable execution of external prolog/epilog for A4X by @Neelabh94 in #5403
- fix: Null iteration in kubectl-apply module by @sudheer-quad in #5430
New Contributors
- @SikaGrr made their first contribution in #5384
- @FIoannides made their first contribution in #5370
Full Changelog: v1.85.0...v1.86.0
v1.85.0
What's Changed
Key New Features 🎉
- feat(storage): Enable GCS zonal bucket capability with RAPID storage. by @Neelabh94 in #5353
- Support future reservation in name check validator by @saara-tyagi27 in #5252
Breaking Changes 🚨
- Update cloud_dns_config to default to KUBE_DNS (CoreDNS) by @SwarnaBharathiMantena in #5336
Improvements 🛠
- Add Managed Lustre integration in gke a4x-max by @parulbajaj01 in #5337
- Binary dependencies downloading script by @scaliby in #5354
- fix: use cleaned relative path instead of absolute path for local module hash by @mtibben in #5280
- feat: multi-arch build support and README updates by @kvenkatachala333 in #5388
Version Updates ⏫
- Pin shfmt and goimports version to resolve Go version conflict by @kadupoornima in #5365
Full Changelog: v1.84.0...v1.85.0
v1.84.0
What's Changed
Key New Features 🎉
- Validate disk type in zone by @saara-tyagi27 in #5232
Version Updates ⏫
- Update gke-versioning in gpu_direct.tf by @agrawalkhushi18 in #5284
Bug fixes 🐞
- Update nccl test script to fix enroot directory issue in A3H by @agrawalkhushi18 in #5324
Full Changelog: v1.83.0...v1.84.0
v1.83.0
What's Changed
Key New Features 🎉
- feat(validations): Add early conditional validation by @AdarshK15 in #5160
- A4x Max BM slurm support. by @arpit974 in #5222
- Adding GKE TPU DWS Queued Provisioning support for v6e and 7x by @shubpal07 in #5218
- feat(validations): Add early required validation by @AdarshK15 in #5166
- Module deprecation warning system by @vikramvs-gg in #5229
- A4X-Max Bare Metal GKE toolkit blueprint by @vikramvs-gg in #5211
Breaking Changes 🚨
- Update and pin terraform version to 1.12.2 by @parulbajaj01 in #5216
- Update wait flag and resolving helm_release deadlock destruction error by @agrawalkhushi18 in #5147
Module Improvements 🔨
- Migrate configure_kueue from gavinbunney to helm by @agrawalkhushi18 in #5129
- Migrate install_gib from kubectl to helm by @agrawalkhushi18 in #5256
Improvements 🛠
- Add reservation name check validator by @saara-tyagi27 in #5185
- Update go files to add timestamps to gcluster logs by @agrawalkhushi18 in #5198
- Pin DCGM version 4.5.1-1 by @saara-tyagi27 in #5197
- Add support for DualStack (IPv4/IPv6) networks by @DomiKoPL in #5206
Bug fixes 🐞
- Update slurm_cluster_name regex by @saara-tyagi27 in #5261
- Fix SELinux issue in hpc-build-slurm-image blueprint by @AdarshK15 in #5266
- Hotfix: update G4 NVIDIA drivers for kernel 6.17 compatibility by @SwarnaBharathiMantena in #5289
- Hardcode zone in a2high PR test to fix test failures by @kadupoornima in #5305
- Modifying prefix_length for PSA to accommodate sufficient IPs for peering by @vikramvs-gg in #5306
- fix: Update a3m and a3u script to resolve slurm nccl test failure by @agrawalkhushi18 in #5308
Full Changelog: v1.82.0...v1.83.0