Migration of jobset from static manifests to helm chart and upgrading version to 0.10.1 #4765

Merged
shubpal07 merged 1 commit into GoogleCloudPlatform:develop from shubpal07:shubham/jobset-on-helm
Oct 23, 2025
@shubpal07 shubpal07 commented Oct 15, 2025

Contributor

This PR migrates the JobSet installation from a static YAML manifest to the official Helm chart.

Key Changes:

  • Replaced the kubectl apply module for JobSet with our generic helm_install module.
  • Updated the chart source to the official OCI repository: oci://registry.k8s.io/jobset/charts
  • Created a jobset-helm-values.yaml to explicitly configure the necessary tolerations and resource requests.

This change improves maintainability by aligning with upstream best practices and eliminating our reliance on a custom-patched static manifest. The Helm-based approach also makes the customizations explicit, clean, and easy to manage.

NOTE: Regarding the jobset version format: the official jobset installation guide passes the version with a leading v, like v0.10.1. The jobset helm installation guide uses just the version number and omits the v, like 0.10.1. Since we install jobset through the helm chart, we pass just the version number.
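
For illustration, the version format difference described in the note can be sketched in Python as below. This is a minimal sketch and not code from this PR: the function name helm_chart_version is hypothetical, and the toolkit may handle the version string differently.

```python
def helm_chart_version(version: str) -> str:
    """Return the version without a leading 'v', e.g. 'v0.10.1' -> '0.10.1'.

    Hypothetical helper for illustration only; not part of the toolkit.
    """
    cleaned = version.strip()
    # Drop a single leading "v" or "V" only when a digit follows it,
    # so that unrelated strings are returned unchanged.
    if len(cleaned) > 1 and cleaned[0] in "vV" and cleaned[1].isdigit():
        cleaned = cleaned[1:]
    return cleaned


print(helm_chart_version("v0.10.1"))  # 0.10.1
print(helm_chart_version("0.10.1"))   # 0.10.1
```

A version that already has the bare form passes through unchanged, so the helper is safe to apply to either style.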

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 shubpal07 self-assigned this Oct 15, 2025
@shubpal07 shubpal07 added the release-module-improvements label (Added to release notes under the "Module Improvements" heading) Oct 15, 2025
@shubpal07 shubpal07 force-pushed the shubham/jobset-on-helm branch from 03499dd to d0b86ef October 16, 2025 04:39
@shubpal07

Contributor Author

Integration tests of GKE A3 ultra and A3 mega succeeded.

Comment thread modules/management/kubectl-apply/variables.tf Outdated
Comment thread modules/management/kubectl-apply/variables.tf
Comment thread community/examples/xpk-n2-filestore/xpk-n2-filestore.yaml
@shubpal07

Contributor Author

Ran the relevant integration tests using the babysit tool. Details:

Successful:

PR-Go-1-24-build-test (hpc-toolkit-dev): Successful in 4m (Required)
PR-ofe-venv (hpc-toolkit-dev): Successful in 1m
PR-test-gke-a2-highgpu-kueue (hpc-toolkit-dev): Successful in 106m
PR-test-gke-a3-megagpu (hpc-toolkit-dev): Successful in 75m
PR-test-gke-a3-ultragpu (hpc-toolkit-dev): Successful in 81m
PR-test-gke-a3-ultragpu-nccl (hpc-toolkit-dev): Successful in 139m
PR-test-gke-a4 (hpc-toolkit-dev): Successful in 112m
PR-test-slurm-gcp-v6-simple-job-completion (hpc-toolkit-dev): Successful in 10m
PR-validation (hpc-toolkit-dev): Successful in 6m (Required)
Use pre-commit to validate Pull Request / pre-commit (pull_request): Successful in 9m (Required)
Use pre-commit to validate Pull Request / pre-commit-highest-dependencies (pull_request)

Failures:
PR-test-gke-a3-highgpu (hpc-toolkit-dev): Failing after 119m
PR-test-gke-g4 (hpc-toolkit-dev): Failing after 36m
PR-test-gke-h4d (hpc-toolkit-dev): Failing after 74m
PR-test-slurm-gke (hpc-toolkit-dev): Failing after 13m

@shubpal07

Contributor Author

Investigation of the failed builds:

  1. PR-test-gke-a3-highgpu (hpc-toolkit-dev): failed because the NCCL test's average bus bandwidth was below the threshold. This test has been failing in prod too, so it is not linked to the changes in this PR.

  2. PR-test-gke-g4 (hpc-toolkit-dev): failed with this error: ERROR: (gcloud.compute.os-login.ssh-keys.add) FAILED_PRECONDITION: Login profile size exceeds 32 KiB. Delete profile values to make additional space. Not directly linked to the PR changes; it failed in prod too.

  3. PR-test-gke-h4d (hpc-toolkit-dev): one build failed because of capacity issues, and another build attempt failed on a timeout while waiting for gke-job to complete (the same issue shows up in some prod builds as well). However, the logs suggest the jobset chart was successfully created.

  4. PR-test-slurm-gke (hpc-toolkit-dev): failed because of a provider mismatch (failures are present in prod as well). Not directly linked to the changes in this PR.

@shubpal07 shubpal07 changed the title from "Migration of jobset from static manifests to helm chart" to "Migration of jobset from static manifests to helm chart and upgrading version to 0.10.1" Oct 17, 2025
@shubpal07 shubpal07 marked this pull request as ready for review October 17, 2025 14:48
@shubpal07 shubpal07 requested review from a team and samskillman as code owners October 17, 2025 14:48
@shubpal07 shubpal07 force-pushed the shubham/jobset-on-helm branch from 46aa824 to 44f2390 October 17, 2025 14:51
@SwarnaBharathiMantena

Contributor

This will cause issues for users who have been using specific versions lower than v0.8.2. Should this be marked with the release-breaking-changes label, since some of the jobset versions will no longer be supported?

@shubpal07

Contributor Author

Integration test suite results after commit 44f2390

Passed:

PR-Go-1-24-build-test (hpc-toolkit-dev): Successful in 3m
PR-ofe-venv (hpc-toolkit-dev): Successful in 1m
PR-test-gke-a2-highgpu-kueue (hpc-toolkit-dev): Successful in 6129m
PR-test-gke-a3-megagpu (hpc-toolkit-dev): Successful in 6062m
PR-test-gke-a3-ultragpu (hpc-toolkit-dev): Successful in 6063m
PR-test-gke-a3-ultragpu-nccl (hpc-toolkit-dev): Successful in 6099m
PR-test-gke-a4 (hpc-toolkit-dev): Successful in 6100m
PR-test-gke-g4 (hpc-toolkit-dev): Successful in 6090m
PR-test-gke-h4d (hpc-toolkit-dev): Successful in 6058m
PR-test-slurm-gcp-v6-simple-job-completion (hpc-toolkit-dev): Successful in 10m

Failed:

TASK [Ensure average bus bandwidth is >= 25 GB/s].

This task has been failing in prod for a long time as well. It is not directly linked to this PR.

Could not retrieve the list of available versions for provider
Step #1 - "slurm-gke": hashicorp/google: no available releases match the given constraints.

This step has been failing in prod for a long time as well. It is not directly linked to this PR.

@shubpal07 shubpal07 added the release-breaking-changes label (Prevents "smooth" re-deploy across versions) Oct 22, 2025
Making jobset version 0.10.1 as default
removing static jobset manifests

Supporting jobset minversion till 0.9.0
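
The minimum-version constraint in the commit note above could be checked along these lines. This is a minimal sketch in Python, assuming plain numeric major.minor.patch versions with no pre-release suffix. The names parse_version, is_supported, and MIN_JOBSET_VERSION are hypothetical and not part of the toolkit, which may enforce the constraint differently.

```python
MIN_JOBSET_VERSION = "0.9.0"  # minimum version taken from the commit note above


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.10.1' or 'v0.10.1' into (0, 10, 1) for numeric comparison."""
    return tuple(int(part) for part in version.strip().lstrip("vV").split("."))


def is_supported(version: str, minimum: str = MIN_JOBSET_VERSION) -> bool:
    """Return True when version is at least the minimum supported version."""
    # Tuples compare element by element, so (0, 10, 1) > (0, 9, 0).
    # Comparing the raw strings would wrongly rank "0.10.1" below "0.9.0".
    return parse_version(version) >= parse_version(minimum)


print(is_supported("0.10.1"))  # True
print(is_supported("0.8.2"))   # False
```

Comparing parsed integer tuples rather than strings matters here because the new default, 0.10.1, sorts before 0.9.0 lexicographically.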
@shubpal07 shubpal07 force-pushed the shubham/jobset-on-helm branch from b16d735 to bc904b8 October 22, 2025 17:28
Comment thread community/examples/xpk-n2-filestore/xpk-n2-filestore.yaml

@SwarnaBharathiMantena SwarnaBharathiMantena left a comment

Contributor

LGTM

@shubpal07 shubpal07 merged commit 765c491 into GoogleCloudPlatform:develop Oct 23, 2025
16 of 70 checks passed
Labels

release-breaking-changes: Prevents "smooth" re-deploy across versions
release-module-improvements: Added to release notes under the "Module Improvements" heading.

3 participants