Skip to content

Create a new community scheduler module for Slinky (Slurm on Kubernetes)#3862

Merged
nick-stroud merged 40 commits into
GoogleCloudPlatform:developfrom
ndebuhr:feat/slinky-scheduler
Aug 8, 2025
Merged

Create a new community scheduler module for Slinky (Slurm on Kubernetes)#3862
nick-stroud merged 40 commits into
GoogleCloudPlatform:developfrom
ndebuhr:feat/slinky-scheduler

Conversation

@ndebuhr

@ndebuhr ndebuhr commented Mar 31, 2025

Copy link
Copy Markdown
Contributor

Added a community Slinky (Slurm-on-Kubernetes) module and example.

This PR introduces a new community module and an accompanying example blueprint to enable the deployment of Slinky, a Slurm workload manager implementation on Kubernetes.

The new module handles the installation of necessary components (Cert Manager, Slinky Operator, and Slurm Cluster) via their respective Helm charts onto a target GKE cluster. It allows for customization through Helm values overrides and supports best practice node affinities for Slinky system components and nodepools.

The example blueprint provides a practical example demonstrating how to deploy a Slinky cluster with both a debug node pool and an H3 HPC node pool. The example also includes configuration for monitoring the Slurm cluster using Google Managed Prometheus (GMP) via a PodMonitoring resource.

Design notes:

  1. The Helm-based approach closely follows the documented "quickstart" Slinky installation.
  2. A lightweight, GCP-native metrics/monitoring system is adopted by default, rather than the Slinky-documented cluster-local Kube Prometheus Stack.
  3. As of 2025/04/03, the quickstart documentation suggests a v0.1.0 installation, but v0.2.0 is implemented as the module default given its improved stability.
  4. The example adopts a zonal implementation, which is common to avoid inter-zonal networking charges and to minimize latency. This includes skipping the GKE Cluster "system node pool", to avoid potential zone-based volume node affinity conflicts (or the need to use custom regional storage classes).
  5. The module is unopinionated on nodeset node affinities, given the need for compute fabric flexibility. However, a module dependency is used for the Slurm system component node affinities, given less need for flexibility and to ensure a node pool remains up during Helm uninstalls (gcluster destroys).
  6. Terraform's concat() is used for Helm values, given nested object type safety constraints.

The new module and example have been manually tested, pretty extensively, using Terraform v1.11.3 and Packer v1.12.0.

CC @samskillman as FYI, given additional context

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • ✅ Fork your PR branch from the Toolkit "develop" branch (not main)
  • ✅ Test all changes with pre-commit in a local branch #
  • ✅ Confirm that "make tests" passes all tests
  • ✅ Add or modify unit tests to cover code changes
  • ✅ Ensure that unit test coverage remains above 80%
  • ✅ Update all applicable documentation
  • ✅ Follow Cluster Toolkit Contribution guidelines #

…s), which follows the standard Helm-based installation mechanism
@ighosh98 ighosh98 self-requested a review April 1, 2025 18:54
ndebuhr added 12 commits April 2, 2025 02:19
…ccelerate installations (based on exponential backoffs and dependency timing) and reduce the risk of exceeding context deadlines
…al zone-based volume node affinity conflicts (or the need to use custom regional storage classes)
…nd Slurm nodesets, to improve cluster efficiency and control
…ner dependency management (i.e., avoid potential issues with running-node-dependent namespace finalizers) and streamlined provider configurations
…values and a Kube Prometheus Stack installation (both of which are default/recommended in the Slinky quickstart)
…the HPC Slinky example, for scraping extensive Slurm metrics into Cloud Monitoring (the DIY Kube Prometheus Stack alternative is by-default disabled in the example)
…ep) to the module variables, as only the Slurm Exporter needs a small change/override (the default image does not exist), and make a prerequisite shift from Terraform's shallow merge() to Helm's deep values merge
@ndebuhr

ndebuhr commented Apr 4, 2025

Copy link
Copy Markdown
Contributor Author

@ighosh98 This is ready for review. Keep me posted here (or via internal chat messages) on any additional questions/concerns/steps.

@ndebuhr ndebuhr marked this pull request as ready for review April 4, 2025 02:05
@ndebuhr ndebuhr requested review from a team and samskillman as code owners April 4, 2025 02:05

@ighosh98 ighosh98 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add an integration test for this module in the PR?

Comment thread community/examples/hpc-slinky.yaml
Comment thread community/examples/hpc-slinky.yaml Outdated
Comment thread community/examples/hpc-slinky.yaml Outdated
Comment thread community/examples/hpc-slinky.yaml
Comment thread community/examples/hpc-slinky.yaml Outdated
Comment thread community/modules/scheduler/slinky/versions.tf
ndebuhr and others added 2 commits April 21, 2025 14:13
Co-authored-by: Sam Skillman <samskillman@google.com>
Co-authored-by: Sam Skillman <samskillman@google.com>
@ndebuhr

ndebuhr commented Apr 21, 2025

Copy link
Copy Markdown
Contributor Author

Thanks @samskillman! Will rework and retest a number of things here, based on Friday's Slinky v0.2.1 release (which FYI is also related to "Why aren't we creating a login service like the quickstart" and "Consider adding a note on how to connect to the cluster.", as the quickstart didn't have a login service in v0.2.0, and the documented path to "connect" was kubectl exec. Glad to see some improvements in v0.2.1 and happy to integrate. Will ping you when this is ready for another round of review

ndebuhr added 7 commits April 21, 2025 19:42
…ount (objectViewer→objectAdmin), in line with recent changes to GKE best practices in other blueprints
…ical specification for exploring multiple nodesets (debug AND h3) and multiple nodes (two per nodeset), while parameterizing these values for easier steady-state setup
…nodeset (minimal, but sufficient for multi-node testing/exploration)
…ed v0.2.0 Slurm Exporter bug workaround (fixed in v0.2.1)
@ndebuhr

ndebuhr commented Apr 21, 2025

Copy link
Copy Markdown
Contributor Author

@samskillman Reworked, retested, and ready for your review. Some high-level notes:

  • v0.2.1 did indeed fix the v0.2.0 Slurm Exporter issue - removed the workaround.
  • v0.2.1 does not include the new login node and connection system. I'll keep an eye out and submit a PR as soon as v0.2.2 or v0.3.0 is released. Even if we wanted to (which we don't 🙂), building on top of Slinky main or some commit hash would be very messy, as we'd have to circumvent the official publishings in the OCI chart repo.
  • The blueprint previously used a default h3 nodeset of 0 replicas - assuming folks would use the debug nodeset for development and then scale out the h3 nodeset as needed. However, as your issue with the h3 partition highlights, this was not intuitive or well documented. Both the debug nodeset and h3 nodeset have been updated to a default of 2 replicas (both can be overridden with blueprint variables), and the documentation is clarified/expanded.

Tested your sbatch --wrap "hostname" command on debug and h3 partitions. With the aforementioned 0→2 h3 nodeset replica change, they both work fine - just need to check /tmp on the nodeset replica that ran the job (e.g., via kubectl exec, since there is no RWX shared volume between the nodeset replicas and the controller pod.

@samskillman samskillman added the release-key-new-features Added to release notes under the "Key New Features" heading. label Apr 21, 2025
@samskillman samskillman self-assigned this Apr 21, 2025
@samskillman samskillman added the release-new-modules Added to release notes under the "New Modules" heading. label Apr 21, 2025
@ighosh98 ighosh98 self-requested a review April 25, 2025 06:11
@ndebuhr

ndebuhr commented Apr 25, 2025

Copy link
Copy Markdown
Contributor Author

After some discussions on this, I'll wait until Slinky v0.3.0 is released, integrate the latest, and re-ping for review. There's some good stuff in v0.3.0 (especially a login node and RWX mount support) that will significantly improve usability, and it seems like the release will come soon.

@ndebuhr ndebuhr marked this pull request as draft April 26, 2025 20:18
@sharabiani

Copy link
Copy Markdown
Collaborator

After some discussions on this, I'll wait until Slinky v0.3.0 is released, integrate the latest, and re-ping for review. There's some good stuff in v0.3.0 (especially a login node and RWX mount support) that will significantly improve usability, and it seems like the release will come soon.

As the v0.3.0 release is out, is there any plan to continue this PR? @ndebuhr

@nick-stroud

Copy link
Copy Markdown
Collaborator

/gcbrun

@sharabiani

Copy link
Copy Markdown
Collaborator

/gcbrun

@sharabiani sharabiani marked this pull request as ready for review August 8, 2025 05:59
@samskillman samskillman self-requested a review August 8, 2025 12:24

@samskillman samskillman left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The decision is to move forward and bring in v0.3.0 in later work. Thanks!

@samskillman samskillman requested a review from sharabiani August 8, 2025 12:32
@nick-stroud nick-stroud merged commit 3c69786 into GoogleCloudPlatform:develop Aug 8, 2025
14 of 68 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-key-new-features Added to release notes under the "Key New Features" heading. release-new-modules Added to release notes under the "New Modules" heading.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants