Create a new community scheduler module for Slinky (Slurm on Kubernetes) #3862
…s), which follows the standard Helm-based installation mechanism
…pts, and custom partition configs
…ccelerate installations (based on exponential backoffs and dependency timing) and reduce the risk of exceeding context deadlines
…al zone-based volume node affinity conflicts (or the need to use custom regional storage classes)
…nd Slurm nodesets, to improve cluster efficiency and control
…ner dependency management (i.e., avoid potential issues with running-node-dependent namespace finalizers) and streamlined provider configurations
…values and a Kube Prometheus Stack installation (both of which are default/recommended in the Slinky quickstart)
…the HPC Slinky example, for scraping extensive Slurm metrics into Cloud Monitoring (the DIY Kube Prometheus Stack alternative is disabled by default in the example)
…ts, customization, and usage
…ep) to the module variables, as only the Slurm Exporter needs a small change/override (the default image does not exist), and make a prerequisite shift from Terraform's shallow merge() to Helm's deep values merge
@ighosh98 This is ready for review. Keep me posted here (or via internal chat messages) on any additional questions/concerns/steps.
ighosh98 left a comment:
Could you please add an integration test for this module in the PR?
…proved security (HPC Slinky blueprint example)
…duce initial node pool requirements and associated provisioning times
… specification to a separate file in the HPC Slinky example
…y example, to follow conventions
Co-authored-by: Sam Skillman <samskillman@google.com>
Thanks @samskillman! Will rework and retest a number of things here, based on Friday's Slinky v0.2.1 release (which FYI is also related to "Why aren't we creating a login service like the quickstart" and "Consider adding a note on how to connect to the cluster.", as the quickstart didn't have a login service in v0.2.0, and the documented path to "connect" was
…ount (objectViewer→objectAdmin), in line with recent changes to GKE best practices in other blueprints
…odepool scaling structures
…ical specification for exploring multiple nodesets (debug AND h3) and multiple nodes (two per nodeset), while parameterizing these values for easier steady-state setup
…nodeset (minimal, but sufficient for multi-node testing/exploration)
…ed v0.2.0 Slurm Exporter bug workaround (fixed in v0.2.1)
@samskillman Reworked, retested, and ready for your review. Some high-level notes:
Tested your
…cluster connection command
After some discussions on this, I'll wait until Slinky v0.3.0 is released, integrate the latest, and re-ping for review. There's some good stuff in v0.3.0 (especially a login node and RWX mount support) that will significantly improve usability, and it seems like the release will come soon.
Now that v0.3.0 has been released, is there a plan to continue this PR? @ndebuhr
/gcbrun
/gcbrun
samskillman left a comment:
LGTM. The decision is to move forward and bring in v0.3.0 in later work. Thanks!
Merged commit 3c69786 into GoogleCloudPlatform:develop
Added a community Slinky (Slurm-on-Kubernetes) module and example.
This PR introduces a new community module and an accompanying example blueprint to enable the deployment of Slinky, a Slurm workload manager implementation on Kubernetes.
The new module installs the necessary components (Cert Manager, Slinky Operator, and Slurm Cluster) onto a target GKE cluster via their respective Helm charts. It allows customization through Helm values overrides and supports best-practice node affinities for Slinky system components and nodepools.
The example blueprint demonstrates how to deploy a Slinky cluster with both a debug node pool and an H3 HPC node pool. It also includes configuration for monitoring the Slurm cluster using Google Managed Prometheus (GMP) via a PodMonitoring resource.
Design notes:
concat() is used for Helm values, given nested object type safety constraints.
The new module and example have been manually tested fairly extensively, using Terraform v1.11.3 and Packer v1.12.0.
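To illustrate the difference between Terraform's shallow merge() and Helm's deep values merge mentioned in the commit notes, here is a minimal Python sketch. It is not part of the module and does not call Terraform or Helm. It only mimics the two merge behaviours on hypothetical values (the keys, image repository, and tags are made up for illustration).

```python
from copy import deepcopy


def shallow_merge(*maps):
    """Mimic a shallow merge: a later top-level key replaces the earlier one wholesale."""
    out = {}
    for m in maps:
        out.update(m)
    return out


def deep_merge(*maps):
    """Mimic a deep merge: nested dicts are merged key by key.

    Non-dict values (including lists) are replaced by the later value.
    """
    out = {}
    for m in maps:
        for key, value in m.items():
            if isinstance(value, dict) and isinstance(out.get(key), dict):
                out[key] = deep_merge(out[key], value)
            else:
                out[key] = deepcopy(value)
    return out


# Hypothetical chart defaults and a small user override (illustrative only).
defaults = {
    "exporter": {
        "enabled": True,
        "image": {"repository": "example.invalid/exporter", "tag": "0.0.0"},
    }
}
override = {"exporter": {"image": {"tag": "1.2.3"}}}

shallow = shallow_merge(defaults, override)
deep = deep_merge(defaults, override)

print("shallow:", shallow)  # the whole "exporter" block is replaced by the override
print("deep:   ", deep)     # only the nested "tag" changes; sibling defaults survive
```

With a shallow merge, overriding one nested field means restating every sibling default. Passing values to Helm as an ordered list of documents (presumably what concat() assembles in this module) leaves the deep merge to Helm, so a small override stays small.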
CC @samskillman as FYI, given additional context
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.