Add slurm-gke blueprint #4607

Merged
ACW101 merged 16 commits into GoogleCloudPlatform:develop from ACW101:blueprint on Oct 9, 2025

Conversation

ACW101 commented Sep 4, 2025

Collaborator

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

ACW101 requested review from a team and samskillman as code owners September 4, 2025 05:41

ACW101 commented Sep 4, 2025

Collaborator Author

This PR includes using the existing NFS server on the controller to distribute the Slurm key to nodesets running on GKE. If this is accepted, we can close #4562, as it will no longer be needed.

Comment thread community/examples/slurm-gke/slurm-gke.yaml Outdated
Comment thread community/examples/slurm-gke/slurm-gke.yaml
nick-stroud added the release-improvements label Sep 9, 2025
@nick-stroud

Collaborator

/gcbrun

nick-stroud left a comment

Collaborator

Only partial review. I will try to send more by 6pm PT.

Comment thread community/examples/slurm-gke/slurm-gke.yaml Outdated
Comment thread community/examples/slurm-gke/slurm-gke.yaml Outdated
Comment thread community/examples/slurm-gke/slurm-gke.yaml Outdated
Comment thread community/examples/slurm-gke/slurm-gke.yaml Outdated
Comment thread community/modules/compute/gke-nodeset/README.md
Comment thread community/modules/compute/gke-nodeset/variables.tf
Comment thread community/modules/compute/gke-nodeset/variables.tf Outdated
Comment thread community/modules/compute/gke-nodeset/variables.tf Outdated
Comment thread community/modules/compute/gke-nodeset/variables.tf
Comment thread community/modules/compute/gke-nodeset/main.tf
Comment thread community/modules/compute/gke-nodeset/templates/nodeset-general.yaml.tftpl Outdated
Comment thread community/modules/compute/gke-nodeset/templates/nodeset-general.yaml.tftpl Outdated
Comment thread community/modules/compute/gke-partition/variables.tf
ACW101 force-pushed the blueprint branch 5 times, most recently from 9ab8f60 to 9ebb0fe September 12, 2025 17:56
ACW101 requested a review from nick-stroud September 18, 2025 21:04
Comment thread community/modules/compute/gke-nodeset/variables.tf
Comment thread community/examples/slurm-gke/slurm-gke.yaml
ACW101 commented Sep 24, 2025

Collaborator Author

/gcbrun

Comment thread tools/cloud-build/daily-tests/tests/slurm-gke.yml
Comment thread community/examples/slurm-gke/slurm-gke.yaml
pawloch00 commented Sep 25, 2025

Contributor

I deployed the blueprint from this PR. After a few minutes, the GKE-based nodes went to the DOWN state. The slurmd log shows:

CPUs=8 Boards=1 Sockets=1 Cores=8 Threads=1 Memory=64309 TmpDisk=96515 Uptime=67206 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2025-09-25T08:43:49.012] error: Security violation, ping RPC from uid 981
[2025-09-25T08:43:49.012] error: Do you have SlurmUser configured as uid 981?

If nodes go down for this reason, then we need our own build of the slurmd container with SlurmUser=401.

ACW101 commented Sep 25, 2025

Collaborator Author

I deployed the blueprint from this PR. After a few minutes, the GKE-based nodes went to the DOWN state. The slurmd log shows:

CPUs=8 Boards=1 Sockets=1 Cores=8 Threads=1 Memory=64309 TmpDisk=96515 Uptime=67206 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2025-09-25T08:43:49.012] error: Security violation, ping RPC from uid 981
[2025-09-25T08:43:49.012] error: Do you have SlurmUser configured as uid 981?

If nodes go down for this reason, then we need our own build of the slurmd container with SlurmUser=401.

This was caused by a UID mismatch in the public Slinky image. A custom-built image with the correct UID is required to fix this. I will open a follow-up PR with instructions for building this image.
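
As an illustration only (not part of this PR), the symptom quoted above can be spotted by scanning slurmd log text for the "Security violation" message and pulling out the UID it reports. The helper name `find_security_violations` and the regular expression are assumptions made for this sketch; the sample lines are copied from the log excerpt in this thread.

```python
import re

# Matches the slurmd message quoted in this thread, e.g.
# "error: Security violation, ping RPC from uid 981".
# The pattern is an assumption based on that one sample line.
SECURITY_VIOLATION = re.compile(
    r"error: Security violation, (\w+) RPC from uid (\d+)"
)


def find_security_violations(log_text):
    """Return (rpc_name, uid) pairs for each rejected RPC found in the log."""
    violations = []
    for line in log_text.splitlines():
        match = SECURITY_VIOLATION.search(line)
        if match:
            violations.append((match.group(1), int(match.group(2))))
    return violations


# Sample lines copied from the slurmd log excerpt above.
sample = (
    "[2025-09-25T08:43:49.012] error: Security violation, ping RPC from uid 981\n"
    "[2025-09-25T08:43:49.012] error: Do you have SlurmUser configured as uid 981?\n"
)

print(find_security_violations(sample))  # [('ping', 981)]
```

A non-empty result suggests that the UID sending RPCs and the UID slurmd treats as SlurmUser disagree, which is the mismatch described above.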

pawloch00 previously approved these changes Sep 26, 2025
pawloch00 commented Sep 30, 2025

Contributor

From time to time the deployment fails: pods are stuck in the init stage with the message:

MountVolume.MountDevice failed for volume "slurm-key-pv" : rpc error: code = DeadlineExceeded desc = context deadline exceeded

Also, ./gcluster deploy sometimes has to be run twice, because the first run fails with the error message below:

for: "/tmp/608228131kubectl_manifest.yaml": error when patching "/tmp/608228131kubectl_manifest.yaml": PersistentVolume "slurm-key-pv" is invalid: spec.persistentvolumesource: Forbidden: spec.persistentvolumesource is immutable after creation
  core.PersistentVolumeSource{
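
For readers unfamiliar with this error: the message says that the volume source of an existing PersistentVolume cannot be changed by a patch. The sketch below models that rule with plain dictionaries to show when an in-place update would be rejected. It is a simplified illustration, not the Toolkit's or Kubernetes' implementation; the function names, the field subset, and the sample server addresses are all hypothetical, and it does not attempt to explain why a second deploy succeeds.

```python
# Illustrative subset of volume-source fields that can appear under a
# PersistentVolume spec. Which source "slurm-key-pv" actually uses is not
# stated in this thread.
VOLUME_SOURCE_FIELDS = ("csi", "nfs", "hostPath", "gcePersistentDisk")


def volume_source(pv_spec):
    """Extract only the volume-source part of a PersistentVolume spec."""
    return {
        field: pv_spec[field]
        for field in VOLUME_SOURCE_FIELDS
        if field in pv_spec
    }


def needs_recreate(existing_spec, desired_spec):
    """True when the volume source differs, so a patch would be rejected."""
    return volume_source(existing_spec) != volume_source(desired_spec)


# Hypothetical specs: only the NFS server address differs.
existing = {
    "capacity": {"storage": "1Gi"},
    "nfs": {"server": "10.0.0.2", "path": "/export"},
}
desired = {
    "capacity": {"storage": "1Gi"},
    "nfs": {"server": "10.0.0.3", "path": "/export"},
}

print(needs_recreate(existing, desired))  # True
```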

nick-stroud self-assigned this Oct 6, 2025
samskillman previously approved these changes Oct 6, 2025

ACW101 commented Oct 7, 2025

Collaborator Author

/gcbrun

nick-stroud previously approved these changes Oct 7, 2025
Comment thread community/examples/slurm-gke/slurm-gke.yaml Outdated
ACW101 commented Oct 8, 2025

Collaborator Author

/gcbrun

ACW101 merged commit 1f28255 into GoogleCloudPlatform:develop Oct 9, 2025
11 of 70 checks passed

Labels

release-improvements Added to release notes under the "Improvements" heading.

5 participants