Add slurm-gke blueprint #4607
This PR includes using the existing NFS server on the controller to distribute the Slurm key to the nodeset running on GKE. If this is accepted, we can close #4562, as it is no longer needed.
/gcbrun
nick-stroud
left a comment
Only partial review. I will try to send more by 6pm PT.
9ab8f60 to 9ebb0fe
/gcbrun
I deployed the blueprint from this PR. After a few minutes, the GKE-based nodes went into the DOWN state. The slurmd log shows:
If nodes go down for this reason, then we need our own build of the slurmd container with SlurmUser=401.
This was caused by a UID mismatch in the public slinky image. A custom-built image with the correct UID is required to fix this. I will open a follow-up PR with instructions for building this image.
From time to time the deployment fails, with pods stuck in the init stage with the message:
Also, ./gcluster deploy sometimes has to be run twice, because the first run fails with the error message below:
066b753
/gcbrun
/gcbrun
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.