Add GKE DWS Flex Start example and integration test #3632
| @@ -0,0 +1,108 @@ | |||
| # Obtaining GKE nodes with DWS Flex | |||
Update to reflect the presence of a blueprint specifically for DWS Flex
Rewrite the README to tell users how to:
- Deploy the blueprint
- Run a sample job
Add a notes section that highlights:
- Necessary parameters and conditions on the node pool
- Where to find the Kueue configuration and how to modify it
- How to write a job and what it should contain
| outputs: [instructions] | ||
| ``` | ||
| **Step 2**: Create the Kueue resources for the DWS node pool. |
Reference the file here instead; duplicating its contents in the README is a maintenance overhead.
| ``` | ||
| **Step 3**: The jobset needs the following additions. | ||
| (a) Include the label and annotation under the jobset metadata. |
Point to the sample job YAML file here as well.
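For reference, a minimal sketch of what such a sample job file might contain. It assumes the label in (a) is the Kueue queue-name label and the annotation is the provisioning-request run-duration annotation; the PR excerpt does not show either. The JobSet name, the LocalQueue name `dws-local-queue`, the image, and the 600-second duration are hypothetical placeholders, not values from this PR.

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: dws-sample-job                      # hypothetical name
  labels:
    # Routes the JobSet to a Kueue LocalQueue (queue name is an assumption)
    kueue.x-k8s.io/queue-name: dws-local-queue
  annotations:
    # Upper bound on run time for the provisioned nodes (value is illustrative)
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"
spec:
  replicatedJobs:
  - name: worker
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: busybox                # placeholder image
              command: ["sh", "-c", "echo hello from DWS && sleep 30"]
```

A real job for the DWS node pool would likely also need accelerator resource requests and a node selector for that pool; those are left out here because they depend on the blueprint.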
| ``` | ||
| > [!NOTE] | ||
| > The jobset resource requests and limits must be aligned with the resources under ClusterQueue (Kueue resource). |
change to "available under"
updated
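To make the alignment in the note concrete, here is a hedged sketch of a ClusterQueue next to a matching container `resources` excerpt. The queue name, flavor name, admission check name, the `google.com/tpu` resource, and the quota of 4 are all assumptions for illustration; the actual values live in the Kueue configuration file of this PR.

```yaml
# Kueue side: the ClusterQueue declares which resources it covers and how much.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: dws-cluster-queue                   # hypothetical name
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources: ["google.com/tpu"]    # assumed accelerator resource
    flavors:
    - name: default-flavor                  # hypothetical ResourceFlavor
      resources:
      - name: "google.com/tpu"
        nominalQuota: 4                     # illustrative quota
  admissionChecks:
  - dws-prov                                # hypothetical AdmissionCheck name
---
# Job side: excerpt of a container spec inside the JobSet pod template.
# It requests only resources the ClusterQueue covers, within the quota.
resources:
  requests:
    google.com/tpu: 4
  limits:
    google.com/tpu: 4
```

As far as I understand Kueue, a workload that requests a resource missing from `coveredResources`, or more than the available quota, is not admitted and stays pending, which is why the two sides need to agree.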
| source: modules/management/kubectl-apply | ||
| use: [gke_cluster] | ||
| settings: | ||
| kueue: |
Set dws-queues.yaml as the Kueue configuration.
Pass the parameter for #num_chips
Updated
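A rough sketch of what the two suggestions above could look like in the blueprint. The module `id`, the `num_chips` value, and the relative path are hypothetical. The `install`, `config_path`, and `config_template_vars` field names are my recollection of the `kubectl-apply` module inputs and should be checked against the module README.

```yaml
- id: workload_manager_install              # hypothetical module id
  source: modules/management/kubectl-apply
  use: [gke_cluster]
  settings:
    kueue:
      install: true
      # Use the DWS queue definitions as the Kueue configuration
      config_path: $(ghpc_stage("./dws-queues.yaml"))
      # Values substituted into the configuration (assumed variable name)
      config_template_vars:
        num_chips: 4                        # illustrative value
```

With the configuration applied by the module, the separate test step that creates the DWS queues would no longer be needed.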
| delegate_to: localhost | ||
| ansible.builtin.command: gcloud container clusters get-credentials {{ deployment_name }} --region {{ region }} --project {{ custom_vars.project }} | ||
| - name: Create the dws queues |
Remove this once the Kueue config argument is used.
This shortens the test duration and resolves compatibility issues between the spack-installed MPI/Slurm and the existing Slurm environment.
Co-authored-by: Tom Downes <tpdownes@users.noreply.github.com>
Also brings up to date with gke-a3-ultragpu
Add a GKE DWS Flex Start example with a README and an integration test.
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.