add gke dws flex start example and integration test #3632

Closed. SwarnaBharathiMantena wants to merge 30 commits into GoogleCloudPlatform:develop from SwarnaBharathiMantena:swarnabm/gke_dws_flex_start_example_integration_test

@SwarnaBharathiMantena

Contributor

Add GKE DWS Flex Start example with a README, and an integration test.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@SwarnaBharathiMantena added the release-key-new-features label (Added to release notes under the "Key New Features" heading.) on Feb 3, 2025
@@ -0,0 +1,108 @@
# Obtaining GKE nodes with DWS Flex

Contributor

Update to reflect the presence of a blueprint specifically for DWS Flex

Contributor

Rewrite the README to tell users how to:

  1. Deploy the blueprint
  2. Run a sample job

Add a notes section that highlights:

  1. Necessary parameters and conditions on the node pool
  2. Where to find the Kueue configuration and how to modify it
  3. How to write a job and what it should contain

Comment thread examples/gke-dws-flex-start/README.md Outdated
outputs: [instructions]
```

**Step 2**: Create the Kueue resources for the DWS node pool.

Contributor

Reference the file here instead, since inlining its contents is a maintenance overhead.

Comment thread examples/gke-dws-flex-start/README.md Outdated
```

**Step 3**: The jobset needs the following additions.
(a) Include the label and annotation under the jobset metadata.
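
The README text under review does not show which label and annotation Step 3 refers to, so the following is only a minimal sketch under assumptions: it uses Kueue's standard `kueue.x-k8s.io/queue-name` label and the provisioning-request `provreq.kueue.x-k8s.io/maxRunDurationSeconds` annotation. The JobSet name, the queue name `dws-local-queue`, the run duration, and the container are all hypothetical placeholders, and accelerator resource requests are deliberately left out.

```yaml
# Hypothetical JobSet sketch; every name and value below is a placeholder.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: dws-flex-sample                            # hypothetical name
  labels:
    # Assumed to match a LocalQueue defined in the Kueue configuration.
    kueue.x-k8s.io/queue-name: dws-local-queue     # assumed queue name
  annotations:
    # Assumed maximum run time in seconds, given as a string.
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "600"
spec:
  replicatedJobs:
    - name: worker
      template:
        spec:
          parallelism: 1
          completions: 1
          backoffLimit: 0
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: main
                  image: busybox                   # placeholder image
                  command: ["sh", "-c", "echo hello"]
```

Placing the label and annotation on the JobSet's own metadata, rather than on the pod template, matches the wording of Step 3; the actual values should come from the example's Kueue configuration and sample job file.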

Contributor

Point to the sample job YAML file here as well.

Comment thread examples/gke-dws-flex-start/README.md Outdated
```

> [!NOTE]
> The jobset resource requests and limits must be aligned with the resources under ClusterQueue (Kueue resource).

Contributor

Change to "available under".

Contributor Author

updated

Comment thread examples/gke-dws-flex-start/gke-dws-flex-start.yaml
Comment thread examples/gke-dws-flex-start/dws-queues.yaml Outdated
source: modules/management/kubectl-apply
use: [gke_cluster]
settings:
  kueue:

Contributor

Set dws-queues.yaml as the Kueue configuration.

Contributor

Pass the parameter for #num_chips

Contributor Author

Updated

  delegate_to: localhost
  ansible.builtin.command: gcloud container clusters get-credentials {{ deployment_name }} --region {{ region }} --project {{ custom_vars.project }}

- name: Create the dws queues

Contributor

Remove this once the Kueue config argument is used.

@SwarnaBharathiMantena requested review from annuay-google and removed request for annuay-google on February 4, 2025 03:18
@SwarnaBharathiMantena deleted the swarnabm/gke_dws_flex_start_example_integration_test branch on March 28, 2025 07:08