Adding GKE TPU DWS Queued Provisioning support for v6e and 7x #5218
Summary of Changes

Hello @shubpal07, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive support for GKE TPU Dynamic Workload Scheduler (DWS) Flex Start with Queued Provisioning for TPU v6e and 7x hardware. The changes enable large-scale training jobs to start only once the full required topology is secured, improving resource management and job reliability. This involves adding new examples, refining core GKE modules, standardizing TPU configurations across the repository, and updating documentation to guide users through the new capabilities.
Code Review
This pull request introduces significant new capabilities by adding support for GKE TPU DWS with Queued Provisioning for v6e and 7x hardware. The changes are extensive, including new blueprints, examples, and documentation, as well as thoughtful refactoring of core modules like gke-node-pool and tpu-definition to standardize TPU configuration and improve maintainability. While the core module changes are excellent, the new documentation and example files have several minor issues, including unresolved template variables, duplicated instructions, and inconsistent resource naming. Addressing these documentation and consistency issues, as detailed in the review comments, will significantly improve the usability and clarity of this new feature.
d5fcf09 to 9726971
SwarnaBharathiMantena left a comment
Since the examples/README.md file lists all existing examples, I think it would help to highlight these new blueprints there as well: https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples#gke-consumption-options-
Maybe add a statement noting that this folder includes A3U, TPU v6e, and TPU 7x examples.
Babysit tests results:
Thanks for mentioning @SwarnaBharathiMantena. Agree.
…er toolkit
Change-Id: I1f8e2443f5e16b5ceb07ac04c0257164766a2bf2
Change-Id: I4a08edd537023e4a483106382dfe0af1d5e7b51a
Change-Id: I1a7e5b00a8c684e3158e05e9f3b26f06cb29aa0b
Adding example/readme changes
Change-Id: Ie73fa56436d9a8d17cffbdf572c94e3ae1d1eab2
Change-Id: Ide0ae3a0b78753048cd359192eeed18b8479ae37
c80af48 to 58a864a
Change-Id: I5aad03dc9ef9ed455de497aa3df3b1071d3413ea
bf7c3e1 into GoogleCloudPlatform:develop
This PR implements and standardizes support for GKE TPU Dynamic Workload Scheduler (DWS) Flex Start with Queued Provisioning (QP). It enables queued provisioning for TPU v6e and TPU 7x hardware, ensuring large-scale training jobs only start when the full required topology is secured.
Key Changes
1. New Blueprints and Examples
- Added dedicated QP blueprints for TPU v6e and TPU 7x under examples/gke-consumption-options/dws-flex-start-queued-provisioning/.
- Created a specialized Kueue template (tpu-dws-queues.yaml.tftpl) to manage ProvisioningRequestConfig and AdmissionCheck for TPU resources.
- Included E2E test jobs (JobSets) for both hardware generations with correct annotations (maxRunDurationSeconds), tolerations, and node selectors.
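As a rough illustration of what such a Kueue template could render to, here is a minimal sketch of a ProvisioningRequestConfig and the AdmissionCheck that points at it. It is not the contents of tpu-dws-queues.yaml.tftpl: the resource names (tpu-dws-config, tpu-dws-prov) are hypothetical, and the sketch assumes the v1beta1 Kueue API and the GKE queued-provisioning class.

```yaml
# Sketch only: metadata names are hypothetical, not taken from the PR.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: tpu-dws-config
spec:
  # GKE provisioning class used for DWS queued provisioning.
  provisioningClassName: queued-provisioning.gke.io
  # Only workloads requesting this resource trigger a ProvisioningRequest.
  managedResources:
  - google.com/tpu
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: tpu-dws-prov
spec:
  # Kueue's built-in ProvisioningRequest admission check controller.
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: tpu-dws-config
```

A ClusterQueue would then list the AdmissionCheck name under spec.admissionChecks, so that workloads are admitted only after the provisioning request for the full capacity succeeds.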
2. Core Module Enhancements
gke-node-pool:
- … "true" label. While GKE manages the taint, the label is critical for nodeSelectors in JobSets to reliably target provisioned resources.
- … placement specifically for TPUs when using Queued Provisioning.
tpu-definition:
- … labeling.
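To show how a node label, a toleration, and the maxRunDurationSeconds annotation might fit together in a JobSet, here is a hedged sketch. Everything specific in it is an assumption for illustration: the JobSet and LocalQueue names, the 4x4 v6e topology (four hosts with four chips each), the container image, and QUEUED_LABEL_KEY, which is a placeholder for the label key that gke-node-pool sets (this description does not show it).

```yaml
# Sketch only: names, topology, and image are illustrative assumptions.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: tpu-v6e-qp-example                        # hypothetical
  labels:
    kueue.x-k8s.io/queue-name: dws-local-queue    # hypothetical LocalQueue
  annotations:
    # Passed by Kueue to the ProvisioningRequest; caps how long nodes are held.
    provreq.kueue.x-k8s.io/maxRunDurationSeconds: "3600"
spec:
  replicatedJobs:
  - name: worker
    replicas: 1
    template:
      spec:
        parallelism: 4      # one pod per TPU host in the assumed 4x4 slice
        completions: 4
        backoffLimit: 0
        template:
          spec:
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: 4x4
              # Placeholder: replace with the label key set by gke-node-pool.
              QUEUED_LABEL_KEY: "true"
            tolerations:
            # GKE taints TPU nodes with google.com/tpu=present:NoSchedule.
            - key: google.com/tpu
              operator: Exists
              effect: NoSchedule
            containers:
            - name: main
              image: python:3.11-slim
              command: ["python", "-c", "print('hello from a TPU host')"]
              resources:
                limits:
                  google.com/tpu: 4
```

The node selector is what keeps the pods off nodes that were not provisioned for this request, which is why a stable label matters even though the taint itself is managed by GKE.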
4. Documentation
… including deployment steps, custom job requirements, and scale-up/down verification.
Verification Results
- … TPU v6e and 7x cluster.
- … Cluster Autoscaler scale-up (0 -> 4 nodes) -> Job execution -> Job completion -> Automatic scale-down to zero.