Allow parallel containers for TPU7x #5612
Summary of Changes
This pull request enhances GKE job submissions for TPU v7 and v7x by defaulting to a parallel container configuration. This change optimizes resource utilization for these specific TPU architectures. The implementation includes necessary updates to the orchestrator, resource resolution logic, and job templates, while providing a safety mechanism via a new CLI flag to disable the feature if required.
Code Review
This pull request introduces support for parallel containers on GKE, primarily for TPU v7/v7x workloads, by adding a --gke-disable-parallel-containers flag and updating the JobSet template to handle multiple containers per VM. The changes include logic to split TPU resource limits and provide corresponding documentation and unit tests. Review feedback suggests making the resource division logic more generic to include CPU and GPU limits and replacing fragile string-based accelerator identification with a more robust fail-fast mechanism.
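The review's suggestion of a generic, fail-fast resource split can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `split_limits` and the example quantities are hypothetical, and the sketch assumes every limit is an integer count that should be divided evenly across the containers on one VM.

```python
from typing import Dict


def split_limits(limits: Dict[str, int], num_containers: int) -> Dict[str, int]:
    """Divide per-VM resource limits evenly across parallel containers.

    Hypothetical helper: treats every resource (TPU, CPU, GPU, ...) the same
    way instead of special-casing accelerators by name, and raises as soon as
    a limit cannot be divided evenly rather than silently rounding.
    """
    if num_containers < 1:
        raise ValueError("num_containers must be at least 1")

    per_container: Dict[str, int] = {}
    for name, total in limits.items():
        if total % num_containers != 0:
            # Fail fast: an uneven split would leave resources unassigned.
            raise ValueError(
                f"limit {name}={total} is not divisible by {num_containers}"
            )
        per_container[name] = total // num_containers
    return per_container


if __name__ == "__main__":
    # Illustrative values only; real per-VM limits depend on the machine type.
    print(split_limits({"google.com/tpu": 4, "cpu": 8}, 2))
```

Raising on an uneven split keeps a misconfigured job from being submitted with fewer resources than requested, which is the behavior the review appears to be asking for.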
91a71e8 to 290357f
/gemini review
Code Review
This pull request refactors the job submission process by replacing AcceleratorType with ComputeType and MachineType, and splitting numSlicesOrNodes into numNodes and numSlices. It introduces support for parallel containers on GKE, updates the orchestrator to handle hardware requirements more dynamically with caching, and updates documentation and tests accordingly. I have no feedback to provide as there were no review comments.
52ddf45 to 08a01ea
08a01ea to b7c72a0
This PR enables TPU 7x GKE job submission to create two parallel containers by default.
It also adds a flag, --gke-disable-parallel-container, to disable this feature if required.
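As a rough sketch of the default-on, opt-out behavior described above, the example below uses Python's argparse. It is not the project's actual CLI code: the helper names `build_parser` and `containers_per_vm` and the boolean `is_tpu7x` are assumptions made for illustration, and the flag is spelled as it appears in this description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing only the opt-out flag from this PR."""
    parser = argparse.ArgumentParser(description="GKE job submission (sketch)")
    parser.add_argument(
        "--gke-disable-parallel-container",
        action="store_true",
        help="Run a single container per VM instead of parallel containers.",
    )
    return parser


def containers_per_vm(args: argparse.Namespace, is_tpu7x: bool) -> int:
    """Return 2 for TPU 7x unless the user opted out, otherwise 1.

    `is_tpu7x` is an assumed input; how the real code detects the machine
    type is not shown in this description.
    """
    if is_tpu7x and not args.gke_disable_parallel_container:
        return 2
    return 1


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(containers_per_vm(args, is_tpu7x=True))
```

Using a `store_true` flag keeps the parallel configuration as the default and makes disabling it an explicit choice on the command line.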
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.