
Allow parallel containers for TPU7x #5612

Merged
Neelabh94 merged 1 commit into GoogleCloudPlatform:develop from Neelabh94:parallel_container
May 7, 2026

@Neelabh94 Neelabh94 commented May 7, 2026

This PR makes GKE job submission for TPU 7x create two parallel containers by default.

It also adds a --gke-disable-parallel-containers flag to disable this feature if required.

Highlights

  • Parallel Container Support: Enabled the creation of two parallel containers by default for TPU v7 and v7x GKE job submissions to improve workload efficiency.
  • New Configuration Flag: Introduced the --gke-disable-parallel-containers flag, which lets users opt out of the parallel container feature if necessary.
  • Resource Management: Updated resource calculation logic to correctly split TPU resources across parallel containers.
  • Documentation Updates: Updated the job guide to include the new flag and provided usage examples for disabling parallel containers.
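
As a rough illustration of the "Resource Management" item above, the split of a VM's TPU chips across parallel containers might look like the sketch below. The helper name `splitTPULimit`, the 4-chips-per-VM figure, and the rule of rejecting uneven splits are assumptions made for this example and are not taken from the PR's code; only the `google.com/tpu` resource name and the default of two containers come from GKE and the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// tpuResourceName is the extended resource name GKE uses for TPU chips.
const tpuResourceName = "google.com/tpu"

// splitTPULimit is a hypothetical helper (not the PR's actual function).
// It returns the TPU limit each container should request when the chips
// of one VM are shared evenly by `containers` parallel containers.
func splitTPULimit(chipsPerVM, containers int) (int, error) {
	if containers <= 0 {
		return 0, errors.New("container count must be positive")
	}
	if chipsPerVM <= 0 {
		return 0, errors.New("chips per VM must be positive")
	}
	if chipsPerVM%containers != 0 {
		// Assumption: an uneven split is treated as an error rather than rounded.
		return 0, fmt.Errorf("%d chips cannot be split evenly across %d containers", chipsPerVM, containers)
	}
	return chipsPerVM / containers, nil
}

func main() {
	// Two containers is the new default; passing the disable flag would
	// correspond to a single container here.
	containers := 2
	chipsPerVM := 4 // assumed value, for illustration only

	perContainer, err := splitTPULimit(chipsPerVM, containers)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s limit per container: %d\n", tpuResourceName, perContainer)
}
```

With the feature disabled, the same helper called with one container would simply return the full per-VM chip count.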

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances GKE job submissions for TPU v7 and v7x by defaulting to a parallel container configuration. This change optimizes resource utilization for these specific TPU architectures. The implementation includes necessary updates to the orchestrator, resource resolution logic, and job templates, while providing a safety mechanism via a new CLI flag to disable the feature if required.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for parallel containers on GKE, primarily for TPU v7/v7x workloads, by adding a --gke-disable-parallel-containers flag and updating the JobSet template to handle multiple containers per VM. The changes include logic to split TPU resource limits and provide corresponding documentation and unit tests. Review feedback suggests making the resource division logic more generic to include CPU and GPU limits and replacing fragile string-based accelerator identification with a more robust fail-fast mechanism.
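
The review feedback summarized above can be sketched roughly as follows: divide every resource limit (CPU, GPU, TPU) with one generic routine, and identify the accelerator through a typed value that fails fast on anything unrecognized instead of matching on substrings. All identifiers here (`acceleratorKind`, `resourceKey`, `divideLimits`) are hypothetical and are not the names used in `resource_resolver.go`; expressing CPU in millicores and rejecting uneven splits are also assumptions of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// acceleratorKind is a hypothetical typed identifier that replaces
// string matching on accelerator names.
type acceleratorKind int

const (
	acceleratorNone acceleratorKind = iota
	acceleratorTPU
	acceleratorGPU
)

// resourceKey maps an accelerator kind to its Kubernetes resource name
// and fails fast on any kind it does not recognize.
func resourceKey(kind acceleratorKind) (string, error) {
	switch kind {
	case acceleratorTPU:
		return "google.com/tpu", nil
	case acceleratorGPU:
		return "nvidia.com/gpu", nil
	default:
		return "", fmt.Errorf("unsupported accelerator kind %d", kind)
	}
}

// divideLimits splits every per-VM limit evenly across the given number
// of containers. It returns an error instead of silently rounding.
func divideLimits(perVM map[string]int64, containers int64) (map[string]int64, error) {
	if containers < 1 {
		return nil, fmt.Errorf("container count must be at least 1, got %d", containers)
	}
	out := make(map[string]int64, len(perVM))
	for name, total := range perVM {
		if total%containers != 0 {
			return nil, fmt.Errorf("limit %q (%d) is not divisible by %d", name, total, containers)
		}
		out[name] = total / containers
	}
	return out, nil
}

func main() {
	key, err := resourceKey(acceleratorTPU)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// CPU is expressed in millicores; all values are illustrative.
	perVM := map[string]int64{"cpu": 8000, key: 4}

	perContainer, err := divideLimits(perVM, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Sort the names so the printed order is stable.
	names := make([]string, 0, len(perContainer))
	for name := range perContainer {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("%s=%d\n", name, perContainer[name])
	}
}
```

Treating every limit uniformly keeps the split logic independent of which accelerator is attached, which is the direction the review appears to suggest.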

Comment thread pkg/orchestrator/gke/resource_resolver.go
Comment thread pkg/orchestrator/gke/resource_resolver.go Outdated
@Neelabh94 Neelabh94 force-pushed the parallel_container branch 2 times, most recently from 91a71e8 to 290357f May 7, 2026 05:34
@Neelabh94

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the job submission process by replacing AcceleratorType with ComputeType and MachineType, and splitting numSlicesOrNodes into numNodes and numSlices. It introduces support for parallel containers on GKE, updates the orchestrator to handle hardware requirements more dynamically with caching, and updates documentation and tests accordingly. I have no feedback to provide as there were no review comments.
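
To make the refactor described above more concrete, here is a minimal sketch of a job request that carries `ComputeType`, `MachineType`, `numNodes`, and `numSlices` as separate fields, together with a small cache for hardware requirements. The field names follow the review text, but the struct layout, the `requirementsCache` type, the `example-machine-type` placeholder, and the resolver signature are assumptions for illustration and do not reflect the PR's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// jobRequest is a hypothetical request shape after the refactor: the
// compute and machine types are separate, as are node and slice counts.
type jobRequest struct {
	ComputeType string
	MachineType string
	NumNodes    int
	NumSlices   int
}

// hardwareRequirements holds what the orchestrator needs to know about a
// machine type. Only one illustrative field is modeled here.
type hardwareRequirements struct {
	ChipsPerVM int
}

// requirementsCache memoizes lookups per machine type so repeated
// submissions do not resolve the same hardware details again.
type requirementsCache struct {
	mu      sync.Mutex
	entries map[string]hardwareRequirements
	resolve func(machineType string) (hardwareRequirements, error)
}

func newRequirementsCache(resolve func(string) (hardwareRequirements, error)) *requirementsCache {
	return &requirementsCache{
		entries: make(map[string]hardwareRequirements),
		resolve: resolve,
	}
}

// get returns the cached requirements or resolves and stores them.
// Failed lookups are not cached, so they can be retried.
func (c *requirementsCache) get(machineType string) (hardwareRequirements, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if req, ok := c.entries[machineType]; ok {
		return req, nil
	}
	req, err := c.resolve(machineType)
	if err != nil {
		return hardwareRequirements{}, err
	}
	c.entries[machineType] = req
	return req, nil
}

func main() {
	cache := newRequirementsCache(func(machineType string) (hardwareRequirements, error) {
		// Stand-in resolver: a real one would query the cluster or a catalog.
		return hardwareRequirements{ChipsPerVM: 4}, nil
	})

	req := jobRequest{ComputeType: "tpu", MachineType: "example-machine-type", NumNodes: 2, NumSlices: 1}
	hw, err := cache.get(req.MachineType)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d slice(s), %d node(s), %d chips per VM\n", req.NumSlices, req.NumNodes, hw.ChipsPerVM)
}
```

Keeping node and slice counts in separate fields avoids the ambiguity of a single value whose meaning depends on the accelerator type.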

@Neelabh94 Neelabh94 force-pushed the parallel_container branch 2 times, most recently from 52ddf45 to 08a01ea May 7, 2026 11:45
@Neelabh94 Neelabh94 force-pushed the parallel_container branch from 08a01ea to b7c72a0 May 7, 2026 12:27
@Neelabh94 Neelabh94 marked this pull request as ready for review May 7, 2026 12:27
@Neelabh94 Neelabh94 requested a review from a team as a code owner May 7, 2026 12:27
@Neelabh94 Neelabh94 added the release-key-new-features label (Added to release notes under the "Key New Features" heading.) May 7, 2026
@Neelabh94 Neelabh94 enabled auto-merge (squash) May 7, 2026 12:30
@Neelabh94 Neelabh94 merged commit 07051fa into GoogleCloudPlatform:develop May 7, 2026
17 of 86 checks passed
@Neelabh94 Neelabh94 deleted the parallel_container branch May 7, 2026 12:59
@Neelabh94 Neelabh94 changed the title Allow parallel containers for TPU7 and TPU7x Allow parallel containers for TPU7x May 7, 2026
