Integrating Managed Lustre in TPU v6e #4814

Merged
shubpal07 merged 1 commit into GoogleCloudPlatform:develop from shubpal07:shubham/managed-lustre-tpu on Nov 4, 2025

Conversation

@shubpal07 shubpal07 commented Nov 3, 2025

Contributor

Description

This pull request enhances the gke-tpu-v6-advanced blueprint by integrating Google Cloud Managed Lustre as an optional storage solution. Large-scale TPU workloads are often bottlenecked by storage I/O. Managed Lustre addresses this with a parallel file system that delivers high throughput and low latency, which helps keep TPUs busy instead of waiting on data.

Changes Included

  • Added commented-out variables for Lustre configuration to the blueprint vars.
  • Included new, commented-out modules for private_service_access, lustre_firewall_rule, managed-lustre, and lustre-pv.
  • Enabled the enable_managed_lustre_csi flag on the GKE cluster module by default.
  • Updated the README.md with a comprehensive guide on enabling and using the new Managed Lustre integration.

Validation

  • Manual end-to-end testing was performed to validate the integration, confirming that a pod could successfully mount the Lustre filesystem and perform I/O operations on it.
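
For reviewers who want a picture of the wiring, here is a rough sketch of how the changes listed above might look once the Lustre sections are uncommented. Only the module IDs and the enable_managed_lustre_csi flag come from this PR. The module source paths, the setting names, the variable names (lustre_instance_id, lustre_size_gib), and the network and cluster module IDs (tpu-net, tpu-cluster) are assumptions for illustration and should be checked against the blueprint and each module's README.

```yaml
# Sketch only: IDs of the four Lustre modules and the CSI flag are from this PR;
# everything else below is an assumed, illustrative value.
vars:
  lustre_instance_id: my-lustre      # hypothetical variable name and value
  lustre_size_gib: 18000             # hypothetical variable name and value

deployment_groups:
- group: primary
  modules:
  - id: tpu-net                      # hypothetical ID for the existing network module
    source: modules/network/vpc      # assumed source path

  - id: private_service_access
    source: modules/network/private-service-access   # assumed source path
    use: [tpu-net]

  - id: lustre_firewall_rule
    source: modules/network/firewall-rules            # assumed source path
    use: [tpu-net]
    settings:
      ingress_rules:                 # assumed setting shape
      - name: allow-lustre-traffic
        source_ranges:
        - $(private_service_access.cidr_range)        # assumed output name
        allow:
        - protocol: tcp
          ports: ["988"]             # Lustre LNet uses TCP port 988

  - id: managed-lustre
    source: modules/file-system/managed-lustre        # assumed source path
    use: [tpu-net, private_service_access]
    settings:                        # assumed setting names
      name: $(vars.lustre_instance_id)
      size_gib: $(vars.lustre_size_gib)
      local_mount: /lustre
      remote_mount: lustrefs

  - id: tpu-cluster                  # hypothetical ID for the existing GKE cluster module
    source: modules/scheduler/gke-cluster             # assumed source path
    use: [tpu-net]
    settings:
      enable_managed_lustre_csi: true   # flag enabled by default in this PR

  - id: lustre-pv
    source: modules/file-system/gke-persistent-volume # assumed source path
    use: [managed-lustre, tpu-cluster]
    # required settings omitted; see the module's README
```

The manual validation could be reproduced with a pod along these lines. It is a generic Kubernetes manifest, not the exact one used in testing; the claim name lustre-pvc is a placeholder for whatever PersistentVolumeClaim the lustre-pv module creates.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: lustre-io-check
spec:
  restartPolicy: Never
  containers:
  - name: io-check
    image: busybox:1.36
    # Write a small file to the Lustre mount, then read it back.
    command: ["sh", "-c", "echo hello > /mnt/lustre/probe.txt && cat /mnt/lustre/probe.txt"]
    volumeMounts:
    - name: lustre-vol
      mountPath: /mnt/lustre
  volumes:
  - name: lustre-vol
    persistentVolumeClaim:
      claimName: lustre-pvc          # placeholder claim name
```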

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@shubpal07 shubpal07 self-assigned this Nov 3, 2025
@shubpal07 shubpal07 added release-key-new-features Added to release notes under the "Key New Features" heading. release-improvements Added to release notes under the "Improvements" heading. labels Nov 3, 2025
@shubpal07 shubpal07 marked this pull request as ready for review November 3, 2025 11:51
@shubpal07 shubpal07 requested review from a team and samskillman as code owners November 3, 2025 11:51
@shubpal07

Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request integrates Google Cloud Managed Lustre into the gke-tpu-v6-advanced blueprint, providing an optional high-performance storage solution for TPU workloads. The changes include adding commented-out configurations for Lustre, enabling the Lustre CSI driver, and updating the README with usage instructions. The implementation also introduces a multi-network setup, which is a prerequisite.

My review focuses on improving the clarity and accuracy of the documentation and enhancing the maintainability of the YAML configuration. I've suggested corrections to the README to ensure the instructions are clear and accurate, and pointed out an opportunity to reduce code duplication in the blueprint YAML file.

Comment thread community/examples/gke-tpu-v6/README.md Outdated
Comment thread community/examples/gke-tpu-v6/README.md Outdated
Comment thread community/examples/gke-tpu-v6/README.md Outdated
Comment thread community/examples/gke-tpu-v6/gke-tpu-v6-advanced.yaml
@shubpal07 shubpal07 force-pushed the shubham/managed-lustre-tpu branch from c9bbb2a to 318b91d Compare November 4, 2025 09:24

@parulbajaj01 parulbajaj01 left a comment

Contributor

LGTM

@shubpal07 shubpal07 merged commit 8c7c3a9 into GoogleCloudPlatform:develop Nov 4, 2025
14 of 68 checks passed

Labels

release-improvements Added to release notes under the "Improvements" heading. release-key-new-features Added to release notes under the "Key New Features" heading.


2 participants