Make Managed lustre default in A3u and A3m series Slurm blueprints #5396

Merged
saara-tyagi27 merged 12 commits into GoogleCloudPlatform:develop from saara-tyagi27:filesystem-a-series on May 25, 2026

Conversation

@saara-tyagi27 commented Mar 25, 2026

Contributor

Summary

This PR replaces Google Cloud Filestore with Managed Lustre as the default shared filesystem in the a3m, a3u and a4h machine learning blueprints.

Key Changes:

  • Storage Migration: Swapped the filestore module for managed-lustre and added the necessary private-service-access networking module.
  • Default Configuration: Enabled Managed Lustre installation by default ("install_managed_lustre": true).
  • Test Updates: Updated daily Cloud Build test tags to reflect the filesystem change.
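
For readers unfamiliar with the blueprint layout, the storage migration described above might look roughly like the fragment below. This is a sketch, not the diff from this PR. The module ID `homefs`, the `modules/file-system/managed-lustre` source, the `private_service_access` module and the `lustre_instance_id`, `lustre_size_gib` and `per_unit_storage_throughput` variables are named in this conversation. The network module ID `cluster-net-0`, the `private-service-access` source path, the setting names under `settings:` and every value are assumptions to check against the module READMEs. The `install_managed_lustre` flag is left out because the PR description does not say where it is set.

```yaml
# Hypothetical blueprint fragment; IDs, setting names and values are illustrative.
vars:
  lustre_instance_id: lustre-instance   # placeholder instance ID
  lustre_size_gib: 36000                # placeholder capacity
  per_unit_storage_throughput: 500      # placeholder throughput tier

deployment_groups:
- group: primary
  modules:
  # Assumed network module ID; the real blueprints define their own networks.
  - id: cluster-net-0
    source: modules/network/vpc

  # Private services access for the network (source path is an assumption).
  - id: private_service_access
    source: community/modules/network/private-service-access
    use: [cluster-net-0]

  # /home served by Managed Lustre instead of Filestore.
  - id: homefs
    source: modules/file-system/managed-lustre
    use: [cluster-net-0, private_service_access]
    settings:
      name: $(vars.lustre_instance_id)
      local_mount: /home
      size_gib: $(vars.lustre_size_gib)
      per_unit_storage_throughput: $(vars.per_unit_storage_throughput)
```

Keeping the module ID `homefs` unchanged means modules that already list it under `use` would not need to be touched when the source is swapped.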

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request transitions the default shared file system for the A3 UltraGPU and A4 HighGPU SLURM blueprints from Filestore to Managed Lustre. This change aims to leverage the performance benefits of Managed Lustre for high-performance computing workloads by updating the blueprint configurations to enable and utilize the new file system, ensuring a more optimized storage solution for these environments.

Highlights

  • Managed Lustre Integration: The A3 UltraGPU and A4 HighGPU SLURM blueprints have been updated to use Managed Lustre as the default shared file system for /home, replacing Filestore.
  • Configuration Updates: Variables related to Filestore IP ranges were removed, and Managed Lustre specific variables (instance ID, size, throughput) were enabled and configured in the A4 HighGPU blueprint.
  • Module Switching: The homefs module in both blueprints was switched from modules/file-system/filestore to modules/file-system/managed-lustre, and the private_service_access module was explicitly enabled.

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request updates the a3ultra-slurm-blueprint.yaml and a4high-slurm-blueprint.yaml files to transition from using Filestore for /home to Managed Lustre. This involves removing the filestore_ip_range variable, setting install_managed_lustre to true, switching the homefs module source to managed-lustre, and configuring its settings. For a4high-slurm-blueprint.yaml, the Managed Lustre-related variables (lustre_instance_id, lustre_size_gib, per_unit_storage_throughput) are uncommented and initialized. Feedback from the review indicates that in a3ultra-slurm-blueprint.yaml, the Managed Lustre variables are still commented out, which will cause the blueprint to fail. Additionally, obsolete comments related to filestore_ip_range and Managed Lustre instructions in a4high-slurm-blueprint.yaml should be removed to improve clarity and maintainability.

Comment thread examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml
Comment thread examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml Outdated
@saara-tyagi27 changed the title from "use lustre in a3u a4h" to "Make lustre default in A series blueprints" on Mar 25, 2026
@saara-tyagi27

Contributor Author

/gemini review

@gemini-code-assist

Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@saara-tyagi27

Contributor Author

/gemini review

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request updates several machine learning blueprints (A3 HighGPU, A3 MegaGPU, A3 UltraGPU, A4 HighGPU, A4x HighGPU) to replace Filestore with Managed Lustre for shared home directories. The changes involve updating blueprint variables, switching the file system module, adding private service access, and enabling Managed Lustre installation. A consistent improvement opportunity was identified across all updated blueprints: the lustre_instance_id is currently a static string, which could cause conflicts upon multiple deployments. It is recommended to incorporate the deployment_name variable to ensure unique instance IDs.
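
One way to follow the recommendation above is to derive the instance ID from the deployment name, so that two deployments do not request the same Managed Lustre instance ID. The fragment below is a minimal sketch: `deployment_name` and `lustre_instance_id` are the variables named in the review, while the example deployment name and the `-lustre` suffix are hypothetical. The result should still be checked against the service's naming rules for instance IDs.

```yaml
vars:
  deployment_name: a3u-demo   # hypothetical deployment name

  # Static ID: a second deployment could request the same instance ID.
  # lustre_instance_id: lustre-instance

  # ID derived from the deployment name (the suffix is illustrative).
  lustre_instance_id: $(vars.deployment_name)-lustre
```

With this form, changing `deployment_name` for a second cluster changes the instance ID as well, without any further edits to the blueprint.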

Comment thread examples/machine-learning/a3-highgpu-8g/a3high-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a3-ultragpu-8g/a3ultra-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4-highgpu-8g/a4high-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4x-highgpu-4g/a4xhigh-slurm-blueprint.yaml Outdated
@saara-tyagi27

Contributor Author

/gcbrun

@saara-tyagi27

Contributor Author

/gemini review

@gemini-code-assist (bot) left a comment

Contributor

Code Review

This pull request replaces Filestore with Managed Lustre across several A3 and A4 Slurm blueprints, incorporating the private service access module and updating relevant configuration variables and test tags. The review feedback identifies opportunities to improve maintainability and consistency by refining descriptive comments and ensuring uniform quoting of keys within Slurm configuration blocks.

Comment thread examples/machine-learning/a3-highgpu-8g/a3high-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml Outdated
Comment thread examples/machine-learning/a4x-highgpu-4g/a4xhigh-slurm-blueprint.yaml Outdated
@saara-tyagi27 changed the title from "Make lustre default in A series blueprints" to "Make Managed lustre default in A series Slurm blueprints" on Apr 1, 2026
@saara-tyagi27 force-pushed the filesystem-a-series branch 2 times, most recently from c669440 to 4997920, on April 7, 2026 03:37
@saara-tyagi27 changed the title from "Make Managed lustre default in A series Slurm blueprints" to "Make Managed lustre default in A3u, A4h, A3m series Slurm blueprints" on Apr 8, 2026
@saara-tyagi27 added the enhancement and release-improvements labels on Apr 8, 2026
@saara-tyagi27 marked this pull request as ready for review on April 8, 2026 06:12
@saara-tyagi27 requested review from a team and samskillman as code owners on April 8, 2026 06:12
@saara-tyagi27

Contributor Author

/gcbrun

@saara-tyagi27 force-pushed the filesystem-a-series branch 2 times, most recently from e27b4bb to 6596542, on May 7, 2026 06:21
@saara-tyagi27 force-pushed the filesystem-a-series branch from 6596542 to 08942ef on May 20, 2026 07:30
@saara-tyagi27 changed the title from "Make Managed lustre default in A3u, A4h, A3m series Slurm blueprints" to "Make Managed lustre default in A3u and A3m series Slurm blueprints" on May 20, 2026

@arpit974 left a comment

Contributor

Please review the failing test and make sure it is not caused by the changes in this PR.

@saara-tyagi27

Contributor Author

The A3m and A3u tests passed.
slurm-gcp-v6-static is failing, but that is unrelated to my changes in this PR.

@saara-tyagi27 merged commit 493517c into GoogleCloudPlatform:develop on May 25, 2026
18 of 83 checks passed
Labels

  • enhancement: New feature or request
  • release-improvements: Added to release notes under the "Improvements" heading.

2 participants