Make Managed lustre default in A3u and A3m series Slurm blueprints#5396
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request transitions the default shared file system for the A3 UltraGPU and A4 HighGPU SLURM blueprints from Filestore to Managed Lustre. This change aims to leverage the performance benefits of Managed Lustre for high-performance computing workloads by updating the blueprint configurations to enable and utilize the new file system, ensuring a more optimized storage solution for these environments. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
958b61f to
dff3234
Compare
There was a problem hiding this comment.
Code Review
This pull request updates the a3ultra-slurm-blueprint.yaml and a4high-slurm-blueprint.yaml files to transition from using Filestore for /home to Managed Lustre. This involves removing the filestore_ip_range variable, setting install_managed_lustre to true, switching the homefs module source to managed-lustre, and configuring its settings. For a4high-slurm-blueprint.yaml, the Managed Lustre-related variables (lustre_instance_id, lustre_size_gib, per_unit_storage_throughput) are uncommented and initialized. Feedback from the review indicates that in a3ultra-slurm-blueprint.yaml, the Managed Lustre variables are still commented out, which will cause the blueprint to fail. Additionally, obsolete comments related to filestore_ip_range and Managed Lustre instructions in a4high-slurm-blueprint.yaml should be removed to improve clarity and maintainability.
dc224fa to
f46cbc0
Compare
|
/gemini review |
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates several machine learning blueprints (A3 HighGPU, A3 MegaGPU, A3 UltraGPU, A4 HighGPU, A4x HighGPU) to replace Filestore with Managed Lustre for shared home directories. The changes involve updating blueprint variables, switching the file system module, adding private service access, and enabling Managed Lustre installation. A consistent improvement opportunity was identified across all updated blueprints: the lustre_instance_id is currently a static string, which could cause conflicts upon multiple deployments. It is recommended to incorporate the deployment_name variable to ensure unique instance IDs.
|
/gcbrun |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request replaces Filestore with Managed Lustre across several A3 and A4 Slurm blueprints, incorporating the private service access module and updating relevant configuration variables and test tags. The review feedback identifies opportunities to improve maintainability and consistency by refining descriptive comments and ensuring uniform quoting of keys within Slurm configuration blocks.
1bb5c11 to
a36240b
Compare
c669440 to
4997920
Compare
|
/gcbrun |
e890859 to
84fa037
Compare
e27b4bb to
6596542
Compare
6596542 to
08942ef
Compare
arpit974
left a comment
There was a problem hiding this comment.
please review the failing test and make sure it is not due to this PR change.
|
A3m and A3u tests passed. |
493517c
into
GoogleCloudPlatform:develop
Summary
This PR replaces Google Cloud Filestore with Managed Lustre as the default shared filesystem in a3m, a3u and a4h machine learning blueprints
Key Changes:
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.