Switch to using gcsfuse profile feature in aiml gcs-bucket mounts in slurm cluster blueprints (#5047)
Conversation
Summary of Changes

Hello @gargnitingoogle, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the A3 Mega GPU Slurm blueprint by integrating cloud storage tailored for machine learning workloads. It introduces dedicated GCS buckets for checkpoints, training data, and model serving, each configured for performance and observability. The blueprint also gains an automated step to install Go, likely supporting new tooling or services, and a minor cleanup in the underlying Terraform module for cloud storage buckets.
Code Review
This pull request introduces several new Cloud Storage buckets to the a3mega-slurm-blueprint.yaml example, tailored for different machine learning workflow stages (checkpoints, training data, model serving). It also adds a startup script to install Go. My review focuses on improving the maintainability of the new script, cleaning up redundant configuration, and removing leftover debugging code. Additionally, please note that the pull request description is the default template and should be updated to explain the purpose of these changes and how they were tested, as per the repository's contribution guidelines (lines 42-44).
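For context, a GCS bucket mount in a Cluster Toolkit blueprint of this kind is typically declared as a module. The sketch below is illustrative only: the module source, bucket id, mount point, and option values are assumptions for illustration, not copied from this PR.

```yaml
# Hypothetical blueprint excerpt: a GCS bucket module wired into a
# Slurm deployment group. Exact ids and options in the PR may differ.
  - id: checkpoint-bucket
    source: community/modules/file-system/cloud-storage-bucket
    settings:
      local_mount: /gcs-checkpoints
      mount_options: implicit_dirs,dir_mode=777,file_mode=777
```

The `mount_options` string is what later flows into the node's mount configuration, which is why its exact formatting matters in the reviews below.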
(Two resolved review comments on examples/machine-learning/a3-megagpu-8g/a3mega-slurm-blueprint.yaml, both marked outdated.)
Force-pushed 8d520a9 to bdfd10d.
/gemini summary
Summary of Changes

This pull request refines the A3 Ultra GPU Slurm blueprint by simplifying and standardizing the GCSFuse mount configurations for critical AI/ML workloads. It introduces dedicated GCSFuse profiles for checkpointing, training data, and model serving, replacing a multitude of explicit caching and metadata options with more concise and potentially optimized settings. This change improves the clarity and maintainability of the storage configurations.
Force-pushed bdfd10d to fc76677.
/gemini summary
/gemini review
Summary of Changes

This pull request streamlines the GCSFuse mount configurations in several AI/ML Slurm cluster blueprints. By adopting specific GCSFuse profiles for checkpointing, training data, and model serving, it replaces numerous explicit caching and metadata settings with a more concise and standardized approach, improving the readability and maintainability of the storage configurations across the blueprints.
Code Review
This pull request updates several Slurm cluster blueprints to use the gcsfuse profile feature for AI/ML workloads. This is a great improvement as it simplifies the configuration, enhances readability, and aligns with best practices for mounting GCS buckets for checkpointing, training, and serving. However, I've found a critical issue in the implementation. The new multi-line mount_options strings are missing the \ line continuation character. This will cause the YAML parser to insert spaces between the options, leading to an invalid fstab entry and mount failures. I've added comments with suggestions to fix this in all affected files.
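The fix the review suggests can be sketched as follows. Inside a double-quoted YAML scalar, a trailing `\` escapes the line break, so the continuation lines are joined without inserted spaces; without it, YAML folding would produce `opt1, opt2`, which is not a valid comma-separated fstab options string. The option names and profile value here are illustrative, not taken from the PR diff.

```yaml
# Each trailing backslash escapes the newline, so the parsed value is
# one contiguous string: "defaults,implicit_dirs,dir_mode=777,profile=aiml-training"
      mount_options: "defaults,implicit_dirs,\
        dir_mode=777,\
        profile=aiml-training"
```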
Force-pushed 2a8da14 to 525af5e.
/gemini summary
Summary of Changes

This pull request refactors the GCSFuse mount configurations in several AI/ML Slurm cluster blueprints. By leveraging the GCSFuse profile feature, it replaces explicit per-option caching and metadata settings with concise, standardized profiles for checkpointing, training data, and model serving.
Force-pushed 525af5e to 433915b.
/gemini review
Code Review
This pull request effectively simplifies the gcsfuse mount options in the Slurm cluster blueprints by leveraging the profile feature for AIML workloads. The changes are applied consistently across the a3-ultragpu-8g, a4-highgpu-8g, and a4x-highgpu-4g blueprints, which improves maintainability. My review focuses on ensuring complete consistency in the naming convention for mount options. I've suggested a minor change to use underscores for all gcsfuse options to enhance readability.
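To illustrate the underscore convention the review asks for, here is a hedged sketch of what a resulting /etc/fstab entry might look like once the blueprint's mount options are copied over. The bucket name, mount point, and profile value are illustrative assumptions.

```
# gcsfuse persistent-mount entries require underscores in option names,
# e.g. file_cache_max_size_mb rather than file-cache-max-size-mb.
training-data-bucket /gcs-training gcsfuse rw,_netdev,implicit_dirs,profile=aiml-training 0 0
```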
Force-pushed 433915b to 1c37225, then 1c37225 to 941997d.
/gcbrun
Description
- Used the GCSFuse `profile` parameter to replace the more fine-grained cache and config control parameters in the Slurm cluster blueprints for the A3-ultra, A4x-high, and A4-high node types, for all the AI/ML workload bucket mount options. This makes the mount options in the blueprints shorter and easier to maintain, as GCSFuse automatically sets the optimal value for each parameter based on the profile value (details).
- Replaced `-` with `_` in the gcsfuse `mount_options` in the changed blueprints, as these mount options are copied directly to the `/etc/fstab` file for mounting, and gcsfuse requires `_` instead of `-` in mount options in that case (persistent mounting documentation).
- Tested on `a3-ultragpu-8g`; the flags were reflected and applied correctly on a compute node.
- Tested on `a4-highgpu-8g`; the change was reflected in the mount logs, and file operations worked correctly in the mounted directories on the compute node.

Note: Found a couple of unrelated issues.
- `Mounting failed as local mount: /gcs-checkpoints was already in use in fstab` seems to indicate that the srun command causes the mounts to be added to `/etc/fstab` again, which then fails as expected. Not sure if this is a bug or expected behavior.
- When doing `mv` operations in the gcsfuse mounts on the Slurm compute node, I got the errors `mv: preserving times for '/gcs/sample_1GB_renamed.txt': Operation not permitted` and `mv: preserving permissions for '/gcs/sample_1GB_renamed.txt': Operation not permitted`. These are expected errors, because a gcsfuse mount isn't fully POSIX-compliant and doesn't support file permission changes or preserving/propagating system times other than the modification time. The errors show up here because `mv` attempts both of these operations and they fail; the `mv` command itself completes with exit code 0 as expected.

Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.
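As an aside, the fstab-safety constraints discussed in this PR (no spaces between options, underscores rather than hyphens in option names) could be checked mechanically. The helper below is a hypothetical sketch, not part of this PR or of the Cluster Toolkit.

```python
def fstab_safe(mount_options: str) -> bool:
    """Return True if a gcsfuse mount_options string is safe to copy
    verbatim into /etc/fstab: comma-separated, no whitespace, and no
    hyphens in option names (persistent mounts require underscores)."""
    if any(ch.isspace() for ch in mount_options):
        return False  # e.g. YAML folding inserted "opt1, opt2" spaces
    for opt in mount_options.split(","):
        name = opt.split("=", 1)[0]
        if "-" in name:
            return False  # e.g. file-cache-max-size-mb, not file_cache_max_size_mb
    return True

print(fstab_safe("defaults,implicit_dirs,file_cache_max_size_mb=-1"))  # True
print(fstab_safe("defaults, implicit_dirs"))                           # False
print(fstab_safe("file-cache-max-size-mb=-1"))                         # False
```

Note that only option *names* are checked for hyphens, so negative values such as `file_cache_max_size_mb=-1` still pass.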