feat(slurm): support compact placement with DWS Flex-Start for H4D, A3Ultra and A4 #5579
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for using Compact Placement alongside DWS Flex-Start within the Slurm-GCP environment. By transitioning from standard Group Placement Policies to Workload Policies for dynamic MIGs, the changes allow users to leverage cost-effective, short-lived instances while maintaining the strict physical proximity requirements necessary for high-performance distributed AI/ML training workloads.
Code Review
This pull request enables compact placement support for DWS Flex nodes in the Slurm-GCP v6 modules. The changes remove Terraform-level restrictions, update the Python resume scripts to utilize workloadPolicy with HIGH_THROUGHPUT for Managed Instance Groups when flex is enabled, and map placement distances to topology distances. Documentation has been updated to reflect these enhancements. I have no feedback to provide.
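To illustrate the mapping described in the review, here is a minimal sketch of how a resume script might build a workload policy request body. It is not the code from this PR: the function name, the constant name, and the specific distance mapping (1 to SUBBLOCK, 2 to BLOCK, 3 to CLUSTER) are hypothetical, and the field names `workloadPolicy`, `type`, and `maxTopologyDistance` are assumed to follow the Compute Engine resource policy API.

```python
# Illustrative sketch only; not the implementation merged in this PR.
from typing import Optional

# ASSUMPTION: mapping from a group-placement max distance (integer) to a
# workload-policy topology distance. The PR's actual mapping may differ.
ASSUMED_TOPOLOGY_DISTANCE = {1: "SUBBLOCK", 2: "BLOCK", 3: "CLUSTER"}


def build_workload_policy_body(name: str, max_distance: Optional[int] = None) -> dict:
    """Build a resource-policy body carrying a HIGH_THROUGHPUT workload policy.

    `name` is the policy name; `max_distance` is the optional placement
    distance that would otherwise go on a group placement policy.
    """
    workload_policy = {"type": "HIGH_THROUGHPUT"}
    if max_distance is not None:
        if max_distance not in ASSUMED_TOPOLOGY_DISTANCE:
            raise ValueError(f"unsupported max_distance: {max_distance}")
        # Translate the integer distance into the assumed topology distance.
        workload_policy["maxTopologyDistance"] = ASSUMED_TOPOLOGY_DISTANCE[max_distance]
    return {"name": name, "workloadPolicy": workload_policy}
```

With these assumptions, `build_workload_policy_body("flex-nodeset-policy", max_distance=2)` yields a body whose `workloadPolicy` requests `HIGH_THROUGHPUT` with a `BLOCK` topology distance; omitting `max_distance` leaves the topology distance unset.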
Merged commit a6e470b into GoogleCloudPlatform:develop
This PR adds support for combining Dynamic Workload Scheduler (DWS) Flex-Start with Compact Placement in Slurm.
Previously, these two features were mutually exclusive. This PR enables them to work together, allowing users to request short-lived, cost-effective machines while still guaranteeing low-latency physical placement (crucial for AI/ML distributed training).
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.