feat(slurm): support compact placement with DWS Flex-Start for H4D, A3Ultra and A4#5579

Merged
parulbajaj01 merged 6 commits into GoogleCloudPlatform:develop from parulbajaj01:dev
May 21, 2026

@parulbajaj01

This PR adds support for combining Dynamic Workload Scheduler (DWS) Flex-Start with Compact Placement in Slurm.

Previously, these two features were mutually exclusive. This PR enables them to work together, allowing users to request short-lived, cost-effective machines while still guaranteeing low-latency physical placement (crucial for AI/ML distributed training).

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@parulbajaj01 parulbajaj01 requested a review from a team as a code owner April 30, 2026 05:37
@parulbajaj01 parulbajaj01 added the release-key-new-features Added to release notes under the "Key New Features" heading. label Apr 30, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for using Compact Placement alongside DWS Flex-Start within the Slurm-GCP environment. By transitioning from standard Group Placement Policies to Workload Policies for dynamic MIGs, the changes allow users to leverage cost-effective, short-lived instances while maintaining the strict physical proximity requirements necessary for high-performance distributed AI/ML training workloads.

Highlights

  • Feature Integration: Enabled the combination of Dynamic Workload Scheduler (DWS) Flex-Start with Compact Placement in Slurm, specifically for H4D, A3Ultra, and A4 machine types.
  • Infrastructure Updates: Updated the Slurm controller scripts to support Workload Policies for dynamic Managed Instance Groups (MIGs) when using DWS Flex.
  • Validation Changes: Removed the mutual exclusivity constraints between DWS Flex and placement groups in the metadata and Terraform configuration.
  • Documentation: Added comprehensive documentation on how to configure and enable compact placement with DWS Flex-Start.
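
The validation change in the highlights above can be pictured with a small Python sketch. It is not code from this PR: the `NodesetConfig` fields, the `validate_nodeset` helper, and the machine-type prefixes are hypothetical names chosen for illustration, and whether the Toolkit enforces the machine-family restriction at validation time is an assumption.

```python
from dataclasses import dataclass
from typing import List

# ASSUMPTION: prefixes standing in for the H4D, A3Ultra and A4 families
# named in the PR title. The real allow-list (if any) lives in the PR diff.
SUPPORTED_PREFIXES = ("h4d-", "a3-ultragpu-", "a4-highgpu-")


@dataclass
class NodesetConfig:
    """Hypothetical, simplified view of a nodeset's settings."""
    machine_type: str
    dws_flex_enabled: bool
    enable_placement: bool


def validate_nodeset(cfg: NodesetConfig) -> List[str]:
    """Return a list of validation errors (empty when the config is accepted).

    Before the change, DWS Flex together with placement would have been
    rejected outright. In this sketch the combination is accepted for the
    supported machine families and rejected for everything else.
    """
    errors: List[str] = []
    if cfg.dws_flex_enabled and cfg.enable_placement:
        if not cfg.machine_type.startswith(SUPPORTED_PREFIXES):
            errors.append(
                f"compact placement with DWS Flex is not supported for "
                f"machine type {cfg.machine_type!r}"
            )
    return errors
```

Under these assumptions, `validate_nodeset(NodesetConfig("a3-ultragpu-8g", True, True))` returns an empty list, while the same flags on an unsupported machine type produce one error.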

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enables compact placement support for DWS Flex nodes in the Slurm-GCP v6 modules. The changes remove Terraform-level restrictions, update the Python resume scripts to utilize workloadPolicy with HIGH_THROUGHPUT for Managed Instance Groups when flex is enabled, and map placement distances to topology distances. Documentation has been updated to reflect these enhancements. I have no feedback to provide.
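
As a rough illustration of the review summary above, the following Python sketch builds a resource policy request body that uses `workloadPolicy` with type `HIGH_THROUGHPUT` and translates a placement distance into a topology distance. It is a sketch under stated assumptions, not the PR's implementation: `build_workload_policy_body` is a hypothetical helper, the field names follow the Compute Engine resource policy API as I understand it and should be checked against the API reference, and the distance mapping is illustrative only because the actual mapping is not shown in this conversation.

```python
from typing import Optional

# ASSUMPTION: illustrative mapping from a placement max distance to a
# topology distance. The values used by the PR are not shown on this page.
DISTANCE_TO_TOPOLOGY = {1: "SUBBLOCK", 2: "BLOCK", 3: "CLUSTER"}


def build_workload_policy_body(name: str, max_distance: Optional[int] = None) -> dict:
    """Build a request body for a workload-policy resource policy.

    `name` is the policy name; `max_distance` is the optional placement
    distance configured for the nodeset. Hypothetical helper, not taken
    from the Slurm-GCP scripts.
    """
    workload_policy = {"type": "HIGH_THROUGHPUT"}
    if max_distance is not None:
        if max_distance not in DISTANCE_TO_TOPOLOGY:
            raise ValueError(f"unsupported placement max distance: {max_distance}")
        workload_policy["maxTopologyDistance"] = DISTANCE_TO_TOPOLOGY[max_distance]
    return {"name": name, "workloadPolicy": workload_policy}
```

Keeping the mapping in a single dictionary makes it easy to swap in the real values once they are confirmed from the diff, and rejecting unknown distances early avoids sending a malformed policy to the API.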

@parulbajaj01 parulbajaj01 requested a review from arpit974 May 19, 2026 04:45
@parulbajaj01 parulbajaj01 merged commit a6e470b into GoogleCloudPlatform:develop May 21, 2026
13 of 80 checks passed
kadupoornima pushed a commit to kadupoornima/cluster-toolkit that referenced this pull request May 25, 2026