feat: Enable Block topology for A4X#5021
feat: Enable Block topology for A4X#5021Neelabh94 merged 3 commits intoGoogleCloudPlatform:developfrom
Conversation
Summary of ChangesHello @Neelabh94, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Slurm controller's ability to manage cluster topologies by introducing version-aware configuration generation. It specifically enables the use of Slurm's Block topology for A4X Multi-Node NVLink Platforms when the Slurm version is 25.05 or greater. This is achieved through a new system that dynamically selects between traditional .conf and modern YAML-based topology configurations, ensuring compatibility while leveraging advanced features for newer Slurm deployments. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request enables block topology for Slurm versions 25.05 and newer, which is a great feature for A4X platforms. The code is well-refactored to handle version differences by moving the logic for older Slurm versions to a new conf_v2411.py file. The new topology.yaml generation is comprehensive and includes good test coverage.
I have identified a few issues that need attention:
- A potentially breaking change in
slurmsync.pyfor hybrid deployments. - Some leftover debugging print statements in the tests.
- A confusing comment and a forward reference in
conf.pythat affect maintainability.
I am having trouble creating individual review comments. Click here to see my feedback.
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/slurmsync.py (414-416)
The removal of this block seems to significantly change the behavior for hybrid deployments. The original comment indicated that Terraform is responsible for configuration in hybrid mode. By removing this, slurmsync.py will now always attempt to fetch and apply configuration from GCS, which might break existing hybrid deployment workflows. Could you please clarify if this change is intended and what the expected behavior for hybrid mode should be now? If this is an intended change, it should be highlighted in the PR description.
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (69)
The comment for BLOCK_SIZE is confusing. It states "closest power of two larger than BLOCK size ** 2", but with BLOCK_SIZE = 32, this calculation does not seem to hold. To improve clarity and prevent future confusion, please correct or rephrase the comment. A clearer comment might be something like: "Total number of blocks per block group, must be a power of two as required by Slurm."
BLOCK_SIZE = 32 # Total number of blocks per block group, must be a power of two as required by Slurm.
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (577-578)
The constant _SLURM_TOPO_ROOT is used here but is defined later in the file at line 655. While this may not cause a NameError at runtime due to Python's module loading behavior, it is a forward reference that makes the code harder to read and maintain. It's best practice to define constants before they are used. Please consider moving the definition of _SLURM_TOPO_ROOT to the top of the file with other constants.
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/tests/test_conf.py (193-200)
These print statements appear to be for debugging purposes. Please remove them before merging.
assert conf.conflines(util.Lookup(cfg)) == want
cfg.cloud_parameters = addict.Dict(cfg.cloud_parameters)
assert conf.conflines(util.Lookup(cfg)) == want
4dfdbdc to
f02da9e
Compare
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for Slurm's block topology for A4X platforms on Slurm versions 25.05 and newer. The implementation is well-structured, refactoring the configuration generation to be version-aware. For newer Slurm versions, it generates a topology.yaml file with block topology support, while maintaining the old topology.conf generation for older versions. This is achieved by detecting the Slurm version at runtime and dispatching to the appropriate logic. The test suite has been commendably updated to cover these new features and version-specific paths.
The overall approach is solid, but I've identified one potential issue in the block topology generation logic that could result in an invalid configuration for clusters with a single block group. My review includes a specific suggestion to address this.
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py
Show resolved
Hide resolved
ea457c5 to
744c412
Compare
744c412 to
2d93637
Compare
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py
Show resolved
Hide resolved
community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/main.tf
Show resolved
Hide resolved
4ad9868
into
GoogleCloudPlatform:develop
PR enables block topology for Multi-Node NVLink Platforms like A4x for SLURM version >=25.05
Submission Checklist
NOTE: Community submissions can take up to 2 weeks to be reviewed.
Please take the following actions before submitting this pull request.