feat: Enable Block topology for A4X by Neelabh94 · Pull Request #5021 · GoogleCloudPlatform/cluster-toolkit

Neelabh94 · 2025-12-19T07:15:46Z

PR enables block topology for Multi-Node NVLink Platforms like A4x for SLURM version >=25.05

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

Fork your PR branch from the Toolkit "develop" branch (not main)
Test all changes with pre-commit in a local branch #
Confirm that "make tests" passes all tests
Add or modify unit tests to cover code changes
Ensure that unit test coverage remains above 80%
Update all applicable documentation
Follow Cluster Toolkit Contribution guidelines #

gemini-code-assist · 2025-12-19T07:16:21Z

Summary of Changes

Hello @Neelabh94, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Slurm controller's ability to manage cluster topologies by introducing version-aware configuration generation. It specifically enables the use of Slurm's Block topology for A4X Multi-Node NVLink Platforms when the Slurm version is 25.05 or greater. This is achieved through a new system that dynamically selects between traditional .conf and modern YAML-based topology configurations, ensuring compatibility while leveraging advanced features for newer Slurm deployments.

Highlights

Slurm Version-Aware Configuration: The system now dynamically generates Slurm configuration based on the detected Slurm version, supporting different topology formats.
Block Topology Support for A4X: Enabled Slurm's Block topology for Multi-Node NVLink Platforms (A4X) when running Slurm version 25.05 or newer.
YAML-based Topology Configuration: Introduced a new YAML format for topology configuration (topology.yaml) for modern Slurm versions, replacing the older .conf file for these cases.
Legacy Configuration Preservation: The previous configuration logic has been moved to a dedicated conf_v2411.py module to maintain compatibility with older Slurm versions.
Dynamic Slurm Version Detection: Added functionality to detect the installed Slurm version at runtime to enable conditional configuration generation.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request enables block topology for Slurm versions 25.05 and newer, which is a great feature for A4X platforms. The code is well-refactored to handle version differences by moving the logic for older Slurm versions to a new conf_v2411.py file. The new topology.yaml generation is comprehensive and includes good test coverage.

I have identified a few issues that need attention:

A potentially breaking change in slurmsync.py for hybrid deployments.
Some leftover debugging print statements in the tests.
A confusing comment and a forward reference in conf.py that affect maintainability.

I am having trouble creating individual review comments. Click here to see my feedback.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/slurmsync.py (414-416)

The removal of this block seems to significantly change the behavior for hybrid deployments. The original comment indicated that Terraform is responsible for configuration in hybrid mode. By removing this, slurmsync.py will now always attempt to fetch and apply configuration from GCS, which might break existing hybrid deployment workflows. Could you please clarify if this change is intended and what the expected behavior for hybrid mode should be now? If this is an intended change, it should be highlighted in the PR description.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (69)

The comment for BLOCK_SIZE is confusing. It states "closest power of two larger than BLOCK size ** 2", but with BLOCK_SIZE = 32, this calculation does not seem to hold. To improve clarity and prevent future confusion, please correct or rephrase the comment. A clearer comment might be something like: "Total number of blocks per block group, must be a power of two as required by Slurm."

BLOCK_SIZE = 32 # Total number of blocks per block group, must be a power of two as required by Slurm.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (577-578)

The constant _SLURM_TOPO_ROOT is used here but is defined later in the file at line 655. While this may not cause a NameError at runtime due to Python's module loading behavior, it is a forward reference that makes the code harder to read and maintain. It's best practice to define constants before they are used. Please consider moving the definition of _SLURM_TOPO_ROOT to the top of the file with other constants.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/tests/test_conf.py (193-200)

These print statements appear to be for debugging purposes. Please remove them before merging.

    assert conf.conflines(util.Lookup(cfg)) == want

    cfg.cloud_parameters = addict.Dict(cfg.cloud_parameters)
    assert conf.conflines(util.Lookup(cfg)) == want

Neelabh94 · 2025-12-24T06:51:58Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for Slurm's block topology for A4X platforms on Slurm versions 25.05 and newer. The implementation is well-structured, refactoring the configuration generation to be version-aware. For newer Slurm versions, it generates a topology.yaml file with block topology support, while maintaining the old topology.conf generation for older versions. This is achieved by detecting the Slurm version at runtime and dispatching to the appropriate logic. The test suite has been commendably updated to cover these new features and version-specific paths.

The overall approach is solid, but I've identified one potential issue in the block topology generation logic that could result in an invalid configuration for clusters with a single block group. My review includes a specific suggestion to address this.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/main.tf

Neelabh94 added release-key-new-features Added to release notes under the "Key New Features" heading. do-not-merge Block merging of this PR labels Dec 19, 2025

gemini-code-assist bot reviewed Dec 19, 2025

View reviewed changes

Neelabh94 force-pushed the feature/block-topology branch 3 times, most recently from 4dfdbdc to f02da9e Compare December 23, 2025 11:52

gemini-code-assist bot reviewed Dec 24, 2025

View reviewed changes

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py Show resolved Hide resolved

Neelabh94 marked this pull request as ready for review December 24, 2025 12:51

Neelabh94 requested review from a team and samskillman as code owners December 24, 2025 12:51

Neelabh94 removed the do-not-merge Block merging of this PR label Dec 24, 2025

Neelabh94 force-pushed the feature/block-topology branch from ea457c5 to 744c412 Compare December 26, 2025 04:58

Neelabh94 added 3 commits December 29, 2025 12:54

feat: Enable Block topology for A4X

d30b408

fix for TPU test failure

ab780d3

remove the deprecated cloud_reg_addrs parameter

2d93637

Neelabh94 force-pushed the feature/block-topology branch from 744c412 to 2d93637 Compare December 29, 2025 12:55

LAVEEN reviewed Dec 29, 2025

View reviewed changes

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py Show resolved Hide resolved

LAVEEN approved these changes Dec 30, 2025

View reviewed changes

sarthakag approved these changes Dec 30, 2025

View reviewed changes

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/main.tf Show resolved Hide resolved

Neelabh94 merged commit 4ad9868 into GoogleCloudPlatform:develop Dec 30, 2025
73 of 75 checks passed

Neelabh94 deleted the feature/block-topology branch December 30, 2025 06:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Enable Block topology for A4X#5021

feat: Enable Block topology for A4X#5021
Neelabh94 merged 3 commits intoGoogleCloudPlatform:developfrom
Neelabh94:feature/block-topology

Neelabh94 commented Dec 19, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Dec 19, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Neelabh94 commented Dec 24, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Neelabh94 commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Submission Checklist

Uh oh!

gemini-code-assist bot commented Dec 19, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/slurmsync.py (414-416)

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (69)

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (577-578)

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/tests/test_conf.py (193-200)

Uh oh!

Neelabh94 commented Dec 24, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Neelabh94 commented Dec 19, 2025 •

edited

Loading