
feat: Enable Block topology for A4X #5021

Merged

Neelabh94 merged 3 commits into GoogleCloudPlatform:develop from Neelabh94:feature/block-topology on Dec 30, 2025

Conversation


@Neelabh94 Neelabh94 commented Dec 19, 2025

This PR enables block topology for Multi-Node NVLink platforms such as A4X on Slurm versions >= 25.05.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@gemini-code-assist
Contributor

Summary of Changes

Hello @Neelabh94, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the Slurm controller's ability to manage cluster topologies by introducing version-aware configuration generation. It specifically enables the use of Slurm's Block topology for A4X Multi-Node NVLink Platforms when the Slurm version is 25.05 or greater. This is achieved through a new system that dynamically selects between traditional .conf and modern YAML-based topology configurations, ensuring compatibility while leveraging advanced features for newer Slurm deployments.

Highlights

  • Slurm Version-Aware Configuration: The system now dynamically generates Slurm configuration based on the detected Slurm version, supporting different topology formats.
  • Block Topology Support for A4X: Enabled Slurm's Block topology for Multi-Node NVLink Platforms (A4X) when running Slurm version 25.05 or newer.
  • YAML-based Topology Configuration: Introduced a new YAML format for topology configuration (topology.yaml) for modern Slurm versions, replacing the older .conf file for these cases.
  • Legacy Configuration Preservation: The previous configuration logic has been moved to a dedicated conf_v2411.py module to maintain compatibility with older Slurm versions.
  • Dynamic Slurm Version Detection: Added functionality to detect the installed Slurm version at runtime to enable conditional configuration generation.
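The version-aware dispatch described above can be sketched as follows. This is an illustration only: `parse_slurm_version` and `topology_format` are hypothetical names, not the PR's actual identifiers, and the version-string format is assumed from typical `sinfo --version` output.

```python
import re
from typing import Tuple

def parse_slurm_version(output: str) -> Tuple[int, int]:
    """Parse (major, minor) from output like 'slurm 25.05.1'."""
    m = re.search(r"slurm\s+(\d+)\.(\d+)", output)
    if not m:
        raise ValueError(f"cannot parse Slurm version from {output!r}")
    return int(m.group(1)), int(m.group(2))

def topology_format(version: Tuple[int, int]) -> str:
    """Slurm >= 25.05 gets the YAML-based topology; older versions keep topology.conf."""
    return "topology.yaml" if version >= (25, 5) else "topology.conf"
```

Tuple comparison makes the cutoff check robust: `(24, 11) < (25, 5)` even though `11 > 5` as a minor version.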



@Neelabh94 Neelabh94 added release-key-new-features Added to release notes under the "Key New Features" heading. do-not-merge Block merging of this PR labels Dec 19, 2025

@gemini-code-assist bot left a comment


Code Review

This pull request enables block topology for Slurm versions 25.05 and newer, which is a great feature for A4X platforms. The code is well-refactored to handle version differences by moving the logic for older Slurm versions to a new conf_v2411.py file. The new topology.yaml generation is comprehensive and includes good test coverage.

I have identified a few issues that need attention:

  • A potentially breaking change in slurmsync.py for hybrid deployments.
  • Some leftover debugging print statements in the tests.
  • A confusing comment and a forward reference in conf.py that affect maintainability.

I am having trouble creating individual review comments, so my feedback is included inline below.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/slurmsync.py (414-416)

critical

The removal of this block seems to significantly change the behavior for hybrid deployments. The original comment indicated that Terraform is responsible for configuration in hybrid mode. By removing this, slurmsync.py will now always attempt to fetch and apply configuration from GCS, which might break existing hybrid deployment workflows. Could you please clarify if this change is intended and what the expected behavior for hybrid mode should be now? If this is an intended change, it should be highlighted in the PR description.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (69)

medium

The comment for BLOCK_SIZE is confusing. It states "closest power of two larger than BLOCK size ** 2", but with BLOCK_SIZE = 32, this calculation does not seem to hold. To improve clarity and prevent future confusion, please correct or rephrase the comment. A clearer comment might be something like: "Total number of blocks per block group, must be a power of two as required by Slurm."

    BLOCK_SIZE = 32  # Total number of blocks per block group, must be a power of two as required by Slurm.
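The power-of-two constraint called out in this comment can be enforced with a standard bit trick; the validator below is a hypothetical illustration, not part of the PR.

```python
BLOCK_SIZE = 32  # blocks per block group; Slurm's block topology requires a power of two

def is_power_of_two(n: int) -> bool:
    # A positive power of two has exactly one bit set, so n & (n - 1) == 0.
    return n > 0 and (n & (n - 1)) == 0

assert is_power_of_two(BLOCK_SIZE)
```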

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/conf.py (577-578)

medium

The constant _SLURM_TOPO_ROOT is used here but is defined later in the file at line 655. While this may not cause a NameError at runtime due to Python's module loading behavior, it is a forward reference that makes the code harder to read and maintain. It's best practice to define constants before they are used. Please consider moving the definition of _SLURM_TOPO_ROOT to the top of the file with other constants.

community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/tests/test_conf.py (193-200)

medium

These print statements appear to be for debugging purposes. Please remove them before merging.

    assert conf.conflines(util.Lookup(cfg)) == want

    cfg.cloud_parameters = addict.Dict(cfg.cloud_parameters)
    assert conf.conflines(util.Lookup(cfg)) == want

@Neelabh94 Neelabh94 force-pushed the feature/block-topology branch 3 times, most recently from 4dfdbdc to f02da9e on December 23, 2025 11:52
@Neelabh94
Contributor Author

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for Slurm's block topology for A4X platforms on Slurm versions 25.05 and newer. The implementation is well-structured, refactoring the configuration generation to be version-aware. For newer Slurm versions, it generates a topology.yaml file with block topology support, while maintaining the old topology.conf generation for older versions. This is achieved by detecting the Slurm version at runtime and dispatching to the appropriate logic. The test suite has been commendably updated to cover these new features and version-specific paths.

The overall approach is solid, but I've identified one potential issue in the block topology generation logic that could result in an invalid configuration for clusters with a single block group. My review includes a specific suggestion to address this.
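The single-block-group edge case flagged above can be illustrated with a simple chunking helper. This is a hedged sketch under assumed names: `make_block_groups` is not the PR's actual function, and real block-topology generation involves more than chunking, but it shows how a small cluster naturally produces exactly one (possibly partial) block group.

```python
from typing import List

def make_block_groups(nodes: List[str], block_size: int) -> List[List[str]]:
    """Chunk nodes into block groups of at most block_size members.

    A cluster with fewer than block_size nodes yields a single, possibly
    partial group -- the degenerate case the review asks to handle so the
    emitted topology remains valid.
    """
    return [nodes[i:i + block_size] for i in range(0, len(nodes), block_size)]
```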

@Neelabh94 Neelabh94 marked this pull request as ready for review December 24, 2025 12:51
@Neelabh94 Neelabh94 requested review from a team and samskillman as code owners December 24, 2025 12:51
@Neelabh94 Neelabh94 removed the do-not-merge Block merging of this PR label Dec 24, 2025
@Neelabh94 Neelabh94 force-pushed the feature/block-topology branch from ea457c5 to 744c412 on December 26, 2025 04:58
@Neelabh94 Neelabh94 force-pushed the feature/block-topology branch from 744c412 to 2d93637 on December 29, 2025 12:55
@Neelabh94 Neelabh94 merged commit 4ad9868 into GoogleCloudPlatform:develop Dec 30, 2025
73 of 75 checks passed
@Neelabh94 Neelabh94 deleted the feature/block-topology branch December 30, 2025 06:55
