(Slurm) Implement dynamic machine configurations via API #5514

Merged: AdarshK15 merged 11 commits into GoogleCloudPlatform:develop from AdarshK15:fix/dynamic-accelerator-slurm on Apr 20, 2026

AdarshK15 (Member) commented Apr 16, 2026

Summary

This PR implements dynamic machine configuration lookups for Slurm nodeset modules. It serves as a follow-up to PR #5426, which enabled this dynamic API-based lookup for GKE.

Key Changes

  • Module Variable: Adds a machine_configs variable to both the dynamic and standard Slurm nodeset modules.
  • Blueprint Flexibility: The machine_configs variable is defined with a type of any. This allows users to provide the configuration as a native YAML map or object in the blueprint, which is more natural and less error-prone than providing a raw JSON string.
  • Test Cleanup: Removes hardcoded accelerator settings from the a4x-slurm daily integration test; these are now handled automatically.
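
To make the "less error-prone" point concrete, here is a minimal Python sketch (not code from this PR). It assumes a hypothetical configuration shape: the names `example-machine-type` and `gpu_count` are placeholders, since the actual schema of machine_configs is not shown in this conversation. A hand-written JSON string can hide a syntax error until something parses it, whereas a native map is already structured data.

```python
import json
from typing import Any, Optional

# Hypothetical shape: the real machine_configs schema is not shown in this PR.
native_config = {"example-machine-type": {"gpu_count": 4}}

# The same data hand-written as a raw JSON string, with a trailing-comma typo.
raw_json_config = '{"example-machine-type": {"gpu_count": 4,}}'


def parse_raw_json(text: str) -> Optional[Any]:
    """Return the parsed value, or None if the string is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# The typo in the raw string only surfaces when something tries to parse it.
parsed = parse_raw_json(raw_json_config)

# The native map needs no parsing step and round-trips through JSON unchanged.
round_tripped = json.loads(json.dumps(native_config))
```

In this sketch the raw string fails to parse because strict JSON does not allow trailing commas, while the native structure carries no such risk.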

AdarshK15 requested review from a team and samskillman as code owners on April 16, 2026 11:24

gemini-code-assist (Contributor) commented

Summary of Changes

This pull request introduces a 'machine_configs' variable to the Slurm nodeset modules, enabling more granular control over GCE machine type definitions. Additionally, it cleans up the daily integration test configuration by removing hardcoded accelerator settings that are now better handled through the updated module configuration.

Highlights

  • New Configuration Parameter: Added a new 'machine_configs' variable to both dynamic and standard Slurm nodeset modules to allow for more flexible GCE machine type and count definitions.
  • Module Integration: Updated the 'gpu' module integration within the nodeset modules to pass the new 'machine_configs' parameter.
  • Test Cleanup: Removed redundant manual sed-based accelerator configuration in the 'ml-a4x-highgpu-slurm' daily test build.
  • Documentation Updates: Updated README files for nodeset modules to include the new input variable and corrected table formatting.

AdarshK15 marked this pull request as draft on April 16, 2026 11:24
AdarshK15 changed the title from "Fixdynamic accelerator slurm" to "Fix: dynamic accelerator slurm" on Apr 16, 2026

gemini-code-assist Bot (Contributor) left a comment

Code Review

This pull request introduces a machine_configs variable to both the dynamic and static Slurm nodeset modules, enabling the definition of GCE machine types and counts. It also simplifies the ml-a4x-highgpu-slurm.yaml daily test build by removing manual configuration injections. The review feedback suggests improving the user experience by changing the machine_configs variable type from string to any, which would allow blueprint authors to provide configurations as native YAML objects instead of raw JSON strings, while using jsonencode() to maintain consistency with other complex inputs in the toolkit.
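
The suggested pattern (accept any structured value, then serialize it once before passing it on) can be sketched in Python as follows. This is an analogy only: `encode_machine_configs` is a hypothetical helper, `json.dumps` stands in for Terraform's jsonencode(), and the sample keys are placeholders rather than the module's real schema.

```python
import json
from typing import Any


def encode_machine_configs(value: Any) -> str:
    """Serialize a native structure to a compact JSON string.

    Hypothetical stand-in for calling jsonencode() on a variable of type any.
    """
    # sort_keys gives a stable key order; separators drop optional whitespace.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


# Placeholder input, as a blueprint author might write it in native YAML form.
encoded = encode_machine_configs({"type-b": {"count": 2}, "type-a": {"count": 1}})
```

The caller never writes JSON by hand in this arrangement; the encoding happens in one place, so downstream consumers that expect a string still receive one.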

Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/main.tf
Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/variables.tf
LAVEEN changed the title from "Fix: dynamic accelerator slurm" to "(Slurm) Implement dynamic machine configurations via API" on Apr 16, 2026

AdarshK15 (Member, Author) commented

/gcbrun

AdarshK15 added the release-module-improvements label (added to release notes under the "Module Improvements" heading) on Apr 16, 2026

SwarnaBharathiMantena (Contributor) left a comment

LGTM!

Multiple tests are successful, especially the Slurm A4X PR test.

AdarshK15 marked this pull request as ready for review on April 17, 2026 09:01

LAVEEN (Collaborator) left a comment

LGTM

AdarshK15 (Member, Author) commented

Test failures:

  1. Several tests are failing because of capacity/reservation issues; these tests are disabled in the daily tests.
  2. gke-a3-ultragpu-onspot: failing due to insufficient capacity.
  3. gke-h4d-onspot: failing because the Filestore limit was reached.

The above failures are not related to the code changes in this PR. All remaining tests are passing.

AdarshK15 merged commit 8812dcb into GoogleCloudPlatform:develop on Apr 20, 2026
73 of 82 checks passed
AdarshK15 added a commit to AdarshK15/cluster-toolkit that referenced this pull request on Apr 20, 2026
AdarshK15 deleted the fix/dynamic-accelerator-slurm branch on May 3, 2026 18:37