
fix(slurm): respect visible_core_count in cloud.conf generation #5529

Merged
saara-tyagi27 merged 2 commits into GoogleCloudPlatform:develop from saara-tyagi27:visible_core_count on May 8, 2026

@saara-tyagi27

Contributor

Description

This PR fixes an issue where the visible_core_count setting (used to restrict OS-visible CPUs on supported machine types like C4) was ignored by the Slurm configuration generation scripts on the controller.

Previously, the scripts defaulted to the full CPU capacity of the machine type when writing cloud.conf, causing a mismatch with the actual restricted hardware. This led to nodes failing to register with INVALID_REG (Low CPUs) errors.

Changes

  • Modified template_machine_conf in community/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/util.py to check for visibleCoreCount in the instance template's advancedMachineFeatures.
  • If set, CPUs and CoresPerSocket are calculated based on the restricted count instead of the full machine type capacity.
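
The change described above can be sketched roughly as follows. This is a minimal illustration, not the code in util.py: the helper name `restricted_machine_conf`, its parameters, and the returned dictionary are hypothetical, and the sketch assumes the restricted count is used directly as the node's CPU count, which is consistent with the CPUTot values reported under Verification. How the real function handles threads per core is not shown in this PR description.

```python
from typing import Optional


def restricted_machine_conf(template: dict, full_cpus: int, sockets: int = 1) -> dict:
    """Hypothetical helper; not the actual template_machine_conf in util.py."""
    # advancedMachineFeatures may be absent or null in an instance template.
    features = template.get("advancedMachineFeatures") or {}
    visible: Optional[int] = features.get("visibleCoreCount")

    # Assumption: when a restriction is set, it becomes the node's CPU count;
    # otherwise the full machine type capacity is kept.
    cpus = int(visible) if visible else full_cpus

    return {
        "CPUs": cpus,
        "Sockets": sockets,
        "CoresPerSocket": max(cpus // sockets, 1),
    }


# Illustrative inputs only; 288 stands in for the unrestricted CPU count.
restricted = restricted_machine_conf(
    {"advancedMachineFeatures": {"visibleCoreCount": 18}}, full_cpus=288
)
unrestricted = restricted_machine_conf({}, full_cpus=288)
print(restricted)
print(unrestricted)
```

Falling back to the full capacity when `visibleCoreCount` is missing keeps the behavior unchanged for templates that do not set the restriction.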

Verification

  • Deployed a cluster with a blueprint setting visible_core_count to 18, 36, 72, and 144 on c4-highmem-288 instances.
  • Verified that scontrol show node now reports the correct restricted CPU counts (18, 36, 72, 144) in CfgTRES and CPUTot, resolving the configuration mismatch.
  • (Note: VM-level verification via lscpu was blocked by project quota errors preventing VM creation, but the fix for the Slurm configuration side is confirmed).

Fixes b/503047910

saara-tyagi27 requested a review from a team as a code owner on April 20, 2026 05:26
saara-tyagi27 added the bug, python, and release-bugfix labels on Apr 20, 2026
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a configuration mismatch issue where Slurm nodes were failing to register due to incorrect CPU counts. By ensuring the controller respects the visible_core_count defined in the instance template, the generated cloud.conf now accurately reflects the restricted hardware capacity, resolving registration failures.

Highlights

  • Configuration Logic Update: Updated the Slurm configuration generation script to respect the visible_core_count setting from instance templates.
  • Resource Calculation: Adjusted CPU and core-per-socket calculations to align with restricted hardware counts, preventing registration errors.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the Slurm machine configuration logic in util.py to account for visibleCoreCount specified in instance templates. While this adds flexibility for specialized machine types, a technical issue was identified in the calculation of cores_per_socket. Specifically, using physical cores instead of logical CPUs leads to a configuration mismatch that violates Slurm's internal invariants when hyperthreading is active, potentially causing node registration failures. A suggestion was made to derive cores_per_socket from the total logical CPU count to ensure consistency.
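
The concern in this review can be illustrated with a simplified consistency check. The sketch below assumes a strict product rule (CPUs equals sockets times cores per socket times threads per core) and assumes the generated configuration uses one thread per core; both are assumptions for illustration, and Slurm's actual validation may be more permissive. The function name `topology_is_consistent` is hypothetical.

```python
def topology_is_consistent(
    cpus: int, sockets: int, cores_per_socket: int, threads_per_core: int
) -> bool:
    """Simplified check: the CPU count matches the declared topology."""
    return cpus == sockets * cores_per_socket * threads_per_core


# Illustrative values: 18 physical cores exposed as 36 logical CPUs.
logical_cpus = 36
physical_cores = 18
sockets = 1

# cores_per_socket derived from physical cores: 1 * 18 * 1 != 36.
from_physical = topology_is_consistent(logical_cpus, sockets, physical_cores // sockets, 1)

# cores_per_socket derived from logical CPUs: 1 * 36 * 1 == 36.
from_logical = topology_is_consistent(logical_cpus, sockets, logical_cpus // sockets, 1)

print(from_physical, from_logical)
```

Under these assumptions, deriving `cores_per_socket` from the logical CPU count keeps the declared topology and the CPU total in agreement, which is the direction the review suggests.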

@LAVEEN LAVEEN left a comment

Collaborator

LGTM. Make sure all Slurm tests pass.

@saara-tyagi27

Contributor Author

  • Slurm a3h, a4h, and a3m: their respective onspot tests are passing.
  • A4h custom blueprint and gke-a4: tests failed due to insufficient capacity.
  • Slurm-gcp-v6-tpu: the respective daily test has been disabled.

@saara-tyagi27 saara-tyagi27 merged commit 9b10081 into GoogleCloudPlatform:develop May 8, 2026
71 of 80 checks passed