fix(slurm): respect visible_core_count in cloud.conf generation#5529
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a configuration mismatch issue where Slurm nodes were failing to register due to incorrect CPU counts. By ensuring the controller respects the visible_core_count defined in the instance template, the generated cloud.conf now accurately reflects the restricted hardware capacity, resolving registration failures. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request updates the Slurm machine configuration logic in util.py to account for visibleCoreCount specified in instance templates. While this adds flexibility for specialized machine types, a technical issue was identified in the calculation of cores_per_socket. Specifically, using physical cores instead of logical CPUs leads to a configuration mismatch that violates Slurm's internal invariants when hyperthreading is active, potentially causing node registration failures. A suggestion was made to derive cores_per_socket from the total logical CPU count to ensure consistency.
LAVEEN
left a comment
There was a problem hiding this comment.
LGTM . Make sure all slurm test pass.
|
slurm a3h, a4h, a3m - their respective onspot tests are passing. |
9b10081
into
GoogleCloudPlatform:develop
Description
This PR fixes an issue where the
visible_core_countsetting (used to restrict OS-visible CPUs on supported machine types like C4) was ignored by the Slurm configuration generation scripts on the controller.Previously, the scripts defaulted to the full CPU capacity of the machine type when writing
cloud.conf, causing a mismatch with the actual restricted hardware. This led to nodes failing to register withINVALID_REG(Low CPUs) errors.Changes
template_machine_confincommunity/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/util.pyto check forvisibleCoreCountin the instance template'sadvancedMachineFeatures.CPUsandCoresPerSocketare calculated based on the restricted count instead of the full machine type capacity.Verification
visible_core_countto 18, 36, 72, and 144 onc4-highmem-288instances.scontrol show nodenow reports the correct restricted CPU counts (18, 36, 72, 144) inCfgTRESandCPUTot, resolving the configuration mismatch.lscpuwas blocked by project quota errors preventing VM creation, but the fix for the Slurm configuration side is confirmed).Fixes b/503047910