fix(slurm): allow hyphens in slurm_cluster_name for power managed nodes#5437
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a regression where Slurm cluster names containing hyphens caused node parsing failures. By moving away from a regex-based approach to a more deterministic, token-based parsing strategy, the system can now correctly identify cluster and nodeset components even when hyphens are present in the cluster name, preventing nodes from entering an incorrect state. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the _node_desc method in util.py to replace regex-based parsing with manual string manipulation for extracting node metadata. A significant issue was identified in the new logic for identifying nodesets, which is susceptible to incorrect matches when nodeset names share suffixes. It is recommended to use the existing cluster name configuration to reliably determine the boundary between the cluster and nodeset names, as provided in the suggested code block.
LAVEEN
left a comment
There was a problem hiding this comment.
LGTM , please make sure all slurm test pass successfully before merging.
4886c9a to
f1fa13b
Compare
12286d8
into
GoogleCloudPlatform:develop
This PR fixes a regression introduced in PR #4316 where the validation for
slurm_cluster_namewas relaxed to allow hyphens, but the runtime Python scripts (util.py) were not updated to handle them.The previous implementation used a regular expression (
node_desc_regex) that broke when parsing node names if the cluster name contained hyphens (e.g.,a3mega2-new). This caused nodes to be incorrectly ignored, leading to cluster creation failures where worker nodes were stuck in aCONFIGURINGstate.Replaced the fragile
node_desc_regexwith a robust token-based parser in_node_desc. Instead of relying on a regex split (which becomes ambiguous if both the cluster name and the nodeset contain hyphens), the new parser:nodeset_namekeys in the prefix to find the exact boundary between the cluster name and the nodeset name.This approach is extremely robust and eliminates edge-case ambiguities.
Verification Results: