(Slurm) Implement dynamic machine configurations via API#5514
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a 'machine_configs' variable to the Slurm nodeset modules, enabling more granular control over GCE machine type definitions. Additionally, it cleans up the daily integration test configuration by removing hardcoded accelerator settings that are now better handled through the updated module configuration. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a machine_configs variable to both the dynamic and static Slurm nodeset modules, enabling the definition of GCE machine types and counts. It also simplifies the ml-a4x-highgpu-slurm.yaml daily test build by removing manual configuration injections. The review feedback suggests improving the user experience by changing the machine_configs variable type from string to any, which would allow blueprint authors to provide configurations as native YAML objects instead of raw JSON strings, while using jsonencode() to maintain consistency with other complex inputs in the toolkit.
…arshK15/cluster-toolkit into fix/dynamic-accelerator-slurm
|
/gcbrun |
SwarnaBharathiMantena
left a comment
There was a problem hiding this comment.
LGTM!
Multiple tests are successful, especially the Slurm A4X PR test.
|
Test failures:
|
8812dcb
into
GoogleCloudPlatform:develop
Summary
This PR implements dynamic machine configuration lookups for Slurm nodeset modules. It serves as a follow-up to PR #5426, which enabled this dynamic API-based lookup for GKE.
Key Changes
Module Variable: Adds a
machine_configsvariable to both dynamic and standard Slurm nodeset modules.Blueprint Flexibility: The
machine_configsvariable is defined with a type of any. This allows users to provide the configuration as a native YAML map or object in the blueprint, which is more natural and less error-prone than providing a raw JSON string.Test Cleanup: Removes hardcoded accelerator settings in a4x-slurm daily integration test that is now handled automatically.