Skip to content

Add Slurm V6 controller, partition & nodeset modules#1871

Merged
mr0re1 merged 1 commit into
GoogleCloudPlatform:developfrom
mr0re1:s6
Oct 26, 2023
Merged

Add Slurm V6 controller, partition & nodeset modules#1871
mr0re1 merged 1 commit into
GoogleCloudPlatform:developfrom
mr0re1:s6

Conversation

@mr0re1

@mr0re1 mr0re1 commented Oct 24, 2023

Copy link
Copy Markdown
Collaborator

Changes from V5 to V6

Login modules will be fed into controller module (in V5 it's other way around)

REMOVED obsolete variables:

  • enable_slurm_gcp_plugins
  • enable_reconfigure
  • enable_cleanup_subscriptions

Added login startup script:

  • added login_startup_script similar to compute and controller ones;

New logging variables:

  • enable_debug_logging & extra_logging_flags;

Limit static_ips to just one element

SlurmV6 takes only one IP address in static_ip, so we need to limit static_ips to at most one element.

Do not output anything from controller module.

  • cloud_logging_filter & pubsub_topic were used for login module in V5 but dependency direction was changed in V6;
  • controller_instance_id - doesn't seem to be used anywhere;

enable_placement moved from partition to nodeset

network moved from partition to nodeset

TODOs:

A3 effort

Forward-port variables from V5 to V6:

  • additional_networks
  • nic_type
  • access_config
  • total_egress_bandwidth_tier

Missing support for reservation_name - TODO

SKIPPED (present in SchedMD code, but omitted in Toolkit):

  • login_network_storage - use same logic as V5, use only common network_storage;
  • network_ip - use same logic as V5, use static_ips instead;
  • spot (+ termination_action) - not present in V5, omit in ToolkitV6, though SchedMD supports it. TODO(?); controller only, nodeset V6 will support it;
  • network(network_self_link) - V6 cluster doesn't provide network variable. Fallback to just subnetwork_self_link. TODO;

Allow many-to-many relation between nodeset and partition

Currently it's implicitly limited to many-nodesets to one-partition.
AI: after aggregation of all nodesets into a list, remove all duplicates.

@mr0re1 mr0re1 requested a review from nick-stroud October 24, 2023 06:11
@mr0re1 mr0re1 added the release-new-modules Added to release notes under the "New Modules" heading. label Oct 24, 2023
@mr0re1 mr0re1 requested a review from harshthakkar01 October 24, 2023 16:04
@cboneti cboneti assigned cboneti and unassigned harshthakkar01 Oct 24, 2023
@mr0re1 mr0re1 force-pushed the s6 branch 4 times, most recently from 11994ff to 3c1b016 Compare October 24, 2023 22:24
@cboneti cboneti removed the request for review from nick-stroud October 24, 2023 23:17

@cboneti cboneti left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the READMEs and set default image family to Rocky.

Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/README.md
@cboneti cboneti assigned mr0re1 and unassigned mr0re1 and cboneti Oct 24, 2023
@mr0re1 mr0re1 requested a review from cboneti October 25, 2023 05:37
@mr0re1 mr0re1 assigned cboneti and unassigned mr0re1 Oct 25, 2023

@cboneti cboneti left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one minor change as you left the enable_reconfigure key there. Either remove it, or explain it is no longer needed.

Comment thread community/modules/scheduler/schedmd-slurm-gcp-v6-controller/README.md Outdated
@cboneti cboneti assigned mr0re1 and unassigned cboneti Oct 25, 2023
@nick-stroud

Copy link
Copy Markdown
Collaborator

login_network_storage - use same logic as V5, use only common network_storage > I think this is worth some discussion. Personally I find the way that login magically inherits network storage from the controller to be confusing to users (Batch does the same thing I think and I don't really like it there either). The blueprint seems much more explicit and clear when the login has to declare its own dependencies.

PS. I have not reviewed the code and am mostly speaking from a v5 perspective. Maybe things have changed significanly with the inversion of the login/controller dependency order.

Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/README.md Outdated
Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/variables.tf
Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/main.tf Outdated
Comment thread community/modules/compute/schedmd-slurm-gcp-v6-partition/main.tf Outdated
@harshthakkar01

Copy link
Copy Markdown
Contributor

Also, is this PR ready to go to develop ? Just curious since we have slurm_v6 branch as well.

@mr0re1

mr0re1 commented Oct 25, 2023

Copy link
Copy Markdown
Collaborator Author

Also, is this PR ready to go to develop ? Just curious since we have slurm_v6 branch as well.

Yes, the plan is to land it into develop and remove slurm_v6 branch.

@mr0re1

mr0re1 commented Oct 25, 2023

Copy link
Copy Markdown
Collaborator Author

login_network_storage - use same logic as V5, use only common network_storage > I think this is worth some discussion. Personally I find the way that login magically inherits network storage from the controller to be confusing to users (Batch does the same thing I think and I don't really like it there either). The blueprint seems much more explicit and clear when the login has to declare its own dependencies.

Will create a task specifically for this.

* `enable_slurm_gcp_plugins`
* `enable_reconfigure`
* `enable_cleanup_subscriptions`

* added `login_startup_script` similar to compute and controller ones;

* `enable_debug_logging` & `extra_logging_flags`;

SlurmV6 takes only one IP address in `static_ip`, so we need to limit `static_ips` to at most one element.

* `cloud_logging_filter` & `pubsub_topic` were used for login module in V5 but dependency direction was changed in V6;
* `controller_instance_id` - doesn't seem to be used anywhere;

* `additional_networks`
* `nic_type`
* `access_config`
* `total_egress_bandwidth_tier`
*

* `login_network_storage` - use same logic as V5, use only common `network_storage`;
* `network_ip` - use same logic as V5, use `static_ips` instead;
* `spot` (+ `termination_action`) - not present in V5, omit in ToolkitV6, though SchedMD supports it. TODO(?); `controller` only, `nodeset` V6 will support it;
* `network(network_self_link)` - V6 cluster doesn't provide `network` variable. Fallback to just `subnetwork_self_link`. TODO;

Currently it's implicitly limited to many-nodesets to one-partition.
AI: after aggregation of all nodesets into a list, remove all duplicates.
@cboneti cboneti assigned mr0re1 and unassigned cboneti Oct 26, 2023
@mr0re1 mr0re1 merged commit 33bf402 into GoogleCloudPlatform:develop Oct 26, 2023
@mr0re1 mr0re1 deleted the s6 branch October 26, 2023 16:59
cdunbar13 pushed a commit to cdunbar13/cluster-toolkit that referenced this pull request Oct 26, 2023
…form#1871)

* `enable_slurm_gcp_plugins`
* `enable_reconfigure`
* `enable_cleanup_subscriptions`

* added `login_startup_script` similar to compute and controller ones;

* `enable_debug_logging` & `extra_logging_flags`;

SlurmV6 takes only one IP address in `static_ip`, so we need to limit `static_ips` to at most one element.

* `cloud_logging_filter` & `pubsub_topic` were used for login module in V5 but dependency direction was changed in V6;
* `controller_instance_id` - doesn't seem to be used anywhere;

* `additional_networks`
* `nic_type`
* `access_config`
* `total_egress_bandwidth_tier`
*

* `login_network_storage` - use same logic as V5, use only common `network_storage`;
* `network_ip` - use same logic as V5, use `static_ips` instead;
* `spot` (+ `termination_action`) - not present in V5, omit in ToolkitV6, though SchedMD supports it. TODO(?); `controller` only, `nodeset` V6 will support it;
* `network(network_self_link)` - V6 cluster doesn't provide `network` variable. Fallback to just `subnetwork_self_link`. TODO;

Currently it's implicitly limited to many-nodesets to one-partition.
AI: after aggregation of all nodesets into a list, remove all duplicates.
max-nag pushed a commit to max-nag/hpc-toolkit that referenced this pull request Oct 27, 2023
…form#1871)

* `enable_slurm_gcp_plugins`
* `enable_reconfigure`
* `enable_cleanup_subscriptions`

* added `login_startup_script` similar to compute and controller ones;

* `enable_debug_logging` & `extra_logging_flags`;

SlurmV6 takes only one IP address in `static_ip`, so we need to limit `static_ips` to at most one element.

* `cloud_logging_filter` & `pubsub_topic` were used for login module in V5 but dependency direction was changed in V6;
* `controller_instance_id` - doesn't seem to be used anywhere;

* `additional_networks`
* `nic_type`
* `access_config`
* `total_egress_bandwidth_tier`
*

* `login_network_storage` - use same logic as V5, use only common `network_storage`;
* `network_ip` - use same logic as V5, use `static_ips` instead;
* `spot` (+ `termination_action`) - not present in V5, omit in ToolkitV6, though SchedMD supports it. TODO(?); `controller` only, `nodeset` V6 will support it;
* `network(network_self_link)` - V6 cluster doesn't provide `network` variable. Fallback to just `subnetwork_self_link`. TODO;

Currently it's implicitly limited to many-nodesets to one-partition.
AI: after aggregation of all nodesets into a list, remove all duplicates.
@mr0re1 mr0re1 mentioned this pull request Nov 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-new-modules Added to release notes under the "New Modules" heading.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants