Add Slurm V6 controller, partition & nodeset modules #1871
cboneti left a comment:
Please fix the READMEs and set default image family to Rocky.
cboneti left a comment:
One minor change: you left the `enable_reconfigure` key there. Either remove it, or explain that it is no longer needed.
PS. I have not reviewed the code and am mostly speaking from a V5 perspective. Maybe things have changed significantly with the inversion of the login/controller dependency order.
Also, is this PR ready to go to develop? Just curious since we have the slurm_v6 branch as well.
Yes, the plan is to land it into
Will create a task specifically for this.
Changes from V5 to V6
* Login modules will be fed into controller module (in V5 it's other way around)
* REMOVED obsolete variables:
  * `enable_slurm_gcp_plugins`
  * `enable_reconfigure`
  * `enable_cleanup_subscriptions`
* Added login startup script:
  * `login_startup_script` similar to compute and controller ones;
* New logging variables:
  * `enable_debug_logging` & `extra_logging_flags`;
* Limit `static_ips` to just one element
  * SlurmV6 takes only one IP address in `static_ip`, so we need to limit `static_ips` to at most one element.
* Do not output anything from controller module.
  * `cloud_logging_filter` & `pubsub_topic` were used for login module in V5 but dependency direction was changed in V6;
  * `controller_instance_id` - doesn't seem to be used anywhere;
* `enable_placement` moved from partition to nodeset
* network moved from partition to nodeset
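
The `static_ips` limit listed above could be enforced with a Terraform validation rule. The following is a minimal sketch and not the code from this PR: the variable type, default, error text, and the `static_ip` local are assumptions made for illustration.

```hcl
# Sketch only: type, default, and error text are assumptions, not copied from the PR.
variable "static_ips" {
  description = "Static IP for the controller; at most one element, since Slurm V6 accepts a single static_ip."
  type        = list(string)
  default     = []

  validation {
    # Reject any list with more than one address at plan time.
    condition     = length(var.static_ips) <= 1
    error_message = "The Slurm V6 controller accepts at most one static IP address."
  }
}

locals {
  # Hypothetical local: the single address to pass on as static_ip, or null if none was given.
  static_ip = length(var.static_ips) > 0 ? var.static_ips[0] : null
}
```

Keeping the list-typed `static_ips` input, rather than renaming it to a string, would leave V5 blueprints that already pass a single address unchanged.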
TODOs:
* A3 effort
* Forward-port variables from V5 to V6:
  * `additional_networks`
  * `nic_type`
  * `access_config`
  * `total_egress_bandwidth_tier`
* Missing support for `reservation_name` - TODO
* SKIPPED (present in SchedMD code, but omitted in Toolkit):
  * `login_network_storage` - use same logic as V5, use only common `network_storage`;
  * `network_ip` - use same logic as V5, use `static_ips` instead;
  * `spot` (+ `termination_action`) - not present in V5, omit in ToolkitV6, though SchedMD supports it. TODO(?); `controller` only, `nodeset` V6 will support it;
  * `network(network_self_link)` - V6 cluster doesn't provide `network` variable. Fallback to just `subnetwork_self_link`. TODO;
* Allow many-to-many relation between nodeset and partition
  * Currently it's implicitly limited to many-nodesets to one-partition.
  * AI: after aggregation of all nodesets into a list, remove all duplicates.
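
The last action item (aggregate all nodesets into a list, then remove duplicates) might look roughly like this in Terraform. This is a sketch under stated assumptions: the `partitions` variable, its object shape, and the `nodeset_name` and `machine_type` fields are hypothetical and do not reflect the module's actual interface.

```hcl
# Sketch only: variable shape and field names are hypothetical.
variable "partitions" {
  description = "Hypothetical input: each partition carries the nodesets it uses."
  type = list(object({
    partition_name = string
    nodesets = list(object({
      nodeset_name = string
      machine_type = string
    }))
  }))
  default = []
}

locals {
  # Aggregate the nodesets of all partitions into one flat list.
  all_nodesets = flatten([for p in var.partitions : p.nodesets])

  # Group by name; a nodeset shared by several partitions shows up once per partition.
  nodesets_by_name = { for ns in local.all_nodesets : ns.nodeset_name => ns... }

  # Keep the first definition found for each name, which drops the duplicates.
  unique_nodesets = [for name, group in local.nodesets_by_name : group[0]]
}
```

Deduplicating by name assumes that two partitions referring to the same nodeset name always mean the same nodeset; conflicting definitions under one name would need an explicit check.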