Initial test of using gpu_topology by cdunbar13 · Pull Request #4150 · GoogleCloudPlatform/cluster-toolkit

cdunbar13 · 2025-05-20T14:01:05Z

In anticipation of new machine_types, this PR adds the accelerator_topology field to the Slurm controller. The field is eventually translated to gpuTopology in the placement policy for the Slurm cluster.

This can serve as a template for a similar field in the TPU version of the nodeset, which can share the same variable until it reaches resume.py.

This will be tested soon; do not merge.

mr0re1 · 2025-07-11T18:56:16Z


    return placements

+def calculate_hosts_per_topo(gpu_topology: str, machine_type: NSDict) -> int:


machine_type is a MachineType, not an NSDict. Please make sure that mypy is happy.

mr0re1 · 2025-07-11T18:57:25Z

+        top_split = [int(x) for x in str(gpu_topology).split("x")]
+    except Exception as e:
+        log.error("Incorrectly formatted accelerator topology")
+        return 0


Returning a magic number is bad taste. Let's raise an exception instead.
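A minimal sketch of what raising could look like. `parse_topology` is a hypothetical helper name, not something from this PR, and the `'<int>x<int>'` format is assumed from the `'1x72'` example in the code comment under review.

```python
from typing import Tuple


def parse_topology(accelerator_topology: str) -> Tuple[int, int]:
    """Parse a topology string such as '1x72' into two integers.

    Hypothetical helper: raises ValueError on bad input instead of
    signalling failure with a 0 return value.
    """
    parts = accelerator_topology.split("x")
    if len(parts) != 2:
        raise ValueError(
            f"Invalid accelerator_topology {accelerator_topology!r}: "
            "expected the form '<int>x<int>'"
        )
    try:
        return int(parts[0]), int(parts[1])
    except ValueError as e:
        # Re-raise with the offending value so the log line is actionable.
        raise ValueError(
            f"Invalid accelerator_topology {accelerator_topology!r}: "
            "both dimensions must be integers"
        ) from e
```

With this shape, a caller can no longer mistake a failure for a legitimate host count.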

mr0re1 · 2025-07-11T18:58:09Z

+    # Look for gpu topology first
+    if gpu_topology is not None:
+        hosts_per_topo = calculate_hosts_per_topo(gpu_topology, machine_type)
+        if hosts_per_topo > 0:


Catch the exception and log it instead of using a magic comparison.
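One way the call site might look, as a sketch only. The `calculate_hosts_per_topo` below is a simplified stand-in that takes a GPU count rather than a machine type, and `hosts_per_topo_or_none` is a hypothetical wrapper name; neither matches the PR's actual signatures.

```python
import logging
from typing import Optional

log = logging.getLogger(__name__)


def calculate_hosts_per_topo(accelerator_topology: str, gpus_per_machine: int) -> int:
    # Stand-in for the function under review: raises ValueError when the
    # string does not split into exactly two integers.
    rows, cols = (int(x) for x in accelerator_topology.split("x"))
    return (rows * cols) // gpus_per_machine


def hosts_per_topo_or_none(
    accelerator_topology: Optional[str], gpus_per_machine: int
) -> Optional[int]:
    """Return hosts per topology, or None if unset or invalid."""
    if accelerator_topology is None:
        return None
    try:
        return calculate_hosts_per_topo(accelerator_topology, gpus_per_machine)
    except ValueError as e:
        # Log the failure and fall back, rather than testing for "> 0".
        log.error("Ignoring accelerator_topology %r: %s", accelerator_topology, e)
        return None
```

Returning `None` here is an assumption about the desired fallback; the caller could equally let the exception propagate.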

mr0re1 · 2025-07-11T18:58:40Z


    return placements

+def calculate_hosts_per_topo(gpu_topology: str, machine_type: NSDict) -> int:


It would be nice to add unit tests.
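A sketch of what such tests might cover, using `unittest`. The `hosts_per_topo` function is a simplified stand-in written for this example (it takes a GPU count instead of a machine type and raises on bad input); the real tests would target the function in resume.py.

```python
import unittest


def hosts_per_topo(accelerator_topology: str, gpus_per_machine: int) -> int:
    """Stand-in for the function under review (hypothetical signature)."""
    try:
        rows, cols = (int(x) for x in accelerator_topology.split("x"))
    except ValueError as e:
        raise ValueError(f"Invalid accelerator_topology {accelerator_topology!r}") from e
    if gpus_per_machine <= 0 or cols % gpus_per_machine != 0:
        raise ValueError(
            f"accelerator_topology {accelerator_topology!r} is incompatible "
            f"with {gpus_per_machine} accelerators per machine"
        )
    return (rows * cols) // gpus_per_machine


class HostsPerTopoTest(unittest.TestCase):
    def test_valid(self):
        self.assertEqual(hosts_per_topo("1x72", 4), 18)
        self.assertEqual(hosts_per_topo("2x8", 8), 2)

    def test_invalid(self):
        # Malformed strings, no accelerators, and a non-divisible GPU count.
        cases = [("72", 4), ("axb", 4), ("1x72", 0), ("1x72", 5)]
        for topo, gpus in cases:
            with self.subTest(topo=topo, gpus=gpus):
                with self.assertRaises(ValueError):
                    hosts_per_topo(topo, gpus)
```

The cases can be run with `python -m unittest` against whichever module holds them.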

mr0re1 · 2025-07-11T18:59:44Z

+    try:
+        top_split = [int(x) for x in str(gpu_topology).split("x")]
+    except Exception as e:
+        log.error("Incorrectly formatted accelerator topology")


All errors / exceptions should contain the accelerator_topology value to be useful.

mr0re1 · 2025-07-11T19:00:17Z

+def calculate_hosts_per_topo(gpu_topology: str, machine_type: NSDict) -> int:
+    # Calculate total number of hosts per topology (Assumes format: '1x72')
+    try:
+        top_split = [int(x) for x in str(gpu_topology).split("x")]


gpu_topology is already a str; why cast?

mr0re1 · 2025-07-11T19:01:17Z


    return placements

+def calculate_hosts_per_topo(gpu_topology: str, machine_type: NSDict) -> int:


Nit: rename gpu_topology to topology, accelerator_topology, or topo.

mr0re1 · 2025-07-11T19:02:56Z

+        log.error("Incorrectly formatted accelerator topology")
+        return 0
+
+    gpus_per_machine = machine_type.accelerators[0].count


This will throw if len(machine_type.accelerators) == 0.
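A small sketch of guarding the lookup. The `Accelerator` and `MachineType` dataclasses are stand-ins defined only for this example; the toolkit's real `MachineType` is assumed to expose `accelerators[0].count` as the diff does, and nothing more is assumed about it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Accelerator:  # stand-in, not the toolkit's class
    count: int


@dataclass
class MachineType:  # stand-in, not the toolkit's class
    accelerators: List[Accelerator] = field(default_factory=list)


def gpus_per_machine(machine_type: MachineType) -> int:
    """Return the accelerator count, or 0 when the machine has none."""
    if not machine_type.accelerators:
        # Indexing [0] on an empty list would raise IndexError.
        return 0
    return machine_type.accelerators[0].count
```

With the guard in place, the later `gpus_per_machine <= 0` branch in the diff becomes reachable for machine types without accelerators.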

mr0re1 · 2025-07-11T19:03:27Z

+        log.error("Incorrectly formatted accelerator topology")
+    elif gpus_per_machine <= 0:
+        log.error("Cannot use accelerator topology with machine type that has no accelerators")
+    elif top_split[1] % machine_type.accelerators[0].count != 0:


s/machine_type.accelerators[0].count/gpus_per_machine/

mr0re1 · 2025-07-11T19:04:05Z

+    elif top_split[1] % machine_type.accelerators[0].count != 0:
+        log.error("GPU count per node must be a factor of the gpu topology")
+    else:
+        return (top_split[0] * top_split[1]) // gpus_per_machine


Check that top_split[0] > 0; otherwise we are in for a surprise.
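To illustrate the surprise, here is the same arithmetic as the `return` line in the diff, wrapped in a hypothetical `hosts_for` function, followed by a hypothetical `check_positive` guard. Both names are made up for this example.

```python
from typing import List


def hosts_for(top_split: List[int], gpus_per_machine: int) -> int:
    # Same expression as the line under review.
    return (top_split[0] * top_split[1]) // gpus_per_machine


def check_positive(top_split: List[int], accelerator_topology: str) -> None:
    """Reject zero or negative dimensions before doing the arithmetic."""
    if any(dim <= 0 for dim in top_split):
        raise ValueError(
            f"accelerator_topology {accelerator_topology!r} must have "
            "positive dimensions"
        )


hosts_for([1, 72], 4)    # 18: the expected case, '1x72' on 4-GPU hosts
hosts_for([0, 72], 4)    # 0: indistinguishable from the error return value
hosts_for([-1, 72], 4)   # -18: a negative host count
```

Note that `72 % 4 == 0` holds in all three cases, so the existing divisibility check alone does not catch the last two.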

mr0re1 · 2025-07-11T19:04:24Z

+    else:
+        return (top_split[0] * top_split[1]) // gpus_per_machine
+
+    return 0


mr0re1 · 2025-07-11T19:06:02Z



-def create_placement_request(pg_name: str, region: str, max_distance: Optional[int]):
+def create_placement_request(pg_name: str, region: str, max_distance: Optional[int], gpu_topology: Optional[str]):


Nit: rename gpu_topology to accelerator_topology where applicable, to make the codebase less confusing and more searchable.

mr0re1 · 2025-07-11T19:06:35Z

        nodeset = self.node_nodeset(node_name)
        return parse_self_link(nodeset.subnetwork).region

+    def nodeset_gpu_topology(self, nodeset_name: str) -> str:


The return type should be Optional[str].
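A sketch of the annotation, assuming the field may be unset on a nodeset. The free function and the plain-dict `nodesets` argument are simplifications for this example; in the PR the accessor is a method on a lookup object whose internals are not shown here.

```python
from typing import Dict, Optional


def nodeset_accelerator_topology(
    nodesets: Dict[str, dict], nodeset_name: str
) -> Optional[str]:
    """Return the nodeset's topology string, or None if it is not set."""
    # dict.get returns None for a missing key, which matches Optional[str].
    return nodesets[nodeset_name].get("accelerator_topology")
```

Annotating the return as `Optional[str]` lets mypy flag callers that forget the `None` case.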

Neelabh94 · 2026-04-30T05:18:37Z

@cdunbar13 I believe this is no longer needed, hence closing this. Please feel free to re-open if it is still needed.

Initial test of using gpu_topology

23c516c

cdunbar13 added the do-not-merge Block merging of this PR label May 20, 2025

cdunbar13 assigned alyssa-sm and mr0re1 May 20, 2025

mr0re1 reviewed May 20, 2025

Comment thread community/modules/compute/schedmd-slurm-gcp-v6-nodeset/outputs.tf

mr0re1 previously approved these changes May 20, 2025

cdunbar13 dismissed mr0re1’s stale review via a757612 May 22, 2025 17:44

mr0re1 reviewed May 22, 2025

Comment thread ...nity/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/resume.py Outdated

mr0re1 reviewed May 22, 2025

Comment thread ...nity/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/resume.py Outdated

cdunbar13 force-pushed the gpu_topology branch from a757612 to 18ab740 Compare May 23, 2025 14:13

cdunbar13 requested a review from mr0re1 May 23, 2025 14:13

cdunbar13 force-pushed the gpu_topology branch from 18ab740 to feed61a Compare May 23, 2025 16:04

Adding chunking of placements based on topology size

fa19730

cdunbar13 force-pushed the gpu_topology branch from feed61a to fa19730 Compare May 23, 2025 16:12

mr0re1 reviewed Jul 11, 2025

alyssa-sm mentioned this pull request Jul 15, 2025

Implement accelerator topology #4404

Merged

mr0re1 removed their assignment Oct 4, 2025

sudheer-quad added the external PR from external contributor label Feb 4, 2026

Neelabh94 closed this Apr 30, 2026


5 participants