Describe the bug
When advanced_machine_features.threads_per_core is set to 2 (SMT enabled), the auto-generated cloud.conf has an incorrect socket/core/thread topology. The total CPU count is correct, but the topology breakdown is wrong, which breaks CPU affinity and NUMA-aware scheduling.
Root cause: util.py line 2030 in template_machine_conf() hardcodes machine_conf.threads_per_core = 1. The getThreadsPerCore() helper is called but only used for the CPU divisor, never assigned to the machine config.
Steps to reproduce
- Create a nodeset with SMT enabled:
    - id: compute_node
      source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
      settings:
        machine_type: n2d-highmem-32
        advanced_machine_features:
          threads_per_core: 2
- Deploy cluster
- Compare cloud.conf NodeName line with slurmd -C output on a compute node
Expected behavior
cloud.conf topology should match slurmd -C:
Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 CPUs=32
Actual behavior
cloud.conf generates:
Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=1 CPUs=32
Impact: task/affinity CPU binding is broken, NUMA-aware scheduling sees the wrong layout (1 socket × 32 cores × 1 thread instead of the real 2 × 8 × 2), and CR_Core_Memory can schedule two jobs on the same physical core.
Version (gcluster --version)
v1.90.0 (built from main branch, commit 8fb2919), and likely all earlier versions.
Blueprint
- id: compute_node
  source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
  settings:
    machine_type: n2d-highmem-32
    advanced_machine_features:
      threads_per_core: 2
Output and logs
cloud.conf (generated):
NodeName=slurm0-computenode-[0-1] State=CLOUD RealMemory=254064 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=1 CPUs=32
slurmd -C (actual):
NodeName=slurm0-computenode-0 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=257414
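The mismatch between the two lines above can be checked mechanically. The following is a minimal sketch, not part of the toolkit: `parse_node_line` and `topology_diff` are hypothetical helper names, and the script only assumes that both lines are whitespace-separated `Key=Value` tokens, as in the logs quoted here.

```python
# Hypothetical helper for comparing a cloud.conf NodeName line with `slurmd -C` output.
# Not part of the toolkit; it only parses the Key=Value tokens shown in this report.

TOPOLOGY_KEYS = ("Boards", "SocketsPerBoard", "CoresPerSocket", "ThreadsPerCore", "CPUs")


def parse_node_line(line: str) -> dict[str, str]:
    """Split a Slurm 'Key=Value Key=Value ...' node line into a dict of strings."""
    return dict(token.split("=", 1) for token in line.split() if "=" in token)


def topology_diff(generated: str, detected: str) -> dict[str, tuple]:
    """Return {key: (generated_value, detected_value)} for topology keys that differ."""
    gen = parse_node_line(generated)
    det = parse_node_line(detected)
    return {
        key: (gen.get(key), det.get(key))
        for key in TOPOLOGY_KEYS
        if gen.get(key) != det.get(key)
    }


# The two lines from "Output and logs" in this report.
generated = (
    "NodeName=slurm0-computenode-[0-1] State=CLOUD RealMemory=254064 "
    "Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=1 CPUs=32"
)
detected = (
    "NodeName=slurm0-computenode-0 CPUs=32 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=8 ThreadsPerCore=2 RealMemory=257414"
)

for key, (conf_value, real_value) in topology_diff(generated, detected).items():
    print(f"{key}: cloud.conf={conf_value} slurmd={real_value}")
```

For the lines in this report, only SocketsPerBoard, CoresPerSocket and ThreadsPerCore differ; Boards and CPUs agree, which is why the total CPU count looks correct.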
Execution environment
- OS: Rocky Linux 8 (HPC image)
- Machine type: n2d-highmem-32
Additional context
Workaround: override via node_conf in the blueprint:
node_conf:
  SocketsPerBoard: 2
  CoresPerSocket: 8
  ThreadsPerCore: 2
The fix in util.py template_machine_conf() would be to use getThreadsPerCore(template) instead of hardcoding 1, and derive cores_per_socket accounting for the thread count.
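The arithmetic behind the proposed fix can be sketched as follows. This is not the actual code from util.py: `MachineTopology` and `derive_topology` are hypothetical names, and the socket count is passed in as a parameter because this report does not show how util.py determines it. The sketch only illustrates dividing the CPU count by both sockets and threads per core, rather than assuming one thread per core.

```python
from dataclasses import dataclass


@dataclass
class MachineTopology:
    """Hypothetical stand-in for the topology fields written to cloud.conf."""
    cpus: int
    sockets: int
    cores_per_socket: int
    threads_per_core: int


def derive_topology(guest_cpus: int, sockets: int, threads_per_core: int) -> MachineTopology:
    """Derive cores_per_socket so that sockets * cores * threads == guest_cpus.

    Assumption: guest_cpus is the number of logical CPUs visible to the VM,
    and threads_per_core is the value requested in advanced_machine_features.
    """
    if guest_cpus <= 0 or sockets <= 0 or threads_per_core <= 0:
        raise ValueError("guest_cpus, sockets and threads_per_core must be positive")
    if guest_cpus % (sockets * threads_per_core) != 0:
        raise ValueError(
            f"{guest_cpus} CPUs cannot be split evenly into "
            f"{sockets} sockets x {threads_per_core} threads per core"
        )
    cores_per_socket = guest_cpus // (sockets * threads_per_core)
    return MachineTopology(
        cpus=guest_cpus,
        sockets=sockets,
        cores_per_socket=cores_per_socket,
        threads_per_core=threads_per_core,
    )


# Values reported by `slurmd -C` for n2d-highmem-32 in this issue.
fixed = derive_topology(32, sockets=2, threads_per_core=2)
print(fixed)

# The currently generated layout corresponds to one socket and one thread per core.
current = derive_topology(32, sockets=1, threads_per_core=1)
print(current)
```

With the values from this issue, the first call gives CoresPerSocket=8 and the second gives CoresPerSocket=32, matching the expected and actual cloud.conf lines respectively.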