
threads_per_core=2 produces incorrect topology in cloud.conf #5668

@matx-alex

Description


Describe the bug
When advanced_machine_features.threads_per_core is set to 2 (SMT enabled), the auto-generated cloud.conf reports an incorrect socket/core/thread topology. The total CPU count is correct, but the breakdown into sockets, cores, and threads is wrong, which breaks CPU affinity and NUMA-aware scheduling.

Root cause: template_machine_conf() in util.py (line 2030) hardcodes machine_conf.threads_per_core = 1. The getThreadsPerCore() helper is called, but its result is used only for the CPU divisor and is never assigned to the machine config.

Steps to reproduce

  1. Create a nodeset with SMT enabled:
       - id: compute_node
         source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
         settings:
           machine_type: n2d-highmem-32
           advanced_machine_features:
             threads_per_core: 2
  2. Deploy the cluster.
  3. Compare the cloud.conf NodeName line with the slurmd -C output on a compute node.

Expected behavior

cloud.conf topology should match slurmd -C:
Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 CPUs=32

Actual behavior

cloud.conf generates:
Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=1 CPUs=32

Impact: task/affinity CPU binding is broken, NUMA-aware scheduling sees the wrong layout (1 socket × 32 cores × 1 thread instead of the real 2 × 8 × 2), and CR_Core_Memory can schedule two jobs on the same physical core.

Version (gcluster --version)

v1.90.0 (built from main branch, commit 8fb2919), and likely all earlier versions.

Blueprint

  - id: compute_node
    source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
    settings:
      machine_type: n2d-highmem-32
      advanced_machine_features:
        threads_per_core: 2

Output and logs

cloud.conf (generated):
NodeName=slurm0-computenode-[0-1] State=CLOUD RealMemory=254064 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=1 CPUs=32

slurmd -C (actual):
NodeName=slurm0-computenode-0 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=257414
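
The comparison between the two lines above can be automated with a small script. This is only a sketch for checking a deployed cluster: the helper names (parse_node_line, topology_diff) are made up for illustration and are not part of the toolkit, and the two input strings are copied from the output above.

```python
# Topology fields that should agree between cloud.conf and `slurmd -C`.
# RealMemory is deliberately left out: it differs by design.
TOPOLOGY_KEYS = ("Boards", "SocketsPerBoard", "CoresPerSocket", "ThreadsPerCore", "CPUs")

GENERATED = (
    "NodeName=slurm0-computenode-[0-1] State=CLOUD RealMemory=254064 Boards=1 "
    "SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=1 CPUs=32"
)
ACTUAL = (
    "NodeName=slurm0-computenode-0 CPUs=32 Boards=1 SocketsPerBoard=2 "
    "CoresPerSocket=8 ThreadsPerCore=2 RealMemory=257414"
)


def parse_node_line(line):
    """Split a NodeName line into a dict of its Key=Value fields."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # skip any token that is not Key=Value
            fields[key] = value
    return fields


def topology_diff(generated, actual):
    """Return {key: (generated_value, actual_value)} for mismatched topology keys."""
    gen = parse_node_line(generated)
    act = parse_node_line(actual)
    return {
        key: (gen.get(key), act.get(key))
        for key in TOPOLOGY_KEYS
        if gen.get(key) != act.get(key)
    }


if __name__ == "__main__":
    for key, (gen, act) in topology_diff(GENERATED, ACTUAL).items():
        print(f"{key}: cloud.conf={gen} slurmd -C={act}")
```

Boards and CPUs match in both lines, so only SocketsPerBoard, CoresPerSocket, and ThreadsPerCore are reported as mismatches.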

Execution environment

  • OS: Rocky Linux 8 (HPC image)
  • Machine type: n2d-highmem-32

Additional context

Workaround: override via node_conf in the blueprint:
node_conf:
  SocketsPerBoard: 2
  CoresPerSocket: 8
  ThreadsPerCore: 2

A fix in util.py template_machine_conf() would be to assign getThreadsPerCore(template) to machine_conf.threads_per_core instead of hardcoding 1, and to derive cores_per_socket so that it accounts for the thread count.
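
A rough, standalone sketch of the arithmetic such a fix would need is below. It is not the actual util.py code: Topology and derive_topology are hypothetical names, a single board is assumed, and the socket count is taken as an input rather than looked up from the machine type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    boards: int
    sockets_per_board: int
    cores_per_socket: int
    threads_per_core: int

    @property
    def cpus(self) -> int:
        # Slurm expects CPUs to equal the product of the four topology fields.
        return (
            self.boards
            * self.sockets_per_board
            * self.cores_per_socket
            * self.threads_per_core
        )


def derive_topology(cpus: int, sockets: int, threads_per_core: int) -> Topology:
    """Derive cores per socket from the CPU count, socket count, and thread count.

    Hypothetical helper: assumes one board and that `sockets` is already known.
    """
    if cpus <= 0 or sockets <= 0 or threads_per_core <= 0:
        raise ValueError("cpus, sockets and threads_per_core must be positive")
    cores_per_socket, remainder = divmod(cpus, sockets * threads_per_core)
    if remainder or cores_per_socket == 0:
        raise ValueError(
            f"{cpus} CPUs do not divide evenly into {sockets} sockets "
            f"x {threads_per_core} threads per core"
        )
    return Topology(
        boards=1,
        sockets_per_board=sockets,
        cores_per_socket=cores_per_socket,
        threads_per_core=threads_per_core,
    )


if __name__ == "__main__":
    # Values reported by `slurmd -C` on n2d-highmem-32 in this issue.
    print(derive_topology(cpus=32, sockets=2, threads_per_core=2))
```

With the values from this issue (32 CPUs, 2 sockets, 2 threads per core) the sketch yields 8 cores per socket, matching slurmd -C. With 1 socket and 1 thread per core it yields the 32 cores per socket that cloud.conf currently contains.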

Labels: bug
