Include MemSpecLimit when calculating defmem #3300

Merged
mr0re1 merged 1 commit into
GoogleCloudPlatform:develop from
wiktorn:defmempernode_memspeclimit
Jan 10, 2025

@wiktorn wiktorn commented Nov 21, 2024

Contributor

To prevent the OOM killer from killing arbitrary processes on a node, you can define MemSpecLimit, which reserves part of the node's memory for the system and limits job memory to less than what is available on the node.

This example reserves 1024 MB of RAM for the system on the node:

      - id: debug_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          disk_size_gb: 30
          machine_type: n2d-standard-2
          node_conf:
            MemSpecLimit: 1024

But with this definition, running a job without an explicit memory request fails with:

$ srun -p debug hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

However, running the job with an explicit memory request:

$ srun -p debug --mem 100 hostname

succeeds, because it requests less memory.

This change subtracts the reserved memory from the total memory available on the instance before calculating DefMemPerCPU, so the default memory claim stays within the memory available to jobs.
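
As a rough illustration of the calculation described above (a minimal sketch, not the code merged in this PR): the function name `def_mem_per_cpu`, its parameters, and the 7000 MB node size are hypothetical, and it assumes the default is derived by dividing the node's memory evenly across its CPUs.

```python
def def_mem_per_cpu(real_memory_mb: int, cpus: int, mem_spec_limit_mb: int = 0) -> int:
    """Return a default per-CPU memory claim in MB.

    Hypothetical helper: reserves mem_spec_limit_mb for the system first,
    then splits what is left evenly across the CPUs.
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    schedulable_mb = real_memory_mb - mem_spec_limit_mb
    if schedulable_mb <= 0:
        raise ValueError("MemSpecLimit leaves no memory for jobs")
    # Never return 0, which would not be a usable per-CPU default.
    return max(1, schedulable_mb // cpus)


# Illustrative numbers only: a 2-CPU node reporting 7000 MB of memory.
without_reserve = def_mem_per_cpu(7000, 2)                       # 3500 MB per CPU
with_reserve = def_mem_per_cpu(7000, 2, mem_spec_limit_mb=1024)  # 2988 MB per CPU

# The old default claims the whole node (2 * 3500 = 7000 MB), which exceeds
# the 5976 MB left for jobs; the adjusted default fits.
assert without_reserve * 2 > 7000 - 1024
assert with_reserve * 2 <= 7000 - 1024
```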

This is most visible on 1-CPU nodes, but on larger nodes it can leave at least one CPU unavailable for scheduling.
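
The effect on CPU availability can be sketched with a simplified model. It assumes every job asks for one CPU and the default per-CPU memory, and that jobs can use only the node's memory minus MemSpecLimit. The function name and node sizes are hypothetical, and real Slurm scheduling takes more factors into account.

```python
def schedulable_cpus(real_memory_mb: int, cpus: int,
                     mem_spec_limit_mb: int, mem_per_cpu_mb: int) -> int:
    """Count how many single-CPU jobs fit on a node (simplified model).

    Each job claims one CPU and mem_per_cpu_mb of memory; only
    real_memory_mb - mem_spec_limit_mb is available to jobs.
    """
    usable_mb = max(0, real_memory_mb - mem_spec_limit_mb)
    return min(cpus, usable_mb // mem_per_cpu_mb)


# Hypothetical 8-CPU node with 32000 MB and MemSpecLimit of 1024 MB.
old_default = 32000 // 8            # 4000 MB, ignores the reservation
new_default = (32000 - 1024) // 8   # 3872 MB, accounts for it

# With the old default only 7 of 8 CPUs can be used; with the new one all 8.
assert schedulable_cpus(32000, 8, 1024, old_default) == 7
assert schedulable_cpus(32000, 8, 1024, new_default) == 8

# Hypothetical 1-CPU node with 4000 MB: the old default fits no job at all.
assert schedulable_cpus(4000, 1, 1024, 4000) == 0
```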

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@wiktorn wiktorn force-pushed the defmempernode_memspeclimit branch from 7243a02 to c030b28 Compare November 21, 2024 08:53
@wiktorn wiktorn requested a review from mr0re1 November 21, 2024 15:41
@wiktorn wiktorn added the release-improvements Added to release notes under the "Improvements" heading. label Nov 21, 2024
@mr0re1 mr0re1 self-assigned this Dec 19, 2024
@mr0re1 mr0re1 assigned wiktorn and unassigned mr0re1 Dec 19, 2024
@wiktorn wiktorn force-pushed the defmempernode_memspeclimit branch from c030b28 to 84e3d38 Compare December 20, 2024 17:04
@wiktorn wiktorn force-pushed the defmempernode_memspeclimit branch from 84e3d38 to a9f4617 Compare December 20, 2024 17:05
@wiktorn wiktorn assigned mr0re1 and unassigned wiktorn Dec 20, 2024
wiktorn commented Dec 20, 2024

Contributor Author

ready to go

mr0re1 commented Jan 10, 2025

Collaborator

/gcbrun

@mr0re1 mr0re1 enabled auto-merge January 10, 2025 21:41
@mr0re1 mr0re1 disabled auto-merge January 10, 2025 21:41
@mr0re1 mr0re1 merged commit c6cf753 into GoogleCloudPlatform:develop Jan 10, 2025