Fix multiple bugs in notifying jobs during failed resume #4107

Merged
mr0re1 merged 1 commit into GoogleCloudPlatform:develop from mr0re1:resume_notify on May 12, 2025

@mr0re1 mr0re1 commented May 9, 2025

Collaborator
  • Fix invalid argument formatting that led to "Update of this parameter is not supported: Quota" errors;
  • Fix the order of node shutdown and job notification: previously, most job notifications failed because the job had already been killed;
  • Remove unneeded logging.
2025-05-09 22:11:59,303 DEBUG: run: ['/usr/local/bin/scontrol', 'update', 'jobid=17', "admincomment=GCP Error: Quota 'C2_CPUS' exceeded. Limit: 300.0 in region us-central1."]
2025-05-09 22:11:59,307 DEBUG: run: ['/usr/local/bin/scontrol', 'notify', '17', "GCP Error: Quota 'C2_CPUS' exceeded. Limit: 300.0 in region us-central1."]
2025-05-09 22:11:59,312 ERROR: Marking nodes qq-debugnodeset-[5003-5999] as DOWN, reason: GCP Error: Quota 'C2_CPUS' exceeded. Limit: 300.0 in region us-central1.
2025-05-09 22:11:59,312 DEBUG: run: ['/usr/local/bin/scontrol', 'update', 'nodename=qq-debugnodeset-[5003-5999]', 'state=down', "reason=GCP Error: Quota 'C2_CPUS' exceeded. Limit: 300.0 in region us-central1."]
$ sacct --format="JobID,AdminComment%80"
JobID                                                                            AdminComment 
------------ -------------------------------------------------------------------------------- 
17                   GCP Error: Quota 'C2_CPUS' exceeded. Limit: 300.0 in region us-central1.
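
For readers who want to see the shape of the two fixes, here is a minimal sketch that rebuilds the command sequence shown in the log above. It is not the code from this PR: the function names (`job_comment_cmd`, `job_notify_cmd`, `node_down_cmd`, `failed_resume_commands`) are hypothetical, and it assumes the original error came from the message being split on whitespace, so that a word such as `Quota` was read as a parameter name. Each `key=value` pair is kept as a single argv element, and the per-job commands are emitted before the node is marked down.

```python
# Illustrative sketch only; helper names are hypothetical, not from the PR.
SCONTROL = "/usr/local/bin/scontrol"  # path as it appears in the log above


def job_comment_cmd(job_id: int, message: str) -> list[str]:
    # The whole "admincomment=<message>" pair is ONE argv element, so
    # spaces inside the message are never parsed as separate parameters.
    return [SCONTROL, "update", f"jobid={job_id}", f"admincomment={message}"]


def job_notify_cmd(job_id: int, message: str) -> list[str]:
    # The message is passed as a single argument, spaces included.
    return [SCONTROL, "notify", str(job_id), message]


def node_down_cmd(nodelist: str, reason: str) -> list[str]:
    return [SCONTROL, "update", f"nodename={nodelist}", "state=down", f"reason={reason}"]


def failed_resume_commands(job_ids: list[int], nodelist: str, reason: str) -> list[list[str]]:
    """Return the commands in the order they should run."""
    commands: list[list[str]] = []
    # 1. Comment on and notify every affected job while it still exists.
    for job_id in job_ids:
        commands.append(job_comment_cmd(job_id, reason))
        commands.append(job_notify_cmd(job_id, reason))
    # 2. Only then mark the nodes down, which may kill the jobs.
    commands.append(node_down_cmd(nodelist, reason))
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` without `shell=True`, so no shell quoting of the message is needed.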

@mr0re1 mr0re1 requested a review from samskillman as a code owner May 9, 2025 22:23
@mr0re1 mr0re1 added the release-bugfix Added to release notes under the "Bug fixes" heading. label May 9, 2025
@mr0re1 mr0re1 requested a review from a team as a code owner May 9, 2025 22:23
@mr0re1 mr0re1 assigned mr0re1 and unassigned harshthakkar01 May 9, 2025
@mr0re1 mr0re1 changed the title from "Resume notify" to "Fix multiple bugs in notifying jobs during failed resume" May 9, 2025
@mr0re1 mr0re1 assigned harshthakkar01 and unassigned mr0re1 May 9, 2025
@mr0re1 mr0re1 merged commit e3e065e into GoogleCloudPlatform:develop May 12, 2025
13 of 66 checks passed
@mr0re1 mr0re1 deleted the resume_notify branch May 12, 2025 20:57
