
Retry uncordon in reboot procedure#762

Merged: theboringstuff merged 1 commit into main from fix/reboot-uncordon-fix on Aug 11, 2025

@theboringstuff (Collaborator) commented Aug 7, 2025

Description

  • After the reboot task, a node could be left in SchedulingDisabled status if graceful_reboot: True was set in procedure.yaml, because a failed uncordon was tolerated.

Fixes #654

Solution

  • Changed the drain/uncordon condition:
    • Previously, the node was drained and uncordoned if (1) graceful reboot is required and (2) the node is a k8s node (not a balancer).
    • Now, the node is drained and uncordoned only if (1) graceful reboot is required, (2) the node is a k8s node (not a balancer), (3) the node is NOT new (not added by add_node), and (4) the cluster is already installed (not a fresh install).
  • Uncordon is now performed with retries, and a failure is no longer tolerated, so the node cannot silently stay unschedulable.
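As a minimal Python sketch of the retry behaviour (not the code merged in this PR), the helper below retries kubectl uncordon and re-raises the last error once the retries run out instead of swallowing it. uncordon_with_retries, run_kubectl, attempts and delay are hypothetical names, and the default of 5 attempts spaced 10 seconds apart is an arbitrary assumption.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_kubectl(args: Sequence[str]) -> None:
    """Hypothetical runner: execute kubectl, raising CalledProcessError on a non-zero exit."""
    subprocess.run(["kubectl", *args], check=True)


def uncordon_with_retries(
    runner: Callable[[Sequence[str]], None],
    node_name: str,
    attempts: int = 5,   # assumed retry budget, not taken from the PR
    delay: float = 10.0,  # assumed pause between attempts, in seconds
) -> None:
    """Uncordon a node, retrying on failure and failing loudly when retries are exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            runner(["uncordon", node_name])
            return  # success: the node is schedulable again
        except subprocess.CalledProcessError:
            if attempt == attempts:
                # Out of retries: propagate the error rather than tolerate it,
                # so the node is never silently left in SchedulingDisabled.
                raise
            time.sleep(delay)
```

Usage would look like uncordon_with_retries(run_kubectl, "worker-1"). Re-raising on the final attempt is the key difference from the old behaviour described in #654, where the uncordon failure was tolerated and the node stayed unschedulable.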

Test Cases

These changes affect the following procedures:

  • fresh install (on freshly rebuilt nodes)
  • add_node (adding nodes of different roles: balancer, worker, control-plane)
  • upgrade (if packages are specified for install, upgrade, or removal during the upgrade)
  • restore (at the end of restore)
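The four-part condition from the Solution section can be checked against the procedures above with a small Python sketch. RebootContext and should_drain_and_uncordon are hypothetical names, and the scenario inputs are assumptions about each procedure (for example, that a fresh install runs before the cluster counts as installed), not values read from the actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebootContext:
    graceful_reboot: bool    # graceful_reboot: True in procedure.yaml
    is_k8s_node: bool        # control-plane or worker, not a pure balancer
    is_new_node: bool        # node is being added by add_node
    cluster_installed: bool  # False during a fresh install


def should_drain_and_uncordon(ctx: RebootContext) -> bool:
    """All four checks described in the PR must hold."""
    return (
        ctx.graceful_reboot
        and ctx.is_k8s_node
        and not ctx.is_new_node
        and ctx.cluster_installed
    )


# Assumed inputs for each procedure; graceful_reboot is enabled in all of them.
scenarios = {
    "fresh install": RebootContext(
        graceful_reboot=True, is_k8s_node=True, is_new_node=False, cluster_installed=False),
    "add_node (new worker)": RebootContext(
        graceful_reboot=True, is_k8s_node=True, is_new_node=True, cluster_installed=True),
    "add_node (new balancer)": RebootContext(
        graceful_reboot=True, is_k8s_node=False, is_new_node=True, cluster_installed=True),
    "upgrade (existing worker)": RebootContext(
        graceful_reboot=True, is_k8s_node=True, is_new_node=False, cluster_installed=True),
    "restore (existing control-plane)": RebootContext(
        graceful_reboot=True, is_k8s_node=True, is_new_node=False, cluster_installed=True),
}

for name, ctx in scenarios.items():
    print(f"{name}: drain/uncordon={should_drain_and_uncordon(ctx)}")
```

With these assumed inputs, only the upgrade and restore scenarios evaluate to True; nodes on a fresh install and freshly added nodes are rebooted without a drain/uncordon cycle.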

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • There are no breaking changes, or a migration patch is provided
  • Integration CI passed
  • Unit tests (if yes, list the new/changed tests with a brief description)
  • There are no merge conflicts

@theboringstuff changed the title "reboot uncordon" to "Mandatory uncordon after drain-reboot" on Aug 7, 2025
@theboringstuff marked this pull request as ready for review on August 7, 2025 08:28
@theboringstuff changed the title "Mandatory uncordon after drain-reboot" to "Mandatory uncordon in reboot procedure" on Aug 7, 2025
@theboringstuff changed the title "Mandatory uncordon in reboot procedure" to "Retry uncordon in reboot procedure" on Aug 7, 2025
@andrewluckyguy self-requested a review on August 11, 2025 09:40

@andrewluckyguy (Collaborator) left a comment

Verified by QA

@theboringstuff merged commit 8b2bfe1 into main on Aug 11, 2025
29 checks passed
The github-actions bot locked and limited conversation to collaborators on Aug 11, 2025

Development

Successfully merging this pull request may close these issues.

Reboot does kubectl uncordon tolerating to its failure thus leaving the node unschedulable

3 participants