What would you like to be added?
Support dynamically updating a PodClique's podTemplate nodeAffinity rules without triggering pod deletions/restarts (a non-disruptive update), instead of going through a rolling update that recreates every pod.
Why is this needed?
Training workloads may encounter machine failures. The mitigation strategy includes:
- Automated/Manual Action: Adjust Job affinity to prevent scheduling new pods on failed nodes.
- Pod Recovery: Recreate affected pods with strict nodeAffinity rules against faulty nodes.
- Preservation: Unaffected pods continue running to minimize disruption.
Currently, in step 1 (adjusting the affinity), Grove deletes all pods during the rolling update, including unaffected ones.
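To make the desired behavior concrete, here is a minimal Python sketch of the mitigation described above. It is only an illustration, not Grove code: the function names, pod names, and node names are hypothetical, and it assumes failed nodes are excluded by hostname using the well-known `kubernetes.io/hostname` label with a `NotIn` match expression.

```python
from typing import Dict, List

# Well-known Kubernetes node label; using it here is an assumption of this sketch.
HOSTNAME_LABEL = "kubernetes.io/hostname"


def exclude_nodes_affinity(failed_nodes: List[str]) -> Dict:
    """Build a nodeAffinity stanza that keeps newly scheduled pods off failed nodes."""
    return {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {
                    "matchExpressions": [
                        {
                            "key": HOSTNAME_LABEL,
                            "operator": "NotIn",
                            "values": sorted(failed_nodes),
                        }
                    ]
                }
            ]
        }
    }


def pods_to_recreate(pod_to_node: Dict[str, str], failed_nodes: List[str]) -> List[str]:
    """Return only the pods placed on failed nodes; all other pods keep running."""
    failed = set(failed_nodes)
    return sorted(pod for pod, node in pod_to_node.items() if node in failed)


if __name__ == "__main__":
    # Hypothetical placement of three pods across three nodes.
    placement = {"worker-0": "node-a", "worker-1": "node-b", "worker-2": "node-c"}
    failed = ["node-b"]
    print(exclude_nodes_affinity(failed))
    print(pods_to_recreate(placement, failed))
```

With the requested feature, only the pods returned by `pods_to_recreate` would be recreated after the affinity change, while the remaining pods stay untouched.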