Addition of a deletionTimestamp on a Pod should trigger a PodClique reconciliation#433
Conversation
47116bc to
2db94d6
Compare
|
Thanks @xulinfei1996 , very great finding. |
2db94d6 to
9498f10
Compare
sanjaychatterjee
left a comment
There was a problem hiding this comment.
Thanks for making the fix! Looks good to me.
@unmarshall can you please review as well?
|
Also, can we include a unit test as well? |
26dbce6 to
c860e56
Compare
|
Reviewing this, please do not merge this yet. |
There was a problem hiding this comment.
Thanks for the PR! Just a few nit comments.
The pods would eventually reach .spec.replicas in number after the DELETE event is sent, but for pods that are stuck in Terminating for long, the replacement pods might take a long time to be created if no event is generated for that specific PodClique.
Also, please fix the formatting through GOWORK=off make check, as the check CI job is failing.
|
@xulinfei1996 I'm not sure if you have bandwidth to work on this anymore. Can the maintainers take over so we can push these changes in? cc @kangclzjc |
1eb7380 to
b46f954
Compare
Signed-off-by: xulinfei <xulinfei@bytedance.com>
Signed-off-by: Saketh Kalaga <saketh.kalaga@sap.com>
b46f954 to
2113fec
Compare
renormalize
left a comment
There was a problem hiding this comment.
thanks for spotting the bug and raising the fix!
deletionTimestamp on a Pod should trigger a PodClique reconciliation
…que` reconciliation (ai-dynamo#433) --------- Signed-off-by: xulinfei <xulinfei@bytedance.com> Co-authored-by: Saketh Kalaga <saketh.kalaga@sap.com>
…que` reconciliation (ai-dynamo#433) --------- Signed-off-by: xulinfei <xulinfei@bytedance.com> Co-authored-by: Saketh Kalaga <saketh.kalaga@sap.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
What type of PR is this?
/kind bug
What this PR does / why we need it:
In our scenario, when a Node becomes NotReady, the following sequence of events occurs:
deletionTimestamp.In the PCLQ controller, step 1 triggers a reconcile, but since the Pods are not yet marked with a
deletionTimestamp, the controller does not recognize them as being in a terminating state. By step 2, although the Pods now have adeletionTimestamp, the PCLQ controller does not trigger another reconcile.This results in: Pods that are in a terminating state, but the PCLQ controller does not create new Pods to replace them, thereby affecting service availability and the expected replica count.
This PR aims to fix this issue by ensuring that when Pods enter a terminating state, the PCLQ controller can promptly detect this and create new Pods to maintain the desired replica count.
Which issue(s) this PR fixes:
Fixes #434
Special notes for your reviewer:
Does this PR introduce a API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: