PCLQ controller fails to replace terminating Pods on NotReady nodes due to missed reconcile #434

@xulinfei1996

What happened?

When a Node becomes NotReady, the Pods on it first transition to NotReady, which triggers a reconcile while they are not yet terminating. Later, the node controller deletes them and they receive a deletionTimestamp.
However, the PCLQ controller does not trigger a new reconcile when the deletionTimestamp is added. As a result, the Pods remain in a terminating state without being replaced, the actual replica count drops below the expected count, and service availability is affected.

What did you expect to happen?

When Pods on a NotReady node are marked for deletion (receive a deletionTimestamp), the PCLQ controller should detect this change and trigger a reconciliation. It should then promptly create new replacement Pods to maintain the desired replica count and ensure service availability.
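
To make the expected behaviour concrete, here is a minimal sketch of the check I have in mind. It is not Grove's actual code: the `Pod` struct and the helpers `running`, `deleted`, `deletionStarted`, and `replacementsNeeded` are hypothetical stand-ins that only model the deletionTimestamp transition and the replica arithmetic, without depending on any Kubernetes library.

```go
package main

import (
	"fmt"
	"time"
)

// Pod is a hypothetical stand-in holding only the Pod state discussed in
// this issue. It is not the Kubernetes Pod type.
type Pod struct {
	Name              string
	Ready             bool
	DeletionTimestamp *time.Time // nil until the Pod is marked for deletion
}

// running builds a ready Pod that has not been marked for deletion.
func running(name string) Pod {
	return Pod{Name: name, Ready: true}
}

// deleted builds a Pod that has received a deletionTimestamp.
func deleted(name string) Pod {
	now := time.Now()
	return Pod{Name: name, DeletionTimestamp: &now}
}

// isTerminating reports whether the Pod has a deletionTimestamp.
func isTerminating(p Pod) bool {
	return p.DeletionTimestamp != nil
}

// deletionStarted reports whether an update event is the one in which the
// deletionTimestamp was first set. This is the event that, per this issue,
// should enqueue a reconcile.
func deletionStarted(oldPod, newPod Pod) bool {
	return !isTerminating(oldPod) && isTerminating(newPod)
}

// replacementsNeeded counts how many new Pods are required so that the
// number of non-terminating Pods reaches the desired replica count.
func replacementsNeeded(desired int, pods []Pod) int {
	active := 0
	for _, p := range pods {
		if !isTerminating(p) {
			active++
		}
	}
	if desired > active {
		return desired - active
	}
	return 0
}

func main() {
	// Step 1: the node went NotReady, so the Pod is NotReady but not deleted.
	before := Pod{Name: "pod-b", Ready: false}
	// Step 2: the node controller deletes the Pod.
	after := deleted("pod-b")

	fmt.Println("enqueue reconcile:", deletionStarted(before, after))

	pods := []Pod{running("pod-a"), after, running("pod-c")}
	fmt.Println("replacements needed:", replacementsNeeded(3, pods))
}
```

With three desired replicas and one Pod terminating, the sketch reports that a reconcile should be enqueued and that one replacement is needed. In the real controller the same condition would presumably live in the Pod update event filter, but where exactly that belongs in Grove is an assumption on my part.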

Environment

  • Kubernetes version
  • Grove version
  • Scheduler details
  • Cloud provider or hardware configuration
  • Tools that you are using Grove together with
  • Anything else that is relevant
