Slow reconcile causes a deleted pod to never be recreated #457

@kangclzjc

Description

What happened?

There is a scenario that can cause a pod leak: a deleted pod is never recreated.

A PCLQ has 3 replicas (pod1, pod2, pod3).

  1. T0: While pod1 is Pending (for example, because of a resource limit or another issue), it is deleted manually with kubectl delete pod. The deletion succeeds.
  2. T1: The informer cache is updated and now contains only pod2 and pod3.
  3. T2: The PCLQ controller reconciles and computes:
    diff := len(sc.existingPCLQPods) + len(createExpectations) - int(sc.pclq.Spec.Replicas) - len(deleteExpectations) // diff = 2 + 1 - 3 - 0 = 0
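
To make the arithmetic at T2 concrete, here is a minimal Go sketch. It is not Grove code: `computeDiff` is a hypothetical helper that takes plain counts in place of the real slices and `Spec.Replicas`, and reading a negative result as "pods are missing" is an assumption about the sign convention.

```go
package main

import "fmt"

// computeDiff is a hypothetical stand-in for the expression quoted in the issue.
// Each argument replaces the length of the corresponding slice (or Spec.Replicas).
func computeDiff(existingPods, createExpectations, replicas, deleteExpectations int) int {
	return existingPods + createExpectations - replicas - deleteExpectations
}

func main() {
	// State at T2: pod2 and pod3 are in the cache, one create expectation
	// is still outstanding, 3 replicas are desired, nothing is being deleted.
	fmt.Println(computeDiff(2, 1, 3, 0)) // 0

	// Same cache contents without the leftover create expectation.
	// Assumption: a negative value means the controller is short of pods.
	fmt.Println(computeDiff(2, 0, 3, 0)) // -1
}
```

The leftover create expectation is what hides the missing pod: with it the result is 0, without it the result is -1.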

In this case, pod1 is never recreated, because diff is 0 and the controller sees nothing to do. The cause is that reconcile can be slow under some conditions, so the informer cache is updated before reconcile runs. SyncExpectations assumes that some pods are still terminating, in other words that pod deletion is slower than reconcile. As a result, we cannot tell these two scenarios apart:

  1. The informer has not yet seen the created pod.
  2. The pod has already been deleted.

What did you expect to happen?

Pod1 should be recreated.

Environment

  • Kubernetes version
  • Grove version
  • Scheduler details
  • Cloud provider or hardware configuration
  • Tools that you are using Grove together with
  • Anything else that is relevant
