Helm Init containers fail after Job TTL expires, blocking pod restarts #2043

@junhaoliao

Description

Bug

Several Deployments use init containers (clp.waitFor with type: "job") that
run kubectl wait --for=condition=complete job/<name> to block until
db-table-creator or results-cache-indices-creator finishes. However, both
Jobs set ttlSecondsAfterFinished: 300, so Kubernetes deletes the Job objects
5 minutes after completion.
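The wait performed by these init containers can be approximated with the sketch below. It is not the chart's actual clp.waitFor template, which is not reproduced in this issue; the function name wait_for_job and the default --timeout value are assumptions made for illustration.

```shell
# Rough approximation of the init-container wait described above.
# wait_for_job and the 300s default timeout are hypothetical; the real
# template may pass different flags to kubectl.
wait_for_job() {
  job_name="$1"
  timeout="${2:-300s}"
  # kubectl wait exits non-zero if the Job never completes within the
  # timeout or if the Job object does not exist, so the init container
  # fails in both cases.
  kubectl wait --for=condition=complete "job/${job_name}" --timeout="${timeout}"
}

# Example (Job name taken from the error message in this issue):
#   wait_for_job test-clp-db-table-creator
```

Because the function returns kubectl's exit status unchanged, a deleted Job looks the same to the init container as a Job that never finished.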

Once the Jobs are garbage collected, the init containers fail with:

Error from server (NotFound): jobs.batch "test-clp-db-table-creator" not found

This puts the affected pods into Init:CrashLoopBackOff permanently. Any pod
created after the 5-minute TTL window (e.g., due to node rotation, an OOM kill,
a manual kubectl set image, or a scaling event) fails to start.

Affected Deployments (all that wait for a Job):

  • webui (waits for db-table-creator, results-cache-indices-creator)
  • api-server (waits for db-table-creator, results-cache-indices-creator)
  • compression-scheduler (waits for db-table-creator)
  • query-scheduler (waits for db-table-creator)
  • garbage-collector (waits for db-table-creator, results-cache-indices-creator)
  • mcp-server (waits for db-table-creator, results-cache-indices-creator)
  • log-ingestor (waits for db-table-creator)
  • reducer (waits for query-scheduler)

CLP version

eaf8c31

Environment

  • EKS (Kubernetes v1.32) in us-east-2
  • Also reproduced on kind (kindest/node:v1.35.0) on Ubuntu 22.04

Reproduction steps

  1. Deploy the Helm chart:
    helm install test tools/deployment/package-helm
  2. Wait for all pods to be ready and both Jobs to complete.
  3. Wait 5+ minutes for the Job TTL to expire:
    kubectl get jobs  # returns "No resources found"
  4. Restart any affected Deployment (e.g., delete a pod or change the image):
    kubectl delete pod -l app.kubernetes.io/component=webui
  5. The new pod enters Init:CrashLoopBackOff:
    NAME                              READY   STATUS                  RESTARTS   AGE
    test-clp-webui-68897c7cd7-5t488   0/1     Init:CrashLoopBackOff   3          2m
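Step 5 can be checked from a script with something like the following sketch. The function name pods_in_init_crashloop is hypothetical, and the sketch assumes the default kubectl get pods column layout shown above (NAME, READY, STATUS, ...).

```shell
# Prints the names of pods matching a label selector whose STATUS column
# is Init:CrashLoopBackOff. pods_in_init_crashloop is a hypothetical helper.
pods_in_init_crashloop() {
  selector="$1"
  # With --no-headers, column 1 is the pod name and column 3 is the status.
  kubectl get pods -l "$selector" --no-headers \
    | awk '$3 == "Init:CrashLoopBackOff" { print $1 }'
}

# Example:
#   pods_in_init_crashloop app.kubernetes.io/component=webui
```

An empty result shortly after deleting the pod does not rule out the bug, since the status only changes to Init:CrashLoopBackOff after the init container has failed a few times.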
    

Labels: bug (Something isn't working)
