Bug
Several Deployments use init containers (clp.waitFor with type: "job") that
run kubectl wait --for=condition=complete job/<name> to block until
db-table-creator or results-cache-indices-creator finishes. However, both
Jobs set ttlSecondsAfterFinished: 300, so Kubernetes deletes the Job objects
5 minutes after completion.
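While a Job still exists (that is, inside the TTL window), the configured TTL can be read back from its spec. The following is a minimal sketch, not part of the chart: `job_ttl` is a hypothetical helper name, and the Job name is passed in by the caller.

```shell
# Hypothetical helper: print the TTL (in seconds) configured on Job $1.
# Empty output means the Job has no ttlSecondsAfterFinished set.
# Fails if the Job has already been deleted.
job_ttl() {
  kubectl get job "$1" -o jsonpath='{.spec.ttlSecondsAfterFinished}'
}
```

For example, `job_ttl test-clp-db-table-creator` would be expected to print `300` for the TTL described above, assuming the Job name that appears in the error message in this report.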
Once the Jobs are garbage collected, the init containers fail with:
Error from server (NotFound): jobs.batch "test-clp-db-table-creator" not found
This puts the affected pods into Init:CrashLoopBackOff permanently. Any pod
that is restarted or recreated after the 5-minute TTL window (e.g., node
rotation, OOM kill, manual kubectl set image, scaling events) fails to start.
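The failure can also be reproduced by hand, outside any pod, by waiting on the same condition the init containers wait on. This is a sketch under assumptions: `wait_for_job` is a hypothetical helper, and the 10-second timeout is an arbitrary choice for interactive use, not a value taken from the chart.

```shell
# Hypothetical helper: wait for Job $1 to reach condition=complete,
# the same condition the init containers wait on.
# The 10s timeout is arbitrary and only keeps the command from hanging.
wait_for_job() {
  kubectl wait --for=condition=complete "job/$1" --timeout=10s
}
```

Once the TTL has expired, `wait_for_job test-clp-db-table-creator` should fail with the NotFound error quoted above, since the Job object no longer exists.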
Affected Deployments (all that wait for a Job):
webui (waits for db-table-creator, results-cache-indices-creator)
api-server (waits for db-table-creator, results-cache-indices-creator)
compression-scheduler (waits for db-table-creator)
query-scheduler (waits for db-table-creator)
garbage-collector (waits for db-table-creator, results-cache-indices-creator)
mcp-server (waits for db-table-creator, results-cache-indices-creator)
log-ingestor (waits for db-table-creator)
reducer (waits for query-scheduler)
CLP version
eaf8c31
Environment
- EKS (Kubernetes v1.32) in us-east-2
- Also reproduced on kind (kindest/node:v1.35.0) on Ubuntu 22.04
Reproduction steps
- Deploy the Helm chart:
helm install test tools/deployment/package-helm
- Wait for all pods to be ready and both Jobs to complete.
- Wait 5+ minutes for the Job TTL to expire:
kubectl get jobs # returns "No resources found"
- Restart any affected Deployment (e.g., delete a pod or change the image):
kubectl delete pod -l app.kubernetes.io/component=webui
- The new pod enters Init:CrashLoopBackOff:
NAME READY STATUS RESTARTS AGE
test-clp-webui-68897c7cd7-5t488 0/1 Init:CrashLoopBackOff 3 2m
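To confirm that the crash loop comes from the Job wait rather than another cause, the logs of the pod's init containers can be inspected. A minimal sketch, assuming nothing about the init container names (they are read from the pod spec); `init_logs` is a hypothetical helper name.

```shell
# Hypothetical helper: print the logs of every init container in pod $1.
# Container names are read from the pod spec, so none are hard-coded.
init_logs() {
  for c in $(kubectl get pod "$1" -o jsonpath='{.spec.initContainers[*].name}'); do
    echo "== $c =="
    kubectl logs "$1" -c "$c"
  done
}
```

Run against the pod above, for example `init_logs test-clp-webui-68897c7cd7-5t488`, the output should include the NotFound error for the deleted Job.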