
Regression from Helm 3: Hooks wait for the full timeout in case of failed resources #31729

@matheuscscp


What happened?

In Flux we have a simple test case for running a release test that fails. The test is a Pod like this:

{{- if .Values.faults.testFail }}
apiVersion: v1
kind: Pod
metadata:
  name: {{ template "podinfo.fullname" . }}-fault-test-{{ randAlphaNum 5 | lower }}
  namespace: {{ include "podinfo.namespace" . }}
  labels:
    {{- include "podinfo.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test-success
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
    sidecar.istio.io/inject: "false"
    linkerd.io/inject: disabled
    appmesh.k8s.aws/sidecarInjectorWebhook: disabled
spec:
  containers:
    - name: fault
      image: alpine:3.11
      command: ['/bin/sh']
      args:  ['-c', 'exit 1']
  restartPolicy: Never
{{- end }}

The legacy waiter from Helm 3 would see the Pod with .status.phase set to Failed and bail out of the wait early:

// waitForPodSuccess is a helper that waits for a pod to complete.
//
// This operates on an event returned from a watcher.
func (hw *legacyWaiter) waitForPodSuccess(obj runtime.Object, name string) (bool, error) {
	o, ok := obj.(*corev1.Pod)
	if !ok {
		return true, fmt.Errorf("expected %s to be a *v1.Pod, got %T", name, obj)
	}

	switch o.Status.Phase {
	case corev1.PodSucceeded:
		slog.Debug("pod succeeded", "pod", o.Name)
		return true, nil
	case corev1.PodFailed:
		slog.Error("pod failed", "pod", o.Name)
		return true, fmt.Errorf("pod %s failed", o.Name)
	}
	// (other phases fall through and the waiter keeps watching)
	return false, nil
}

The new (hard-coded!) wait strategy for release testing in Helm 4 is the status watcher (kstatus). This waiter does not bail out when it sees a Pod with .status.phase set to Failed:

	eventCh := sw.Watch(cancelCtx, resources, watcher.Options{
		RESTScopeStrategy: watcher.RESTScopeNamespace,
	})
	statusCollector := collector.NewResourceStatusCollector(resources)
	done := statusCollector.ListenWithObserver(eventCh, statusObserver(cancel, status.CurrentStatus))
	<-done

The result is Helm waiting for the full timeout instead of bailing out early.

What did you expect to happen?

I expected the Helm 3 behavior to be preserved.

How can we reproduce it (as minimally and precisely as possible)?

Install a chart with the test Pod above into a kind cluster using Helm 4, run the release test, and observe that Helm waits for the full timeout instead of bailing out as soon as the Pod fails.

Helm version

Helm 4

Kubernetes version

1.34

Labels: bug