
calico/node token is invalidated by Kubernetes when the pod is evicted, leading to CNI failures #4857

@dghubble

Expected Behavior

The Calico CNI plugin tears down Pods in a timely manner.

Current Behavior

The Calico CNI plugin reports errors when terminating Pods, so eviction takes too long. This is especially relevant in Kubernetes conformance testing.

Aug 18 18:19:04.521: INFO: At 2021-08-18 18:18:01 +0000 UTC - event for taint-eviction-a1: {kubelet ip-10-0-8-52} FailedKillPod: error killing pod: failed to "KillPodSandbox" for "0701ef9b-e17d-43b5-a48f-89fa3ac00999" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"taint-eviction-a1_taint-multiple-pods-4011\" network: error getting ClusterInformation: connection is unauthorized: Unauthorized"

The natural thing to check is the RBAC permissions, which match the recommendations:

- apiGroups:
  - crd.projectcalico.org
  resources:
  - globalfelixconfigs
  - felixconfigurations
  - bgppeers
  - globalbgpconfigs
  - bgpconfigurations
  - ippools
  - ipamblocks
  - globalnetworkpolicies
  - globalnetworksets
  - networkpolicies
  - networksets
  - clusterinformations
  - hostendpoints
  - blockaffinities
  verbs:
  - get
  - list
  - watch
...

To be certain, we can use the actual kubeconfig that Calico writes to the host's /etc/cni/net.d. It does indeed have permission to get clusterinformations, so the Unauthorized error above is unexpected.

./kubectl --kubeconfig /etc/cni/net.d/calico-kubeconfig auth can-i get clusterinformations --all-namespaces
yes

Steps to Reproduce (for bugs)

sonobuoy run --e2e-focus="NoExecuteTaintManager Multiple Pods" --e2e-skip="" \
--plugin-env=e2e.E2E_EXTRA_ARGS="--non-blocking-taints=node-role.kubernetes.io/controller"

Context

This issue affects Kubernetes Conformance tests:

Summarizing 1 Failure:

[Fail] [sig-node] NoExecuteTaintManager Multiple Pods [Serial] [It] evicts pods with minTolerationSeconds [Disruptive] [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/onsi/ginkgo/internal/leafnodes/runner.go:113

The test in question creates two Pods that don't tolerate a taint and expects them to be terminated within certain times. The kubelet logs show the Calico CNI plugin reporting the error above, and termination takes too long.

Your Environment
