
v1.8: CI: K8sDatapathConfig Etcd Check connectivity #11690

Description

@pchaigno

https://jenkins.cilium.io/job/Cilium-PR-Ginkgo-Tests-Kernel/1645/testReport/Suite-k8s-1/17/K8sDatapathConfig_Etcd_Check_connectivity/

According to the dashboard, this test has been flaky since it was added on May 15.
One Cilium pod is reported as unhealthy with one of the following errors:

Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": context deadline exceeded

However, the logs and other commands (e.g., cilium status or cilium endpoint list) show that everything is okay.
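
For context, the failing check is a PUT to /v1beta/status/probe over the cilium-health unix socket, as quoted in the errors above. Below is a minimal sketch of that request; the socket path and API path are taken from the error message, while the timeout value and the plain net/http client are assumptions for illustration and are not the actual cilium-health client code:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Socket path as it appears in the error message.
	const sock = "/var/run/cilium/health.sock"

	// HTTP client that dials the unix socket instead of TCP.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", sock)
			},
		},
		Timeout: 30 * time.Second, // assumed timeout, not the CLI's actual value
	}

	// The failing request in the report is a PUT to /v1beta/status/probe.
	req, err := http.NewRequest(http.MethodPut, "http://localhost/v1beta/status/probe", nil)
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("probe status:", resp.Status)
}
```

Both failure modes above map onto this request: "connect: no such file or directory" means the socket file is not present when the probe runs, while "context deadline exceeded" means the request was made but the probe did not return before the client timeout.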

Stacktrace

/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Kernel/k8s-1.17-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:471
cilium pre-flight checks failed
Expected
    <*errors.errorString | 0xc000291a30>: {
        s: "Cilium validation failed: 4m0s timeout expired: Last polled error: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-dfkrf': Exitcode: 255 \nStdout:\n \t \nStderr:\n \t Error: Cannot get status/probe: Put \"http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe\": context deadline exceeded\n\t \n\t command terminated with exit code 255\n\t \n",
    }
to be nil
/home/jenkins/workspace/Cilium-PR-Ginkgo-Tests-Kernel/k8s-1.17-gopath/src/github.com/cilium/cilium/test/helpers/manifest.go:297

Standard output

⚠️  Number of "context deadline exceeded" in logs: 20
Number of "level=error" in logs: 0
⚠️  Number of "level=warning" in logs: 30
Number of "Cilium API handler panicked" in logs: 0
Number of "Goroutine took lock for more than" in logs: 0
Top 5 errors/warnings:
Mutation detector is enabled, this will result in memory leakage.
BPF system config check: NOT OK.
BPF NodePort's external facing device could not be determined. Use --device to specify. Disabling BPF NodePort feature.
sessionAffinity for host reachable services needs kernel 5.7.0 or newer. Disabling sessionAffinity for cases when a service is accessed from a cluster.
Hubble server will be exposing its API insecurely on this address
Cilium pods: [cilium-dfkrf cilium-nhkx8]
Netpols loaded: 
CiliumNetworkPolicies loaded: 
Endpoint Policy Enforcement:
Pod                        Ingress   Egress
coredns-767d4c6dd7-4wcvd             
Cilium agent 'cilium-dfkrf': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 30 Failed 0
Cilium agent 'cilium-nhkx8': Status: Ok  Health: Ok Nodes "" ContinerRuntime:  Kubernetes: Ok KVstore: Ok Controllers: Total 24 Failed 0

Standard error

16:41:54 STEP: Deploying etcd-deployment.yaml in namespace kube-system
16:41:54 STEP: Waiting for 4m0s for 1 pods of deployment etcd-deployment.yaml to become ready
16:41:54 STEP: WaitforNPods(namespace="kube-system", filter="")
16:41:54 STEP: WaitforNPods(namespace="kube-system", filter="") => <nil>
16:41:54 STEP: Installing Cilium
16:41:54 STEP: Waiting for Cilium to become ready
16:41:54 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
16:41:59 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
16:42:04 STEP: Cilium DaemonSet not ready yet: only 0 of 2 desired pods are ready
16:42:09 STEP: Cilium DaemonSet not ready yet: only 1 of 2 desired pods are ready
16:42:14 STEP: Cilium DaemonSet not ready yet: only 1 of 2 desired pods are ready
16:42:19 STEP: Cilium DaemonSet not ready yet: only 1 of 2 desired pods are ready
16:42:24 STEP: Number of ready Cilium pods: 2
16:42:24 STEP: Installing DNS Deployment
16:42:24 STEP: Restarting DNS Pods
16:42:39 STEP: Validating Cilium Installation
16:42:39 STEP: Performing Cilium status preflight check
16:42:39 STEP: Performing Cilium health check
16:42:39 STEP: Performing Cilium controllers preflight check
16:42:47 STEP: Performing Cilium service preflight check
16:42:47 STEP: Performing K8s service preflight check
16:42:47 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-dfkrf': Exitcode: 255 
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": dial unix /var/run/cilium/health.sock: connect: no such file or directory
	 
	 command terminated with exit code 255
	 

16:42:47 STEP: Performing Cilium status preflight check
16:42:47 STEP: Performing Cilium controllers preflight check
16:42:47 STEP: Performing Cilium health check
16:42:55 STEP: Performing Cilium service preflight check
16:42:55 STEP: Performing K8s service preflight check
16:43:19 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-dfkrf': Exitcode: 255 
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": context deadline exceeded
	 
	 command terminated with exit code 255
	 

16:43:19 STEP: Performing Cilium controllers preflight check
16:43:19 STEP: Performing Cilium status preflight check
16:43:19 STEP: Performing Cilium health check
16:43:24 STEP: Performing Cilium service preflight check
16:43:24 STEP: Performing K8s service preflight check
16:43:51 STEP: Performing Cilium status preflight check
16:43:51 STEP: Performing Cilium health check
16:43:51 STEP: Performing Cilium controllers preflight check
16:43:59 STEP: Performing Cilium service preflight check
16:43:59 STEP: Performing K8s service preflight check
16:44:22 STEP: Performing Cilium status preflight check
16:44:22 STEP: Performing Cilium health check
16:44:22 STEP: Performing Cilium controllers preflight check
16:44:31 STEP: Performing Cilium service preflight check
16:44:31 STEP: Performing K8s service preflight check
16:44:54 STEP: Performing Cilium status preflight check
16:44:54 STEP: Performing Cilium controllers preflight check
16:44:54 STEP: Performing Cilium health check
16:45:02 STEP: Performing Cilium service preflight check
16:45:02 STEP: Performing K8s service preflight check
16:45:25 STEP: Performing Cilium status preflight check
16:45:25 STEP: Performing Cilium health check
16:45:25 STEP: Performing Cilium controllers preflight check
16:45:31 STEP: Performing Cilium service preflight check
16:45:31 STEP: Performing K8s service preflight check
16:45:57 STEP: Performing Cilium status preflight check
16:45:57 STEP: Performing Cilium controllers preflight check
16:45:57 STEP: Performing Cilium health check
16:46:04 STEP: Performing Cilium service preflight check
16:46:04 STEP: Performing K8s service preflight check
16:46:29 STEP: Cilium is not ready yet: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-dfkrf': Exitcode: 255 
Stdout:
 	 
Stderr:
 	 Error: Cannot get status/probe: Put "http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe": context deadline exceeded
	 
	 command terminated with exit code 255
	 

16:46:29 STEP: Performing Cilium status preflight check
16:46:29 STEP: Performing Cilium controllers preflight check
16:46:29 STEP: Performing Cilium health check
16:46:37 STEP: Performing Cilium service preflight check
16:46:37 STEP: Performing K8s service preflight check
FAIL: cilium pre-flight checks failed
Expected
    <*errors.errorString | 0xc000291a30>: {
        s: "Cilium validation failed: 4m0s timeout expired: Last polled error: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-dfkrf': Exitcode: 255 \nStdout:\n \t \nStderr:\n \t Error: Cannot get status/probe: Put \"http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe\": context deadline exceeded\n\t \n\t command terminated with exit code 255\n\t \n",
    }
to be nil
=== Test Finished at 2020-05-21T16:46:39Z====
16:46:39 STEP: Running JustAfterEach block for K8sDatapathConfig
===================== TEST FAILED =====================
16:46:39 STEP: Running AfterFailed block for K8sDatapathConfig
16:46:39 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:39 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:46:42 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:42 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
cmd: kubectl get pods -o wide --all-namespaces
Exitcode: 0 
Stdout:
 	 NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
	 kube-system   cilium-dfkrf                       1/1     Running   0          4m49s   192.168.36.12   k8s2   <none>           <none>
	 kube-system   cilium-nhkx8                       1/1     Running   0          4m49s   192.168.36.11   k8s1   <none>           <none>
	 kube-system   cilium-operator-5b7944dcc8-lhjdc   1/1     Running   0          4m49s   192.168.36.12   k8s2   <none>           <none>
	 kube-system   coredns-767d4c6dd7-4wcvd           1/1     Running   0          4m19s   10.0.1.212      k8s2   <none>           <none>
	 kube-system   etcd-k8s1                          1/1     Running   0          66m     192.168.36.11   k8s1   <none>           <none>
	 kube-system   kube-apiserver-k8s1                1/1     Running   0          66m     192.168.36.11   k8s1   <none>           <none>
	 kube-system   kube-controller-manager-k8s1       1/1     Running   0          66m     192.168.36.11   k8s1   <none>           <none>
	 kube-system   kube-proxy-khm7w                   1/1     Running   0          64m     192.168.36.12   k8s2   <none>           <none>
	 kube-system   kube-proxy-p6vvj                   1/1     Running   0          66m     192.168.36.11   k8s1   <none>           <none>
	 kube-system   kube-scheduler-k8s1                1/1     Running   0          66m     192.168.36.11   k8s1   <none>           <none>
	 kube-system   log-gatherer-hvfr2                 1/1     Running   0          64m     192.168.36.12   k8s2   <none>           <none>
	 kube-system   log-gatherer-mlt9z                 1/1     Running   0          64m     192.168.36.11   k8s1   <none>           <none>
	 kube-system   registry-adder-4q8s8               1/1     Running   0          64m     192.168.36.12   k8s2   <none>           <none>
	 kube-system   registry-adder-84kvl               1/1     Running   0          64m     192.168.36.11   k8s1   <none>           <none>
	 kube-system   stateless-etcd-7b9bfffcbd-lvcz7    1/1     Running   0          4m49s   192.168.36.12   k8s2   <none>           <none>
	 
Stderr:
 	 

Fetching command output from pods [cilium-dfkrf cilium-nhkx8]
16:46:45 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:45 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:46:47 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:47 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:46:49 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:49 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:46:51 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:51 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:46:52 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:53 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:46:56 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:56 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:46:58 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium")
16:46:59 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium") => <nil>
16:47:21 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
16:47:21 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
16:47:23 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
16:47:23 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
16:47:26 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs")
16:47:26 STEP: WaitforPods(namespace="kube-system", filter="-l k8s-app=cilium-test-logs") => <nil>
cmd: kubectl exec -n kube-system cilium-dfkrf -- cilium status
Exitcode: 0 
Stdout:
 	 KVStore:                Ok   etcd: 1/1 connected, lease-ID=7c0272381d05963a, lock lease-ID=7c0272381d05963c, has-quorum=true: http://10.101.171.39:2379 - 3.4.7 (Leader)
	 Kubernetes:             Ok   1.17 (v1.17.5) [linux/amd64]
	 Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumNetworkPolicy", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
	 KubeProxyReplacement:   Probe   ()   [HostReachableServices (TCP, UDP)]
	 Cilium:                 Ok      OK
	 NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory
	 Cilium health daemon:   Ok   
	 IPAM:                   IPv4: 3/255 allocated from 10.0.1.0/24, IPv6: 3/255 allocated from fd00::100/120
	 Controller Status:      30/30 healthy
	 Proxy Status:           OK, ip 10.0.1.223, 0 redirects active on ports 10000-20000
	 Hubble:                 Ok              Current/Max Flows: 4096/4096 (100.00%), Flows/s: 18.54   Metrics: Disabled
	 Cluster health:         1/2 reachable   (2020-05-21T16:46:00Z)
	   Name                  IP              Reachable   Endpoints reachable
	   k8s1                  192.168.36.11   true        false
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-dfkrf -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                       IPv6        IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                                             
	 2485       Disabled           Disabled          44021      k8s:io.cilium.k8s.policy.cluster=default          fd00::1be   10.0.1.212   ready   
	                                                            k8s:io.cilium.k8s.policy.serviceaccount=coredns                                    
	                                                            k8s:io.kubernetes.pod.namespace=kube-system                                        
	                                                            k8s:k8s-app=kube-dns                                                               
	 2986       Disabled           Disabled          4          reserved:health                                   fd00::1ca   10.0.1.49    ready   
	 4045       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s2                                                 ready   
	                                                            reserved:host                                                                      
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-nhkx8 -- cilium status
Exitcode: 0 
Stdout:
 	 KVStore:                Ok   etcd: 1/1 connected, lease-ID=7c0272381d059605, lock lease-ID=7c0272381d059607, has-quorum=true: http://10.101.171.39:2379 - 3.4.7 (Leader)
	 Kubernetes:             Ok   1.17 (v1.17.5) [linux/amd64]
	 Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumNetworkPolicy", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1beta1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
	 KubeProxyReplacement:   Probe   ()   [HostReachableServices (TCP, UDP)]
	 Cilium:                 Ok      OK
	 NodeMonitor:            Listening for events on 3 CPUs with 64x4096 of shared memory
	 Cilium health daemon:   Ok   
	 IPAM:                   IPv4: 2/255 allocated from 10.0.0.0/24, IPv6: 2/255 allocated from fd00::/120
	 Controller Status:      24/24 healthy
	 Proxy Status:           OK, ip 10.0.0.39, 0 redirects active on ports 10000-20000
	 Hubble:                 Ok              Current/Max Flows: 3821/4096 (93.29%), Flows/s: 13.87   Metrics: Disabled
	 Cluster health:         2/2 reachable   (2020-05-21T16:45:36Z)
	 
Stderr:
 	 

cmd: kubectl exec -n kube-system cilium-nhkx8 -- cilium endpoint list
Exitcode: 0 
Stdout:
 	 ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])          IPv6       IPv4         STATUS   
	            ENFORCEMENT        ENFORCEMENT                                                                               
	 253        Disabled           Disabled          4          reserved:health                      fd00::b7   10.0.0.247   ready   
	 1323       Disabled           Disabled          1          k8s:cilium.io/ci-node=k8s1                                   ready   
	                                                            k8s:node-role.kubernetes.io/master                                   
	                                                            reserved:host                                                        
	 
Stderr:

68ee983f_K8sDatapathConfig_Etcd_Check_connectivity.zip

Metadata
Labels

area/CI: Continuous Integration testing issue or flake
ci/flake: This is a known failure that occurs in the tree. Please investigate me!
stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.
