test: Wait until host EP is ready (=regenerated)#18859
|
@joestringer I requested your review mainly to validate the assumption that the "ready" state means the endpoint has been regenerated after startup. |
Previously (hopefully), we saw many CI flakes caused by the first request from outside the cluster to a k8s Service failing. E.g.,

    Can not connect to service "http://192.168.37.11:30146" from outside cluster

After some investigation it became obvious why this happened. Cilium-agent becomes ready before the host endpoints get regenerated (e.g., bpf_netdev_eth0.o). This leads to old programs handling requests, which might fail in different ways. For example, the following request failed in the K8sServicesTest Checks_N_S_loadbalancing Tests_with_direct_routing_and_DSR suite:

    {..., "IP":{"source":"192.168.56.13","destination":"10.0.1.105", ..., "trace_observation_point":"TO_OVERLAY","interface":{"index":40}, ...}

The previous suite ran in tunnel mode, so the old program was still trying to send the packet over a tunnel which no longer existed. This resulted in a silent drop.

Fix this by making the CI wait, after deploying Cilium, until the host EP is in the "ready" state. This should ensure that the host EP programs have been regenerated.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
joestringer
left a comment
Yep, endpoints first start in restoring state when they're restored from the filesystem, then they should transition through regenerating and become ready:
cilium/pkg/endpoint/endpoint.go, line 852 in ab7ff52
cilium/pkg/endpoint/endpoint.go, line 1273 in ab7ff52
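As a rough illustration of the lifecycle described in this comment, the sketch below models only the three states named here. It is a simplification under stated assumptions: the real endpoint state machine in cilium/pkg/endpoint has more states and transitions, and `allowed`, `CanTransition`, and `RegeneratedAfterRestore` are hypothetical names.

```go
package main

// State is a simplified endpoint state; only the states named in the
// comment above are modelled.
type State string

const (
	StateRestoring    State = "restoring"
	StateRegenerating State = "regenerating"
	StateReady        State = "ready"
)

// allowed lists the simplified transitions assumed by this sketch:
// a restored endpoint must pass through regenerating before it is ready,
// and a ready endpoint may be regenerated again later.
var allowed = map[State][]State{
	StateRestoring:    {StateRegenerating},
	StateRegenerating: {StateReady},
	StateReady:        {StateRegenerating},
}

// CanTransition reports whether moving from one state to another is
// permitted in this simplified model.
func CanTransition(from, to State) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

// RegeneratedAfterRestore reports whether a state history ends in ready,
// contains only permitted transitions, and includes a regeneration.
func RegeneratedAfterRestore(history []State) bool {
	if len(history) == 0 || history[len(history)-1] != StateReady {
		return false
	}
	sawRegenerating := false
	for i, s := range history {
		if i > 0 && !CanTransition(history[i-1], s) {
			return false
		}
		if s == StateRegenerating {
			sawRegenerating = true
		}
	}
	return sawRegenerating
}
```

Under this model, a restored endpoint cannot reach ready without regenerating first, which is the property the CI wait relies on.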
|
@brb Would it be feasible to backport this to v1.10? |
|
@pchaigno Sure. Do you want me to do that or a tophat? |
|
I'll take care of it and ping you if I need help. |
Fix #12511.