test: Wait until host EP is ready (=regenerated) by brb · Pull Request #18859 · cilium/cilium

brb · 2022-02-18T20:17:40Z

Previously (hopefully), we saw many CI flakes caused by the first
request from outside the cluster to a k8s Service failing. E.g.,

Can not connect to service "http://192.168.37.11:30146" from outside
cluster

After some investigation it became obvious why it happened. Cilium-agent
becomes ready before the host endpoints get regenerated (e.g.,
bpf_netdev_eth0.o). This leads to old programs handling requests,
which might fail in different ways. For example, the following request
failed in the K8sServicesTest Checks_N_S_loadbalancing
Tests_with_direct_routing_and_DSR suite:

{..., "IP":{"source":"192.168.56.13","destination":"10.0.1.105", ...,
"trace_observation_point":"TO_OVERLAY","interface":{"index":40}, ...}

The previous suite was running in tunnel mode, so the old program
was still trying to send the packet over the tunnel, which no longer
existed. This resulted in a silent drop.
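
The symptom described above can be checked mechanically when reading trace output. The following is a minimal sketch, assuming events arrive as JSON objects with the fields visible in the excerpt; the `FlowEvent` type and `SentToOverlay` helper are hypothetical names, and the sample event contains only the fields shown above (the elided ones are left out, not reconstructed).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FlowEvent is a hypothetical struct covering only the fields visible in the
// excerpt above; the real event carries more fields.
type FlowEvent struct {
	IP struct {
		Source      string `json:"source"`
		Destination string `json:"destination"`
	} `json:"IP"`
	TraceObservationPoint string `json:"trace_observation_point"`
	Interface             struct {
		Index int `json:"index"`
	} `json:"interface"`
}

// SentToOverlay reports whether the event was observed on its way to the
// overlay (tunnel) device. In a direct routing suite this hints that a stale
// tunnel-mode program is still attached.
func SentToOverlay(raw []byte) (bool, error) {
	var ev FlowEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		return false, fmt.Errorf("decode flow event: %w", err)
	}
	return ev.TraceObservationPoint == "TO_OVERLAY", nil
}

func main() {
	// Trimmed-down sample built only from the fields shown in the excerpt.
	sample := []byte(`{"IP":{"source":"192.168.56.13","destination":"10.0.1.105"},` +
		`"trace_observation_point":"TO_OVERLAY","interface":{"index":40}}`)

	stale, err := SentToOverlay(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sent to overlay:", stale)
}
```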

Fix this by making the CI wait after deploying Cilium until the host
EP is in the "ready" state. This should ensure that the host EP programs
have been regenerated.

Fix #12511.

brb · 2022-02-18T20:19:42Z

/test

brb · 2022-02-19T09:44:27Z

/test

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks service on same node

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

brb · 2022-02-19T09:45:24Z

@joestringer I requested your review mainly to validate the assumption that the "ready" state means the endpoint has been regenerated after startup.

brb · 2022-02-19T11:29:27Z

/test-gke

Job 'Cilium-PR-K8s-GKE' failed:


Test Name

K8sServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks service on same node

Failure Output

FAIL: Expected

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-GKE so I can create one.

brb · 2022-02-19T16:54:41Z

/test-gke

brb · 2022-02-19T16:54:54Z

/test-1.23-net-next

Job 'Cilium-PR-K8s-1.23-kernel-net-next' failed:


Test Name

K8sVerifier Runs the kernel verifier against Cilium's BPF datapath

Failure Output

FAIL: terminating containers are not deleted after timeout

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.23-kernel-net-next so I can create one.

brb · 2022-02-21T13:27:59Z

/test-1.23-net-next

Signed-off-by: Martynas Pumputis <m@lambda.lt>

brb · 2022-02-21T15:07:20Z

/test

brb · 2022-02-21T16:20:54Z

/test

joestringer

Yep, endpoints first start in restoring state when they're restored from the filesystem, then they should transition through regenerating and become ready:

cilium/pkg/endpoint/endpoint.go

Line 852 in ab7ff52

ep.setState(StateRestoring, "Endpoint restoring")

cilium/pkg/endpoint/endpoint.go

Line 1273 in ab7ff52

func (e *Endpoint) setState(toState State, reason string) bool {
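
The lifecycle described in this comment (restoring, then regenerating, then ready) can be illustrated with a small state machine. This is a simplified sketch under stated assumptions, not the code from pkg/endpoint: only the `setState` signature is borrowed from the line quoted above, while the three-state `allowed` table, the `log` field and the state string values are hypothetical, and the real agent has more states and rules than this.

```go
package main

import "fmt"

// State is a hypothetical, reduced set of endpoint states.
type State string

const (
	StateRestoring    State = "restoring"
	StateRegenerating State = "regenerating"
	StateReady        State = "ready"
)

// allowed is an assumed transition table covering only the three states
// mentioned in the comment above.
var allowed = map[State][]State{
	StateRestoring:    {StateRegenerating},
	StateRegenerating: {StateReady},
	StateReady:        {StateRegenerating}, // a ready endpoint may be regenerated again
}

// Endpoint is a stand-in type holding just the state and a transition log.
type Endpoint struct {
	state State
	log   []string
}

// setState moves the endpoint to toState if the transition is allowed and
// reports whether the state changed.
func (e *Endpoint) setState(toState State, reason string) bool {
	for _, next := range allowed[e.state] {
		if next == toState {
			e.log = append(e.log, fmt.Sprintf("%s -> %s: %s", e.state, toState, reason))
			e.state = toState
			return true
		}
	}
	return false
}

func main() {
	ep := &Endpoint{state: StateRestoring}

	// Jumping straight to ready is rejected: the endpoint must regenerate first.
	fmt.Println(ep.setState(StateReady, "skipping regeneration")) // false
	fmt.Println(ep.setState(StateRegenerating, "rebuilding"))     // true
	fmt.Println(ep.setState(StateReady, "regeneration done"))     // true
	fmt.Println(ep.state)                                         // ready
}
```

Under this model, observing "ready" on a restored endpoint implies that it has passed through "regenerating", which is the property the CI wait relies on.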

pchaigno · 2022-04-04T09:58:24Z

@brb Would it be feasible to backport this to v1.10?

brb · 2022-04-04T13:10:22Z

@pchaigno Sure. Do you want me to do that or a tophat?

pchaigno · 2022-04-04T13:15:38Z

I'll take care of it and ping you if I need help.

brb added area/CI and release-note/ci labels Feb 18, 2022

brb force-pushed the pr/brb/ci-wait-until-host-ep-regenerated branch from ce1643c to 2c6f95d Compare February 19, 2022 09:42

brb marked this pull request as ready for review February 19, 2022 09:44

brb requested a review from a team as a code owner February 19, 2022 09:44

brb requested a review from nebril February 19, 2022 09:44

brb requested a review from joestringer February 19, 2022 09:44

brb force-pushed the pr/brb/ci-wait-until-host-ep-regenerated branch from 2c6f95d to dd32ad5 Compare February 21, 2022 13:26

brb mentioned this pull request Feb 22, 2022

CI: K8sServicesTest Checks service across nodes Tests NodePort BPF Tests with direct routing Tests LoadBalancer Connectivity to endpoint via LB #16399

Closed

joestringer approved these changes Feb 22, 2022


brb added needs-backport/1.11 and ready-to-merge labels Feb 23, 2022

nebril merged commit 3b9b098 into master Feb 23, 2022

nebril deleted the pr/brb/ci-wait-until-host-ep-regenerated branch February 23, 2022 07:58

nebril mentioned this pull request Feb 23, 2022

v1.11 backports 2022-02-23 #18905

Merged

nebril added backport-pending/1.11 and removed needs-backport/1.11 labels Feb 23, 2022

joestringer added backport-done/1.11 and removed backport-pending/1.11 labels Mar 15, 2022

aanm mentioned this pull request Mar 26, 2022

Prepare for release v1.11.3 #19225

Merged

pchaigno mentioned this pull request Apr 4, 2022

v1.10 backports 2022-04-01 #19296

Merged

brb added the needs-backport/1.10 label Apr 4, 2022

pchaigno mentioned this pull request Apr 5, 2022

v1.10 backports 2022-04-05 #19331

Merged

pchaigno added backport-pending/1.10 and removed needs-backport/1.10 labels Apr 5, 2022

nbusseneau added backport-done/1.10 and removed backport-pending/1.10 labels Apr 12, 2022

joestringer mentioned this pull request Apr 15, 2022

Prepare for release v1.10.10 #19461

Merged
