Skip to content

service: Fix wrong localEndpoints count in HealthCheckNodePort#11863

Merged
tgraf merged 1 commit intomasterfrom
pr/gandro/fix-healthchecknodeport-invalid-count
Jun 4, 2020
Merged

service: Fix wrong localEndpoints count in HealthCheckNodePort#11863
tgraf merged 1 commit intomasterfrom
pr/gandro/fix-healthchecknodeport-invalid-count

Conversation

@gandro
Copy link
Copy Markdown
Member

@gandro gandro commented Jun 3, 2020

This fixes an issue with the HealthCheckNodePort server where it
would non-deterministically sometimes return a non-zero
localEndpoints count on nodes which do not have local endpoints.

Because Cilium internally creates a service object per frontend IP, we
end up with multiple services sharing the same name. In the case where
a LoadBalancer service has externalTrafficPolicy=Local with no
local backends, Cilium will still create a ClusterIP sibling service
which retains the non-local backends. In that case, we must take care
to not incooperate the ClusterIP backends into the localEndpoints
count intended for external traffic. The final count is dependent on
the order in which services are added to the service manager, which
explains why the occurence of this bug was non-deterministic.

This commit fixes this issue by checking that the service may only
contain local backends before its count is added to the
HealthCheckNodePort server. The unit tests are adapated as well and
try to emulate the way the K8s watcher upserts services in the service
manager.

Fixes: #11043

Signed-off-by: Sebastian Wicki sebastian@isovalent.com

This fixes an issue with the `HealthCheckNodePort` server where it
would non-deterministically sometimes return a non-zero
`localEndpoints` count on nodes which do not have local endpoints.

Because Cilium internally creates a service object per frontend IP, we
end up with multiple services sharing the same name. In the case where
a `LoadBalancer` service has `externalTrafficPolicy=Local` with no
local backends, Cilium will still create a `ClusterIP` sibling service
which retains the non-local backends. In that case, we must take care
to not incooperate the `ClusterIP` backends into the `localEndpoints`
count intended for external traffic. The final count is dependent on
the order in which services are added to the service manager, which
explains why the occurence of this bug was non-deterministic.

This commit fixes this issue by checking that the service may only
contain local backends before its count is added to the
`HealthCheckNodePort` server. The unit tests are adapated as well and
try to emulate the way the K8s watcher upserts services in the service
manager.

Fixes: #11043

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro added kind/bug This is a bug in the Cilium logic. priority/release-blocker area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jun 3, 2020
@gandro gandro requested review from a team and brb June 3, 2020 14:12
@gandro
Copy link
Copy Markdown
Member Author

gandro commented Jun 3, 2020

test-me-please

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.0004%) to 36.916% when pulling e699b5e on pr/gandro/fix-healthchecknodeport-invalid-count into bdf98cb on master.

Copy link
Copy Markdown
Member

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 4, 2020
@tgraf tgraf merged commit f7b0378 into master Jun 4, 2020
@tgraf tgraf deleted the pr/gandro/fix-healthchecknodeport-invalid-count branch June 4, 2020 09:39
@christarazi
Copy link
Copy Markdown
Member

christarazi commented Jun 5, 2020

I attempted to backport this PR, but ran into non-trivial conflicts. It seems like it depends on #11085. Even though this PR was cherry-picked cleanly (except for the nodeTypes package import), it fails to compile as it depends on a different function signature for:

m.svc.UpsertService

Edit: To clarify, the conflicts are in pkg/service/service_test.go. I've dropped the test changes and kept the implementation intact, which includes the fix this PR has.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/k8s Impacts the kubernetes API, or kubernetes -> cilium internals translation layers. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

traffic policy local health check returns wrong status

9 participants