
loadbalancer: Fix GetInstancesOfService to avoid removing endpoint from Service A cause requests to Service B fail if the name of Service A is the prefix of Service B#43620

Merged
joestringer merged 1 commit intocilium:mainfrom
imroc:fix-43619
Jan 8, 2026

Conversation

@imroc
Contributor

@imroc imroc commented Jan 8, 2026

The current GetInstancesOfService function returns the instances of every Service whose name starts with the specified name. As a result, removing an endpoint from Service A can cause requests to Service B to fail when the name of Service A is a prefix of the name of Service B (#43619).

This patch fixes the matching logic of GetInstancesOfService so that it matches the service name exactly.


When an endpoint is removed from an EndpointSlice, the backendRelease function is called to release the corresponding backend:

// pkg/loadbalancer/writer/writer.go
func backendRelease(be *loadbalancer.Backend, name loadbalancer.ServiceName) (*loadbalancer.Backend, bool) {
	instances := be.Instances
	if be.Instances.Len() == 1 {
		for k := range be.Instances.All() {
			if k.ServiceName == name {
				return nil, true
			}
		}
	}
	// delete instances if the service name is matched
	for k := range be.GetInstancesOfService(name) {
		instances = instances.Delete(k)
	}
	beCopy := *be
	beCopy.Instances = instances
	return &beCopy, beCopy.Instances.Len() == 0
}

Now look at GetInstancesOfService:

// pkg/loadbalancer/backend.go
func (be *Backend) GetInstancesOfService(name ServiceName) iter.Seq2[BackendInstanceKey, BackendParams] {
	return be.Instances.Prefix(BackendInstanceKey{ServiceName: name, SourcePriority: 0})
}

type BackendInstanceKey struct {
	ServiceName    ServiceName
	SourcePriority uint8
}

func (k BackendInstanceKey) Key() []byte {
	if k.SourcePriority == 0 {
		return k.ServiceName.Key()
	}
	sk := k.ServiceName.Key()
	buf := make([]byte, 0, 2+len(sk))
	buf = append(buf, sk...)
	return append(buf, ' ', k.SourcePriority)
}

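The []byte returned by BackendInstanceKey.Key always starts with the service name, and with SourcePriority set to 0 it is the service name alone.
This means the prefix lookup in GetInstancesOfService(Service A) also matches the instances of Service B whenever the name of Service B starts with the name of Service A.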
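
The collision can be reproduced in isolation with a minimal sketch. Everything here is a simplified, hypothetical stand-in: instanceKey replaces BackendInstanceKey, a plain string replaces ServiceName.Key(), and bytes.HasPrefix over a slice replaces the Prefix lookup on be.Instances. The service names are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// instanceKey is a hypothetical, simplified stand-in for BackendInstanceKey.
type instanceKey struct {
	serviceName    string
	sourcePriority uint8
}

// key mirrors the pre-fix encoding quoted above: the bare service name when
// the priority is 0, otherwise name + ' ' + priority.
func (k instanceKey) key() []byte {
	if k.sourcePriority == 0 {
		return []byte(k.serviceName)
	}
	buf := make([]byte, 0, 2+len(k.serviceName))
	buf = append(buf, k.serviceName...)
	return append(buf, ' ', k.sourcePriority)
}

// prefixMatches returns the service names of all stored keys whose encoded
// form begins with the encoded form of query (a stand-in for a prefix lookup).
func prefixMatches(stored []instanceKey, query instanceKey) []string {
	var out []string
	q := query.key()
	for _, k := range stored {
		if bytes.HasPrefix(k.key(), q) {
			out = append(out, k.serviceName)
		}
	}
	return out
}

func main() {
	stored := []instanceKey{
		{serviceName: "default/foo"},
		{serviceName: "default/foo-bar"},
	}
	// Looking up "default/foo" also returns "default/foo-bar".
	fmt.Println(prefixMatches(stored, instanceKey{serviceName: "default/foo"}))
	// prints: [default/foo default/foo-bar]
}
```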

This is the root cause of #43619.

This PR fixes GetInstancesOfService to use an exact match on the service name instead of a prefix match.
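
A minimal sketch of the exact-match idea, assuming the prefix lookup is kept and its results are filtered by service name. The types and helpers (instanceKey, prefixScan, exactInstances) are hypothetical and use plain slices instead of the real iterator-based API; this is not the code from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceKey is a hypothetical, simplified stand-in for BackendInstanceKey.
type instanceKey struct {
	serviceName    string
	sourcePriority uint8
}

// prefixScan imitates a prefix lookup: it returns every stored key whose
// service name starts with name.
func prefixScan(stored []instanceKey, name string) []instanceKey {
	var out []instanceKey
	for _, k := range stored {
		if strings.HasPrefix(k.serviceName, name) {
			out = append(out, k)
		}
	}
	return out
}

// exactInstances keeps only the prefix-scan results whose service name is
// exactly name, dropping services that merely share the prefix.
func exactInstances(stored []instanceKey, name string) []instanceKey {
	var out []instanceKey
	for _, k := range prefixScan(stored, name) {
		if k.serviceName == name {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	stored := []instanceKey{
		{serviceName: "default/foo"},
		{serviceName: "default/foo", sourcePriority: 1},
		{serviceName: "default/foo-bar"},
	}
	for _, k := range exactInstances(stored, "default/foo") {
		fmt.Printf("%s priority=%d\n", k.serviceName, k.sourcePriority)
	}
	// prints:
	// default/foo priority=0
	// default/foo priority=1
}
```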

Fixes: #43619

loadbalancer: Fix GetInstancesOfService to avoid removing an endpoint from Service A causes all requests to Service B to fail if the name of Service A is the prefix of Service B

@imroc imroc requested a review from a team as a code owner January 8, 2026 08:07
@imroc imroc requested a review from dylandreimerink January 8, 2026 08:07
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 8, 2026
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 8, 2026
@imroc imroc changed the title loadbalancer: Fix GetInstancesOfService to avoid removing an endpoint from Service A causes all requests to Service B to fail if the name of Service A is the prefix of Service B loadbalancer: Fix GetInstancesOfService to avoid removing endpoint from Service A cause requests to Service B fail if the name of Service A is the prefix of Service B Jan 8, 2026
@joamaki
Contributor

joamaki commented Jan 8, 2026

The BackendInstanceKey uses space ( ) as separator between the name and the priority. We can fix this issue either the way done in the PR (do prefix search and check the name for full match) or we can change BackendInstanceKey.Key() to include the space in the returned key. I'm fine with either.

Could you look into extending the tests so we have a regression test for this? Perhaps as unit test in https://github.com/cilium/cilium/blob/main/pkg/loadbalancer/writer/writer_test.go (TestWriter_WithConflictingSources seems pretty close?) or as a new script test in pkg/loadbalancer/tests/testdata.

@joamaki joamaki added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jan 8, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 8, 2026
@joamaki joamaki added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jan 8, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 8, 2026
@joamaki
Contributor

joamaki commented Jan 8, 2026

/test

Contributor

@joamaki joamaki left a comment


Let's add a regression test for this.

@imroc
Contributor Author

imroc commented Jan 8, 2026

The BackendInstanceKey uses space ( ) as separator between the name and the priority. We can fix this issue either the way done in the PR (do prefix search and check the name for full match) or we can change BackendInstanceKey.Key() to include the space in the returned key. I'm fine with either.

Could you look into extending the tests so we have a regression test for this? Perhaps as unit test in https://github.com/cilium/cilium/blob/main/pkg/loadbalancer/writer/writer_test.go (TestWriter_WithConflictingSources seems pretty close?) or as a new script test in pkg/loadbalancer/tests/testdata.

Changing BackendInstanceKey.Key() to include the space in the returned key seems more efficient, so I will switch to that.

@imroc
Contributor Author

imroc commented Jan 8, 2026

Let's add a regression test for this.

I've changed BackendInstanceKey.Key() to include the space in the returned key and verified that this resolves the issue.

Since only BackendInstanceKey was modified, I've updated TestBackendInstanceKey in backend_test.go to cover this scenario.
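
A minimal sketch of the separator approach, assuming the key always carries the space after the service name and appends the priority byte only when it is non-zero (the exact encoding in the merged change may differ). instanceKey and instancesOf are hypothetical stand-ins, and the sketch relies on service names never containing a space.

```go
package main

import (
	"bytes"
	"fmt"
)

// instanceKey is a hypothetical, simplified stand-in for BackendInstanceKey.
type instanceKey struct {
	serviceName    string
	sourcePriority uint8
}

// key always writes the space separator after the service name and appends
// the priority byte only when it is non-zero (an assumed encoding).
func (k instanceKey) key() []byte {
	buf := make([]byte, 0, 2+len(k.serviceName))
	buf = append(buf, k.serviceName...)
	buf = append(buf, ' ')
	if k.sourcePriority != 0 {
		buf = append(buf, k.sourcePriority)
	}
	return buf
}

// instancesOf does a plain prefix match on the encoded keys. Because the
// query ends with the separator, "default/foo " is no longer a prefix of
// "default/foo-bar ".
func instancesOf(stored []instanceKey, name string) []instanceKey {
	q := instanceKey{serviceName: name}.key()
	var out []instanceKey
	for _, k := range stored {
		if bytes.HasPrefix(k.key(), q) {
			out = append(out, k)
		}
	}
	return out
}

func main() {
	stored := []instanceKey{
		{serviceName: "default/foo"},
		{serviceName: "default/foo", sourcePriority: 1},
		{serviceName: "default/foo-bar"},
	}
	for _, k := range instancesOf(stored, "default/foo") {
		fmt.Printf("%s priority=%d\n", k.serviceName, k.sourcePriority)
	}
	// prints:
	// default/foo priority=0
	// default/foo priority=1
}
```

Compared with filtering the prefix results by name, this keeps the lookup a single prefix scan, which matches the efficiency argument made earlier in the thread.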

The current `GetInstancesOfService` function returns the instances of all
Services whose names begin with the given name. As a result, removing an
endpoint from Service A's EndpointSlice can cause all requests to Service B
to fail (if the name of Service A is a prefix of the name of Service B).

This patch fixes the matching logic of GetInstancesOfService,
ensuring an exact match for the service.

Fixes: cilium#43619

Signed-off-by: roc <roc@imroc.cc>
@joamaki
Contributor

joamaki commented Jan 8, 2026

/test

@imroc
Contributor Author

imroc commented Jan 8, 2026

FYI: I am an engineer on the TKE (Tencent Kubernetes Engine) team. The original problem was that, after installing Cilium on TKE, adjusting the cluster specifications automatically or manually (which triggers an apiserver rolling update) left almost all Cilium components in the cluster unresponsive and unable to reconnect to the apiserver.

The root cause was eventually traced to this issue. The detailed troubleshooting write-up (translated by AI) is available here: https://imroc.cc/tke/en/networking/cilium/troubleshooting/connect-apiserver-operation-not-permitted

@xtineskim
Member

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 8, 2026
@joestringer joestringer added this pull request to the merge queue Jan 8, 2026
Merged via the queue into cilium:main with commit 3db9377 Jan 8, 2026
78 checks passed
@gandro gandro mentioned this pull request Jan 15, 2026
4 tasks
@gandro gandro added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jan 15, 2026
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Jan 15, 2026
@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026

Labels

backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

BUG: Removing endpoint from Service A cause requests to Service B fail if the name of Service A is the prefix of Service B

6 participants