Skip to content

Fix backend slot index mismatch in LB reconciler#43902

Merged
joestringer merged 1 commit intocilium:mainfrom
Aman-Cool:fix/lb-backend-slot-index
Feb 10, 2026
Merged

Fix backend slot index mismatch in LB reconciler#43902
joestringer merged 1 commit intocilium:mainfrom
Aman-Cool:fix/lb-backend-slot-index

Conversation

@Aman-Cool
Copy link
Copy Markdown
Contributor

@Aman-Cool Aman-Cool commented Jan 21, 2026

Summary

Fix a bug in the load balancer BPF reconciler where backend slots were incorrectly assigned when maintenance backends exist. The code used the loop index (i + 1) instead of the sequential slot counter (slotID) when setting backend slots, causing gaps in slot assignments when maintenance backends are skipped.

Bug: When backends in maintenance state are skipped during reconciliation, subsequent backends are assigned to wrong slots (e.g., slots 1, 3 instead of 1, 2), causing traffic misrouting or drops.

Fix: Use slotID (which correctly tracks sequential slot numbers) instead of i + 1 (loop index).

Release Note

Fixed a bug in service load balancing where backend slot assignments could have gaps when maintenance backends exist, potentially causing traffic misrouting.

Test Plan

  • Verified the fix aligns with existing code intent (debug log at line 944 already uses slotID)
  • The comment at line 940-941 states "the slot ids here are sequential" confirming expected behavior

@Aman-Cool Aman-Cool requested a review from a team as a code owner January 21, 2026 14:25
@maintainer-s-little-helper
Copy link
Copy Markdown

Commit 7239fcd does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 21, 2026
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Jan 21, 2026
@Aman-Cool Aman-Cool force-pushed the fix/lb-backend-slot-index branch from 7239fcd to 282ddbe Compare January 21, 2026 14:27
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Jan 21, 2026
@Aman-Cool Aman-Cool force-pushed the fix/lb-backend-slot-index branch from 282ddbe to 01bceb5 Compare January 22, 2026 11:40
@bersoare
Copy link
Copy Markdown
Contributor

bersoare commented Jan 22, 2026

code makes sense to me. but can i ask you to adjust the commit to be something like loadbalancer: fix backend slot index mismatch in the reconciler and add a commit comment (a brief version of the pr description you put)

edit: somehow i missed the commit comment. you already have it there. but can you adjust the description, still?

@Aman-Cool Aman-Cool force-pushed the fix/lb-backend-slot-index branch from 01bceb5 to 9d7ee34 Compare January 22, 2026 12:03
@Aman-Cool
Copy link
Copy Markdown
Contributor Author

I’ve updated the commit title to loadbalancer: fix backend slot index mismatch in the reconciler and adjusted the commit description to be a brief version of the PR summary.

@bersoare
Copy link
Copy Markdown
Contributor

/test

@bersoare
Copy link
Copy Markdown
Contributor

im looking into these test failures

@bersoare
Copy link
Copy Markdown
Contributor

@Aman-Cool , looks like the test failures are related to #43935
can you rebase from main and push it again?

@Aman-Cool Aman-Cool force-pushed the fix/lb-backend-slot-index branch from 9d7ee34 to 282ddbe Compare January 23, 2026 09:48
@Aman-Cool
Copy link
Copy Markdown
Contributor Author

@bersoare , Rebased on main and force-pushed.

@Aman-Cool Aman-Cool force-pushed the fix/lb-backend-slot-index branch from 282ddbe to f3bdce7 Compare January 24, 2026 10:48
@bersoare
Copy link
Copy Markdown
Contributor

https://github.com/cilium/cilium/actions/runs/21313926203/job/61449143557?pr=43902 < this one failed on connectivity tests. need to take a look at the sysdump and see what's going on.

@Aman-Cool
Copy link
Copy Markdown
Contributor Author

Thanks for flagging this @bersoare, I’ll take a look on my side as well and see if anything stands out.

@bersoare
Copy link
Copy Markdown
Contributor

if you're interested, i encourage you to download the sysdump and take a look at what's there and see if you can spot any signs that the code isnt behaving as you expect. it's a great opportunity to get familiar with how cilium works, in case that is something you'd like to do.

i'll take a look when i get a chance and see if i can find something

@Aman-Cool
Copy link
Copy Markdown
Contributor Author

Thanks for the suggestion! I’ll take a look at the sysdump and see if anything stands out related to this change. Appreciate you taking a look as well.

@bersoare
Copy link
Copy Markdown
Contributor

im seeing what it looks like envoy errors:

2026-01-26T09:07:18.653986561Z time=2026-01-26T09:07:18.652958989Z level=debug source=/go/src/github.com/cilium/cilium/pkg/envoy/embedded_envoy.go:330 msg="[[Tags: \"ConnectionId\":\"336\"] current connecting state: true" module=agent.controlplane.envoy-proxy subsys=envoy-connection threadID=485
2026-01-26T09:07:18.653988915Z time=2026-01-26T09:07:18.652968947Z level=debug source=/go/src/github.com/cilium/cilium/pkg/envoy/embedded_envoy.go:330 msg="[[Tags: \"ConnectionId\":\"336\"] connecting" module=agent.controlplane.envoy-proxy subsys=envoy-client threadID=485
2026-01-26T09:07:18.653991089Z time=2026-01-26T09:07:18.652977804Z level=debug source=/go/src/github.com/cilium/cilium/pkg/envoy/embedded_envoy.go:330 msg="[[Tags: \"ConnectionId\":\"336\"] connecting to [fd00:10:244::d35f]:8080" module=agent.controlplane.envoy-proxy subsys=envoy-connection threadID=485
2026-01-26T09:07:18.653993263Z time=2026-01-26T09:07:18.652987802Z level=debug source=/go/src/github.com/cilium/cilium/pkg/envoy/embedded_envoy.go:330 msg="[[Tags: \"ConnectionId\":\"336\"] immediate connect error: Cannot assign requested address|remote address:[fd00:10:244::d35f]:8080" module=agent.controlplane.envoy-proxy subsys=envoy-connection threadID=485
2026-01-26T09:07:18.654001990Z time=2026-01-26T09:07:18.652997661Z level=debug source=/go/src/github.com/cilium/cilium/pkg/envoy/embedded_envoy.go:330 msg="[per-upstream shouldCreateNewConnection returns false for pending 1 active 0 connecting_and_connected_capacity 1 connecting_capacity 1 ratio 1" module=agent.controlplane.envoy-proxy subsys=envoy-pool threadID=485

not sure if this is the cause yet

@bersoare
Copy link
Copy Markdown
Contributor

/test

Copy link
Copy Markdown
Member

@dylandreimerink dylandreimerink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me. But CI isn't in a good state right now. I recommend rebasing on main to pull in the latest changes that might effect it. Then we can dig into the new results to see if its this PR or something else.

Use slotID instead of loop index when setting backend slots to avoid
gaps when maintenance backends are skipped.

Signed-off-by: Aman-Cool <aman017102007@gmail.com>
@Aman-Cool Aman-Cool force-pushed the fix/lb-backend-slot-index branch from f3bdce7 to 85f1831 Compare February 10, 2026 09:04
@dylandreimerink
Copy link
Copy Markdown
Member

/test

@dylandreimerink dylandreimerink added the kind/bug This is a bug in the Cilium logic. label Feb 10, 2026
@dylandreimerink dylandreimerink added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Feb 10, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Feb 10, 2026
@joestringer joestringer added this pull request to the merge queue Feb 10, 2026
@joestringer joestringer added the needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch label Feb 10, 2026
Merged via the queue into cilium:main with commit 70d7751 Feb 10, 2026
75 of 76 checks passed
@glrf glrf mentioned this pull request Feb 17, 2026
12 tasks
@glrf glrf added backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. and removed needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch labels Feb 17, 2026
@github-actions github-actions bot added backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. and removed backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. labels Feb 19, 2026
@julianwiedmann
Copy link
Copy Markdown
Member

We didn't add a test for this bug, but the code in v1.18 looks like it's also affected (thanks @vipul-21 for the shout!). This feels worth fixing, @Aman-Cool @dylandreimerink any concerns? Adding the backport label for now.

@julianwiedmann julianwiedmann added needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch area/loadbalancing Impacts load-balancing and Kubernetes service implementations labels Mar 20, 2026
@tklauser tklauser mentioned this pull request Mar 24, 2026
4 tasks
@tklauser tklauser added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Mar 24, 2026
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/loadbalancing Impacts load-balancing and Kubernetes service implementations backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. kind/bug This is a bug in the Cilium logic. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants