
fix: reallocate envoy port on binding failure #42859

Merged
jrajahalme merged 1 commit into cilium:main from inerplat:fix/reallocate-envoy-port
Jan 26, 2026

Conversation

@inerplat
Contributor

@inerplat inerplat commented Nov 18, 2025

Fixes: #42858

This PR adds:

  1. Port binding error detection (isPortBindingError): Detects port binding failures by checking error messages for common indicators like "cannot bind", "address already in use", and "eaddrinuse".

  2. Port reallocation function (AllocateCRDProxyPortWithReallocate): Forces reallocation of a new port by resetting both ProxyPort and rulesPort when forceReallocate is true, ensuring a truly new random port is allocated.

  3. Retry logic (retryWithNewPorts): When a port binding failure is detected:

    • Reallocates a new port for affected listeners
    • Clones only the listeners that need port reallocation (for efficiency)
    • Updates listener addresses with the new port
    • Updates port allocation callbacks
    • Retries the Envoy resource update

  4. Integration (Update method): Integrates the retry logic into the Envoy reconciler's update flow.

Testing

  • Tested manually by reproducing the issue scenario
  • Verified that port reallocation works correctly
  • Confirmed that iptables rules are updated with the new port
  • Ensured backward compatibility (existing functionality unchanged)

Fix: Cilium Ingress now automatically reallocates ports and retries when cilium-envoy fails to bind due to port conflicts.

@inerplat inerplat requested review from a team as code owners November 18, 2025 15:44
@inerplat inerplat requested a review from sayboras November 18, 2025 15:44
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 18, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Nov 18, 2025
@youngnick youngnick self-requested a review November 19, 2025 00:46
@pchaigno pchaigno added release-note/bug This PR fixes an issue in a previous release of Cilium. feature/k8s-ingress labels Nov 24, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Nov 24, 2025
@pchaigno
Member

/test

@maintainer-s-little-helper

Commit 3e33e37 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 24, 2025
@inerplat inerplat force-pushed the fix/reallocate-envoy-port branch from 3e33e37 to 158060f Compare November 24, 2025 16:34
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 24, 2025
@inerplat
Contributor Author

Oops, I missed the test code for the newly added function.

I have re-verified the previously failing test suites:

  • lint-go was executed locally and passed successfully.
  • Integration Tests (ci-integration) were run on my personal fork, and the workflow completed successfully

Could you please re-run the tests to confirm everything is working correctly? @pchaigno

@pchaigno
Member

/test

@inerplat
Contributor Author

@pchaigno
Regarding the failed Cilium E2E Upgrade (ci-e2e-upgrade) test, I verified that it passed successfully on my personal fork using the same commit SHA 158060f.

Successful run: https://github.com/inerplat/cilium/actions/runs/19655926227

The error appears to be a transient failure. Could you please re-run the failed job?

Failed job: https://github.com/cilium/cilium/actions/runs/19645818650/job/56260701695

Thanks!

@jrajahalme
Member

@joamaki Could you take a look, especially if the Listener resource mutation (with the new proxy port) happens in the right place w.r.t. statedb read-only data?

@jrajahalme
Member

/test

@inerplat inerplat force-pushed the fix/reallocate-envoy-port branch from f279633 to de93437 Compare January 6, 2026 13:32
@inerplat
Contributor Author

inerplat commented Jan 6, 2026

@jrajahalme
I’ve rebased my branch to fix the CI failure caused by the missing flag in the upstream, so could you please trigger the tests again?

@aanm
Member

aanm commented Jan 6, 2026

/test

@aanm
Member

aanm commented Jan 14, 2026

/test

Member

@jrajahalme jrajahalme left a comment

Looking good, only small nits left. After addressing these, please squash the commits together to retain cleaner commit history when this is merged (we do not squash automatically on merge).

@inerplat inerplat force-pushed the fix/reallocate-envoy-port branch from de93437 to bf88d8d Compare January 24, 2026 04:28
@inerplat inerplat force-pushed the fix/reallocate-envoy-port branch from bf88d8d to 8194511 Compare January 24, 2026 04:30
@inerplat
Contributor Author

inerplat commented Jan 24, 2026

@jrajahalme Thanks for the review. I have addressed all the comments and squashed the commits as requested.

Regarding the logging concern:

> Also, should not log the error IF the caller is going to log the returned error anyway (did not check if that is the case).

I verified the behavior and confirmed that the returned error is not logged to stdout/stderr by the caller. Instead, I observed that the error is persisted in statedb through the existing error handling mechanism.
Therefore, I decided to keep the warn log when hasDynamicallyAllocatedPorts is true to explicitly indicate that a "port binding failed and a retry is in progress." Detailed root causes can be inspected via statedb (using cilium-dbg).

Here are the verification results from my local reproduction:

Case 1: Port conflict triggers reallocation (Success)
The warning logs appear as expected, and the port is successfully reallocated.

time=2026-01-24T04:26:08.744569461Z level=warn msg="NACK received for versions between the reported version up to the response nonce; waiting for a version update before sending again" module=agent.controlplane.envoy-proxy xdsStreamID=3 xdsClientNode=host~127.0.0.1~no-id~localdomain version=170 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener xdsNonce=171 xdsDetail="Error adding/updating listener(s) kube-system/cilium-ingress/listener: cannot bind '127.0.0.1:17523': Address already in use\n"

time=2026-01-24T04:26:08.744682142Z level=warn msg="Port binding failed, attempting to reallocate ports and retry" module=agent.controlplane.ciliumenvoyconfig error="NACK received: Error adding/updating listener(s) kube-system/cilium-ingress/listener: cannot bind '127.0.0.1:17523': Address already in use\n"

time=2026-01-24T04:26:08.760985838Z level=info msg="Reallocated proxy port due to binding failure" module=agent.controlplane.ciliumenvoyconfig listener=kube-system/cilium-ingress/listener proxyPort=14616

Case 2: Port range exhaustion (Error Persistence)
I simulated a scenario where no ports are available (--proxy-portrange-max=10000, --proxy-portrange-min=10000). The error persists and is correctly reflected in statedb.

Agent Logs:

time=2026-01-24T04:23:46.400139593Z level=warn msg="NACK received for versions between the reported version up to the response nonce; waiting for a version update before sending again" module=agent.controlplane.envoy-proxy xdsStreamID=4 xdsClientNode=host~127.0.0.1~no-id~localdomain version=160 xdsTypeURL=type.googleapis.com/envoy.config.listener.v3.Listener xdsNonce=161 xdsDetail="Error adding/updating listener(s) kube-system/cilium-ingress/listener: cannot bind '127.0.0.1:10000': Address already in use\n"

time=2026-01-24T04:23:46.400433595Z level=warn msg="Port binding failed, attempting to reallocate ports and retry" module=agent.controlplane.ciliumenvoyconfig error="NACK received: Error adding/updating listener(s) kube-system/cilium-ingress/listener: cannot bind '127.0.0.1:10000': Address already in use\n"

StateDB Verification (cilium-dbg):

$ cilium-dbg shell -- db/show envoy-resources
Name                             Listeners                           Endpoints                           References                          Status   Since   Error
# ...
cec:kube-system/cilium-ingress   kube-system/cilium-ingress/listener                                                                     Error    10s     failed to reallocate ports after binding failure: failed to reallocate proxy port for listener kube-system/cilium-ingress/listener: no available proxy ports (original error: NACK received: Error adding/updating listener(s) kube-system/cilium-ingress/listener: cannot bind '127.0.0.1:10000': Address already in use)

$ cilium-dbg statedb | grep failed
    {"ID":{"Module":["agent","controlplane","ciliumenvoyconfig"],"Component":["job-reconcile"]},"Level":"Degraded","Message":"1 error(s)","Error":"failed to reallocate ports after binding failure: failed to reallocate proxy port for listener kube-system/cilium-ingress/listener: no available proxy ports (original error: NACK received: Error adding/updating listener(s) kube-system/cilium-ingress/listener: cannot bind '127.0.0.1:10000': Address already in use\n)","LastOK":"0001-01-01T00:00:00Z","Updated":"2026-01-24T04:19:31.152330937Z","Stopped":"0001-01-01T00:00:00Z","Final":"","Count":165},
    {"Name":{"Origin":"cec","Cluster":"","Namespace":"kube-system","Name":"cilium-ingress"},"Status":{"updated-at":"2026-01-24T04:19:31.150824728Z","error":"failed to reallocate ports after binding failure: failed to reallocate proxy port for listener kube-system/cilium-ingress/listener: no available proxy ports (original error: NACK received: Error adding/updating listener(s) kube-system/cilium-ingress/listener: cannot bind '127.0.0.1:10000': Address already in use\n)","id":543,"kind":"Error"}
# ...

@inerplat inerplat requested a review from jrajahalme January 24, 2026 04:42
Signed-off-by: DH Kim <inerplat@gmail.com>
@inerplat inerplat force-pushed the fix/reallocate-envoy-port branch from 8194511 to eceb8de Compare January 24, 2026 05:00
@jrajahalme
Member

/test

@jrajahalme jrajahalme enabled auto-merge January 26, 2026 10:43
@jrajahalme jrajahalme added this pull request to the merge queue Jan 26, 2026
Merged via the queue into cilium:main with commit dfdefc5 Jan 26, 2026
75 of 76 checks passed

Labels

feature/k8s-ingress kind/community-contribution This was a contribution made by a community member. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cilium Ingress fails to recover from port binding errors

5 participants