operator: add --leader-election-resource-lock-timeout flag#44500
Merged
aanm merged 1 commit into cilium:main on Mar 5, 2026
Conversation
Commit 85ee03c does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin
force-pushed from 85ee03c to fb2016e
force-pushed from 63d4363 to 6710ef4
Contributor (Author)
Closing as AI policy prohibits it.
force-pushed from 990ceec to 603025c
Member
/test
Add a new configurable flag --leader-election-resource-lock-timeout to the cilium-operator that controls the HTTP client timeout used when making API requests to acquire or renew the leader election resource lock (Lease object in Kubernetes).

Problem:

The HTTP timeout for lease lock API calls is derived from the renew deadline as max(1s, renewDeadline/2) by the upstream k8s client-go resourcelock.NewFromKubeconfig() helper. With the default --leader-election-renew-deadline of 10s, this yields a 5s HTTP timeout. Users with high-latency control planes (e.g., worker nodes in a different region than the remote control plane) frequently hit this timeout, causing the operator to fail leader election with errors like:

error retrieving resource lock kube-system/cilium-operator-resource-lock: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Previously, the only workaround was to increase --leader-election-renew-deadline, which also changes the leader election protocol timing semantics beyond just the HTTP timeout.

Solution:

Introduce --leader-election-resource-lock-timeout (type: duration, default: 0) that directly controls the HTTP client timeout for lease lock API requests, independently of the renew deadline. When set to 0 (default), the existing behavior is preserved exactly: timeout = max(1s, renewDeadline/2). When set to a positive duration, that value is used directly as the HTTP client timeout.

Implementation details:

- Added LeaderElectionResourceLockTimeout constant, struct field, and viper binding following the same pattern as the existing --leader-election-lease-duration, --leader-election-renew-deadline, and --leader-election-retry-period flags.
- Replaced the call to resourcelock.NewFromKubeconfig() with a manual resource lock construction using resourcelock.New(). This gives us control over the rest.Config.Timeout value passed to the Kubernetes client used for leader election, while replicating the exact same logic from the upstream helper (shallow copy of kubeconfig, user agent annotation, NewForConfigOrDie).
- The default behavior (timeout=0) faithfully reproduces the upstream formula: timeout = max(1s, renewDeadline/2).

Usage example:

cilium-operator --leader-election-resource-lock-timeout=15s

Files changed:

- operator/option/config.go: constant, struct field, Populate()
- operator/cmd/flags.go: flag registration with BindEnv
- operator/cmd/root.go: manual resource lock creation with timeout

Claude Opus 4.6 was used to assist in the development of this commit.

Fixes: cilium#38144

Signed-off-by: darox <maderdario@gmail.com>
force-pushed from 603025c to a8cae63
nebril approved these changes on Mar 3, 2026
Member
/test
2 similar comments:
Contributor (Author)
/test
Contributor (Author)
/test