
operator: add --leader-election-resource-lock-timeout flag#44500

Merged
aanm merged 1 commit into cilium:main from darox:add-configure-timeout-for-lease-lock-acquisition
Mar 5, 2026

Conversation

@darox
Contributor

@darox darox commented Feb 23, 2026

Add a new configurable flag --leader-election-resource-lock-timeout to
the cilium-operator that controls the HTTP client timeout used when
making API requests to acquire or renew the leader election resource
lock (Lease object in Kubernetes).

Problem:
The HTTP timeout for lease lock API calls is derived from the renew
deadline as max(1s, renewDeadline/2) by the upstream k8s client-go
resourcelock.NewFromKubeconfig() helper. With the default
--leader-election-renew-deadline of 10s, this yields a 5s HTTP timeout.
Users with high-latency control planes (e.g., worker nodes in a
different region than the remote control plane) frequently hit this
timeout, causing the operator to fail leader election with errors like:

error retrieving resource lock kube-system/cilium-operator-resource-lock:
net/http: request canceled while waiting for connection
(Client.Timeout exceeded while awaiting headers)

Previously, the only workaround was to increase
--leader-election-renew-deadline, which also changes the leader election
protocol timing semantics beyond just the HTTP timeout.

Solution:
Introduce --leader-election-resource-lock-timeout (type: duration,
default: 0) that directly controls the HTTP client timeout for lease
lock API requests, independently of the renew deadline.

When set to 0 (default), the existing behavior is preserved exactly:
timeout = max(1s, renewDeadline/2). When set to a positive duration,
that value is used directly as the HTTP client timeout.

Implementation details:

  • Added LeaderElectionResourceLockTimeout constant, struct field, and
    viper binding following the same pattern as the existing
    --leader-election-lease-duration, --leader-election-renew-deadline,
    and --leader-election-retry-period flags.
  • Replaced the call to resourcelock.NewFromKubeconfig() with a manual
    resource lock construction using resourcelock.New(). This gives us
    control over the rest.Config.Timeout value passed to the Kubernetes
    client used for leader election, while replicating the exact same
    logic from the upstream helper (shallow copy of kubeconfig, user
    agent annotation, NewForConfigOrDie).
  • The default behavior (timeout=0) faithfully reproduces the upstream
    formula: timeout = max(1s, renewDeadline/2).

Usage example:
cilium-operator --leader-election-resource-lock-timeout=15s

Files changed:

  • operator/option/config.go: constant, struct field, Populate()
  • operator/cmd/flags.go: flag registration with BindEnv
  • operator/cmd/root.go: manual resource lock creation with timeout

@maintainer-s-little-helper

Commit 85ee03c does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper bot added the dont-merge/needs-sign-off (the author needs to add sign-off to their commits before merge) and dont-merge/needs-release-note-label (the author needs to describe the release impact of these changes) labels Feb 23, 2026
@darox darox force-pushed the add-configure-timeout-for-lease-lock-acquisition branch from 85ee03c to fb2016e on February 23, 2026 14:19
@maintainer-s-little-helper bot removed the dont-merge/needs-sign-off (the author needs to add sign-off to their commits before merge) label Feb 23, 2026
@darox darox force-pushed the add-configure-timeout-for-lease-lock-acquisition branch 3 times, most recently from 63d4363 to 6710ef4 on February 23, 2026 15:23
@darox
Contributor Author

darox commented Feb 23, 2026

Closing as AI policy prohibits it.

@darox darox closed this Feb 23, 2026
@darox darox reopened this Feb 23, 2026
@darox darox force-pushed the add-configure-timeout-for-lease-lock-acquisition branch 2 times, most recently from 990ceec to 603025c on February 26, 2026 14:15
@darox darox marked this pull request as ready for review February 26, 2026 16:50
@darox darox requested a review from a team as a code owner February 26, 2026 16:50
@darox darox requested a review from nebril February 26, 2026 16:50
@sayboras
Member

/test

The commit message repeats the PR description above and ends with the trailers:

  Claude Opus 4.6 was used to assist in the development of this commit.

  Fixes: cilium#38144
  Signed-off-by: darox <maderdario@gmail.com>
@darox darox force-pushed the add-configure-timeout-for-lease-lock-acquisition branch from 603025c to a8cae63 on March 2, 2026 12:11
@aanm aanm added the release-note/minor (this PR changes functionality that users may find relevant to operating Cilium) label Mar 4, 2026
@maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label (the author needs to describe the release impact of these changes) label Mar 4, 2026
@aanm aanm enabled auto-merge March 4, 2026 09:25
@aanm
Member

aanm commented Mar 4, 2026

/test

@darox
Contributor Author

darox commented Mar 5, 2026

/test

@darox
Contributor Author

darox commented Mar 5, 2026

/test

@aanm aanm added this pull request to the merge queue Mar 5, 2026
@maintainer-s-little-helper bot added the ready-to-merge (this PR has passed all tests and received consensus from code owners to merge) label Mar 5, 2026
Merged via the queue into cilium:main with commit be7a8ec Mar 5, 2026
78 checks passed