Skip to content

helm: let clustermesh-apiserver etcd listen on all IP addresses #42818

Merged
giorio94 merged 1 commit intocilium:mainfrom
giorio94:mio/cmapisrv-addrs
Nov 18, 2025
Merged

helm: let clustermesh-apiserver etcd listen on all IP addresses #42818
giorio94 merged 1 commit intocilium:mainfrom
giorio94:mio/cmapisrv-addrs

Conversation

@giorio94
Copy link
Copy Markdown
Member

@giorio94 giorio94 commented Nov 17, 2025

Currently, the sidecar etcd instance of the clustermesh-apiserver is explicitly configured to listen for client connections on localhost and the primary Pod IP (as reported by status.podIP). To this end, the Pod IP was enclosed in brackets, as that historically used to work properly both in case of IPv4 and IPv6 addresses. However, this is no longer the case starting from etcd v3.6.6, which is compiled using 1.24.10, in turn implementing a stricter validation forbidding to enclose IPv4 addresses in brackets [1,2], as not RFC compliant.

In order to bypass this problem, let's change the lister-client-urls parameter to listen on all IP addresses; this has the advantage of not requiring the brackets enclosing, and also enables listening on the secondary Pod IP address, if any. We configure 0.0.0.0 (rather than [::]) as the go implementation of Listen already automatically selects the IP family as appropriate [3]. Similarly, we also change listen-metrics-urls to listen on all IP addresses, hence including both localhost and possible secondary Pod IPs. Again, this is not expected to have any negative consequence.

Slightly trickier is the advertise-client-urls setting, which is supposed to advertise the addresses that clients can use to reach this replica. Yet, its value doesn't strictly matter in this context, given that the etcd cluster is composed of a single replica, and clients directly connect to it via the address provided as part of their configuration. Additionally, the provided values appear to be used only when auto synchronization is enabled [4,5], which is not in this case. Still, even advertising the pod IP was not fully correct there, as not directly reachable from remote clusters. Hence, let's just configure localhost instead, as a placeholder, given that the parameter is always required to be supplied.

Related: #42804 (comment)

[1]: https://go-review.googlesource.com/c/go/+/709838
[2]: https://go-review.googlesource.com/c/go/+/712142
[3]: https://cs.opensource.google/go/go/+/master:src/net/ipsock_posix.go;l=83-157?q=ipsock_posi
[4]:

func (c *Client) autoSync() {
if c.cfg.AutoSyncInterval == time.Duration(0) {
return
}
for {
select {
case <-c.ctx.Done():
return
case <-time.After(c.cfg.AutoSyncInterval):
ctx, cancel := context.WithTimeout(c.ctx, 5*time.Second)
err := c.Sync(ctx)
cancel()
if err != nil && !errors.Is(err, c.ctx.Err()) {
c.lg.Info("Auto sync endpoints failed.", zap.Error(err))
}
}
}
}

[5]:
// Sync synchronizes client's endpoints with the known endpoints from the etcd membership.
func (c *Client) Sync(ctx context.Context) error {
mresp, err := c.MemberList(ctx)
if err != nil {
return err
}
var eps []string
for _, m := range mresp.Members {
if len(m.Name) != 0 && !m.IsLearner {
eps = append(eps, m.ClientURLs...)
}
}
// The linearizable `MemberList` returned successfully, so the
// endpoints shouldn't be empty.
verify.Verify(func() {
if len(eps) == 0 {
panic("empty endpoints returned from etcd cluster")
}
})
c.SetEndpoints(eps...)
c.lg.Debug("set etcd endpoints by autoSync", zap.Strings("endpoints", eps))
return nil
}

Change the sidecar etcd instance of the Cluster Mesh API Server listen on all IP addresses

@maintainer-s-little-helper maintainer-s-little-helper Bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 17, 2025
@giorio94
Copy link
Copy Markdown
Member Author

/ci-clustermesh

Currently, the sidecar etcd instance of the clustermesh-apiserver is
explicitly configured to listen for client connections on localhost
and the primary Pod IP (as reported by `status.podIP`). To this end,
the Pod IP was enclosed in brackets, as that historically used to
work properly both in case of IPv4 and IPv6 addresses. However,
this is no longer the case starting from etcd v3.6.6, which is
compiled using 1.24.10, in turn implementing a stricter validation
forbidding to enclose IPv4 addresses in brackets [1,2], as not
RFC compliant.

In order to bypass this problem, let's change the `lister-client-urls`
parameter to listen on all IP addresses; this has the advantage of
not requiring the brackets enclosing, and also enables listening on
the secondary Pod IP address, if any. We configure 0.0.0.0 (rather
than [::]) as the go implementation of Listen already automatically
selects the IP family as appropriate [3]. Similarly, we also change
`listen-metrics-urls` to listen on all IP addresses, hence including
both localhost and possible secondary Pod IPs. Again, this is not
expected to have any negative consequence.

Slightly trickier is the `advertise-client-urls` setting, which is
supposed to advertise the addresses that clients can use to reach
this replica. Yet, its value doesn't strictly matter in this context,
given that the etcd cluster is composed of a single replica, and
clients directly connect to it via the address provided as part of
their configuration. Additionally, the provided values appear to
be used only when auto synchronization is enabled [4,5], which is
not in this case. Still, even advertising the pod IP was not fully
correct there, as not directly reachable from remote clusters.
Hence, let's just configure localhost instead, as a placeholder,
given that the parameter is always required to be supplied.

[1]: https://go-review.googlesource.com/c/go/+/709838
[2]: https://go-review.googlesource.com/c/go/+/712142
[3]: https://cs.opensource.google/go/go/+/master:src/net/ipsock_posix.go;l=83-157?q=ipsock_posi
[4]: https://github.com/cilium/cilium/blob/f2686609331670df72056a485640bc4b02f5343d/vendor/go.etcd.io/etcd/client/v3/client.go#L210-L228
[5]: https://github.com/cilium/cilium/blob/f2686609331670df72056a485640bc4b02f5343d/vendor/go.etcd.io/etcd/client/v3/client.go#L186-L208

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 changed the title helm: let clustermesh-apiserver etcd listen to all IP addresses helm: let clustermesh-apiserver etcd listen on all IP addresses Nov 17, 2025
@giorio94 giorio94 added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. labels Nov 17, 2025
@maintainer-s-little-helper maintainer-s-little-helper Bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 17, 2025
@giorio94
Copy link
Copy Markdown
Member Author

/test

@giorio94 giorio94 requested a review from marseel November 17, 2025 13:21
@giorio94 giorio94 marked this pull request as ready for review November 17, 2025 13:21
@giorio94 giorio94 requested review from a team as code owners November 17, 2025 13:21
@giorio94 giorio94 requested a review from squeed November 17, 2025 13:21
Copy link
Copy Markdown
Member

@marseel marseel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking into account that we will be facing the same issue of upgrading the Etcd on other minor versions (1.18,1.17,1.16) I would like to propose:

  • Backport this PR to 1.18 to ensure that we use latest Etcd version on 1.18
  • It will allow us to build up confidence that this change does not break anything 1.18
  • Backport it in two-three months to 1.17 and 1.16

This will ensure that we are keeping Etcd version up to date on other minor versions as well. In case of any CVE, we would need to upgrade Etcd on other minor versions anyway. This sounds like reasonable trade off to make. WDYT?

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Nov 18, 2025
@giorio94
Copy link
Copy Markdown
Member Author

Taking into account that we will be facing the same issue of upgrading the Etcd on other minor versions (1.18,1.17,1.16) I would like to propose:

* Backport this PR to 1.18 to ensure that we use latest Etcd version on 1.18

* It will allow us to build up confidence that this change does not break anything 1.18

* Backport it in two-three months to 1.17 and 1.16

This will ensure that we are keeping Etcd version up to date on other minor versions as well. In case of any CVE, we would need to upgrade Etcd on other minor versions anyway. This sounds like reasonable trade off to make. WDYT?

As chatted offline, this plan makes total sense to me. Thanks!

@giorio94 giorio94 added this pull request to the merge queue Nov 18, 2025
@giorio94 giorio94 added the needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch label Nov 18, 2025
Merged via the queue into cilium:main with commit 14c2403 Nov 18, 2025
82 of 83 checks passed
@giorio94 giorio94 deleted the mio/cmapisrv-addrs branch November 18, 2025 10:42
giorio94 added a commit to giorio94/cilium that referenced this pull request Nov 18, 2025
Temporarily disable all etcd updates on v1.16 and v1.17, because newer
versions are compiled with go 1.24.8 or later, which includes a breaking
change in IP addresses parsing affecting the clustermesh-apiserver usage.
See cilium#42818 for more details and the fix, which is intended
to be backported once we built up sufficient confidence.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Copy Markdown
Member Author

Backport it in two-three months to 1.17 and 1.16

PR updating the renovate configuration to not bump etcd in v1.16 and v1.17: #42847

github-merge-queue Bot pushed a commit that referenced this pull request Nov 18, 2025
Temporarily disable all etcd updates on v1.16 and v1.17, because newer
versions are compiled with go 1.24.8 or later, which includes a breaking
change in IP addresses parsing affecting the clustermesh-apiserver usage.
See #42818 for more details and the fix, which is intended
to be backported once we built up sufficient confidence.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@mhofstetter mhofstetter mentioned this pull request Nov 24, 2025
13 tasks
@mhofstetter mhofstetter added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Nov 24, 2025
@github-actions github-actions Bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Nov 24, 2025
@cilium-release-bot cilium-release-bot Bot moved this to Released in cilium v1.19.0 Feb 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/clustermesh Relates to multi-cluster routing functionality in Cilium. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

5 participants