
clustermesh: remove hostAliases and replace it by a custom dialer #43767

Closed
MrFreezeex wants to merge 4 commits into cilium:main from MrFreezeex:hostaliases-removal

Conversation

@MrFreezeex
Member

@MrFreezeex MrFreezeex commented Jan 14, 2026

See each commit

Fixes #42716

clustermesh: no longer restart components when an IP address (rather than a domain name) is provided to connect to a remote clustermesh-apiserver

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 14, 2026
@MrFreezeex MrFreezeex added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jan 14, 2026
@MrFreezeex
Member Author

/test

@MrFreezeex MrFreezeex force-pushed the hostaliases-removal branch 3 times, most recently from 8a7fc56 to 23a1484 Compare January 14, 2026 18:59
@MrFreezeex
Member Author

/test

@MrFreezeex
Member Author

/test

@MrFreezeex
Member Author

/test

@MrFreezeex MrFreezeex force-pushed the hostaliases-removal branch 2 times, most recently from 8d3723f to b4f67a3 Compare January 16, 2026 12:52
@MrFreezeex
Member Author

/test

@MrFreezeex
Member Author

/test

@MrFreezeex
Member Author

/test

@MrFreezeex
Member Author

/test

@MrFreezeex MrFreezeex requested a review from giorio94 January 16, 2026 18:54
@MrFreezeex MrFreezeex marked this pull request as ready for review January 16, 2026 18:54
@MrFreezeex MrFreezeex requested review from a team as code owners January 16, 2026 18:54
@MrFreezeex MrFreezeex requested a review from gandro January 16, 2026 18:54
@MrFreezeex
Member Author

I went with a slightly different approach than what we discussed in the issue: I used a dialer instead of something resolver-related. It looks easier, and that way I can have something really close to what we would have with the normal dialer. Let me know if that looks right and/or if you have other suggestions 👀

// staticDbgDialer wraps a kvstore.EtcdDbgDialer with static host resolution support
// similarly to dial.NewStaticHostDialer used to contact remote clustermesh-apiserver
// but with a kvstore.EtcdDbgDialer interface.
type staticDbgDialer struct {
Member

Since the only underlying method used by this new type is .LookupIP(), it seems that this could have a more generic name as well, and be moved to its own file. Something like staticDialerWithFallback.

Member Author

@MrFreezeex MrFreezeex Feb 8, 2026

Hmm, the "Dbg" refers to the underlying interface EtcdDbgDialer, so I am not sure the name can be much more generic. I did extend it a bit, though, by adding WithFallback as you suggested and using EtcdDbg instead of just Dbg.

Theoretically it could live in pkg/kvstore/etcd_debug.go alongside the other etcd dbg dialer, but importing github.com/cilium/cilium/pkg/dial produces an import cycle there :(. So it is probably simpler to keep it in this package, since it's the only consumer and doesn't produce an import cycle. I could move it to a separate file in this package if you think that's better, though?

@MrFreezeex MrFreezeex marked this pull request as draft January 28, 2026 11:56
@MrFreezeex
Member Author

/test

@MrFreezeex
Member Author

Sorry it took so long to make this ready again; I got busy with other things, but it should finally be ready, and most comments have been addressed!

@MrFreezeex MrFreezeex marked this pull request as ready for review February 8, 2026 17:29
@MrFreezeex
Member Author

/test

Member

@giorio94 giorio94 left a comment

Thanks, and sorry for the delay. A few more comments inline.

@MrFreezeex MrFreezeex marked this pull request as draft February 17, 2026 12:48
@MrFreezeex MrFreezeex force-pushed the hostaliases-removal branch 2 times, most recently from 35b6b0b to 7a6daed Compare February 18, 2026 14:04
@MrFreezeex

This comment was marked as outdated.

@MrFreezeex
Member Author

MrFreezeex commented Feb 18, 2026

/test

EDIT: note that the push right after this comment is a rebase to fix some conflicts in the CI and in the Helm template clustermesh-apiserver/deployment.yaml.

This commit adds a Cilium config struct parsed from the etcd config used to
connect to each remote clustermesh-apiserver. These fields are ignored by etcd,
which ignores any unknown fields (like most Go programs do), and will be
used in a future commit to replace the host aliases on Pods, which force
restarts when changed.

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
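
To illustrate the "unknown fields are ignored" point from the commit message above, here is a minimal stdlib-only sketch. It uses JSON as a stand-in for the real etcd config format, and the struct names, the shape of the `cilium-host-aliases` value (hostname mapped to a list of IPs), and the endpoint are all assumptions made for the example, not the definitions from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// etcdView stands in for the fields an etcd client would read.
// Hypothetical type, for illustration only.
type etcdView struct {
	Endpoints []string `json:"endpoints"`
}

// ciliumView stands in for the extra Cilium-only fields. The value shape
// (hostname -> list of IPs) is an assumption, not taken from the PR.
type ciliumView struct {
	HostAliases map[string][]string `json:"cilium-host-aliases"`
}

// parseBoth decodes the same document twice. Each decode silently skips
// the keys that its target struct has no field for.
func parseBoth(doc []byte) (etcdView, ciliumView, error) {
	var e etcdView
	var c ciliumView
	if err := json.Unmarshal(doc, &e); err != nil {
		return e, c, err
	}
	if err := json.Unmarshal(doc, &c); err != nil {
		return e, c, err
	}
	return e, c, nil
}

func main() {
	doc := []byte(`{
		"endpoints": ["https://remote.mesh.example:2379"],
		"cilium-host-aliases": {"remote.mesh.example": ["192.0.2.10", "192.0.2.11"]}
	}`)
	e, c, err := parseBoth(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(e.Endpoints, c.HostAliases)
}
```

The same file can therefore carry both sets of fields, with each consumer reading only the part it knows about.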
Add a new static dialer and wire it into the clustermesh code with the
config that was added in the previous commit.

This new static dialer attempts to stay very close to net.Dialer, with very
similar code and partially the same behavior (it has to be a bit
different since we don't resolve anything). It only matches a single
hostname and falls back to another dialer otherwise, so that we can just chain
dialers like the ones we are already using.

This dialer will attempt every IP, and if the context has a deadline
set it applies partial deadline distribution logic, which essentially boils
down to dividing the time remaining until the original context's deadline by
the number of IPs still to be attempted. This logic is taken directly from
net.Dialer. Note that, to my knowledge, we don't set any deadline
in the ClusterMesh code though.

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
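
The behavior described in the commit message above (match one hostname, try every IP, split the remaining deadline, otherwise fall back) can be sketched roughly as follows. This is not the code from the PR: `dialFunc`, `perAttemptTimeout`, `newStaticDialer`, and the example hostname are hypothetical names, and the 2-second floor per attempt is an assumption modeled on how net.Dialer splits its deadline.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialFunc has the same signature as (*net.Dialer).DialContext.
type dialFunc func(ctx context.Context, network, addr string) (net.Conn, error)

// perAttemptTimeout divides the remaining time by the number of addresses
// still to try, with a 2s floor per attempt while enough time remains.
func perAttemptTimeout(remaining time.Duration, addrsRemaining int) time.Duration {
	const saneMinimum = 2 * time.Second
	timeout := remaining / time.Duration(addrsRemaining)
	if timeout < saneMinimum {
		if remaining < saneMinimum {
			return remaining
		}
		return saneMinimum
	}
	return timeout
}

// newStaticDialer dials the given IPs in order when the target host equals
// hostname, and delegates every other address to fallback.
func newStaticDialer(hostname string, ips []string, dial, fallback dialFunc) dialFunc {
	return func(ctx context.Context, network, addr string) (net.Conn, error) {
		host, port, err := net.SplitHostPort(addr)
		if err != nil || host != hostname || len(ips) == 0 {
			return fallback(ctx, network, addr)
		}
		var errs []error
		for i, ip := range ips {
			attemptCtx, cancel := ctx, context.CancelFunc(func() {})
			if deadline, ok := ctx.Deadline(); ok {
				remaining := time.Until(deadline)
				if remaining <= 0 {
					errs = append(errs, context.DeadlineExceeded)
					break
				}
				attemptCtx, cancel = context.WithTimeout(ctx, perAttemptTimeout(remaining, len(ips)-i))
			}
			conn, err := dial(attemptCtx, network, net.JoinHostPort(ip, port))
			cancel() // does not affect an already established connection
			if err == nil {
				return conn, nil
			}
			errs = append(errs, err)
		}
		return nil, errors.Join(errs...)
	}
}

func main() {
	// A local listener stands in for the remote clustermesh-apiserver.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	_, port, _ := net.SplitHostPort(ln.Addr().String())

	var d net.Dialer
	dial := newStaticDialer("remote.mesh.example", []string{"127.0.0.1"}, d.DialContext, d.DialContext)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	conn, err := dial(ctx, "tcp", net.JoinHostPort("remote.mesh.example", port))
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```

Because the returned function has the same signature as the fallback it wraps, several such dialers can be chained, each handling one hostname.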
Add cilium-host-aliases to the etcd config, which should fully replace
hostAliases in Cilium 1.21. We are keeping hostAliases for now to keep
downgrades safe (secrets updated without cilium-clustermesh-apiserver-host and
potentially without hostAliases if we were already removing them here).

We are force-disabling this in the conformance ClusterMesh CI via an
undocumented Helm value to already exercise that this is working properly,
while the conformance upgrade CI stays on the default setting to keep the
existing flow tested!

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
Previous commits added a new static dialer in order to replace
hostAliases. This commit now wires this new static dialer to the
clustermesh troubleshoot command.

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
@MrFreezeex
Member Author

/test

@MrFreezeex
Member Author

MrFreezeex commented Feb 18, 2026

Sorry, I hate to do this to you. I was initially going to just create another test branch on the cilium/cilium repo with the same code, since I am now modifying some CI files to better test this change (I didn't anticipate that, so this PR is open from my fork). But I ended up fighting the CI (more accurately, my own mistakes) here, so I don't think the diffs are very readable anyway, unfortunately. I will therefore close this PR and replace it with one opened directly from the cilium/cilium repo: #44425.

The differences are mostly to incorporate all the nice suggestions from @giorio94's latest reviews.

@giorio94
Member

No problem!


Labels

area/clustermesh Relates to multi-cluster routing functionality in Cilium. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CFP: removing most hostAliases usage in ClusterMesh

4 participants