Cilium 1.19 in kvstore identity mode fails when etcd is exposed behind a service #44527

@41ks

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.19.1 and lower than v1.20.0

What happened?

In a cluster (cluster_A) running with identity-allocation-mode set to kvstore, we connect the Cilium agent to an etcd running in a different cluster (cluster_B). The FQDN of the etcd in cluster_B looks like etcd.cluster_A_k8s.svc.cluster_B_fqdn. On cluster_A, the Cilium agents try to connect to the etcd in cluster_B during initialization but get stuck while resolving the FQDN of the etcd endpoint.

We have narrowed the deadlock down to this for loop in the resolve function of the new lbServiceResolver:

cilium/pkg/dial/resolver.go

Lines 157 to 170 in cc9cd28

for !init {
	pending := sr.frontends.PendingInitializers(txn)
	if !slices.ContainsFunc(pending, func(s string) bool { return strings.HasPrefix(s, reflectors.K8sInitializerPrefix) }) {
		break
	}
	select {
	case <-ctx.Done():
		return host
	case <-waitInit:
		init = true
	case <-time.After(100 * time.Millisecond):
	}
	txn = sr.db.ReadTxn()
}
This loop was introduced in PR #42440.

The initialization never seems to complete because this etcd connection is attempted too early in the agent's startup. One fix we have found is to add a deadline to the loop and fall back to returning the host unchanged. The host's DNS resolver then picks it up and resolves it properly. Ideally, this logic should not be used to resolve the etcd host at all.

How can we reproduce the issue?

This can be reproduced by setting up Cilium with identity-allocation-mode: kvstore and kvstore: etcd. The etcd endpoint can be in the format etcd.namespace.svc.<cluster_fqdn>.

Cilium Version

v1.19.1

Kernel Version

Linux 6.8.0-1044-aws #46~22.04.1-Ubuntu SMP Tue Dec 2 18:01:57 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux

Kubernetes Version

v1.34.3

Regression

v1.18.7

Sysdump

No response

Relevant log output

"2026-02-25T12:41:18.161Z","cluster-cni","Establishing connection to kvstore"
"2026-02-25T12:41:18.161Z","cluster-cni","Creating etcd client"
"2026-02-25T12:41:18.162Z","cluster-cni","Connecting to etcd server..."
"2026-02-25T12:41:18.794Z","cluster-cni","Error while getting Cilium status"
"2026-02-25T12:41:24.797Z","cluster-cni","Error while getting Cilium status"
"2026-02-25T12:41:32.798Z","cluster-cni","Error while getting Cilium status"
"2026-02-25T12:41:39.794Z","cluster-cni","Error while getting Cilium status"

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
  • area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
  • area/kvstore: Impacts the KVStore package interactions.
  • area/loadbalancing: Impacts load-balancing and Kubernetes service implementations
  • kind/bug: This is a bug in the Cilium logic.
  • kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
  • needs/triage: This issue requires triaging to establish severity and next steps.
