Cilium 1.19 in kvstore identity mode fails when etcd is exposed behind a service #44527
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.19.1 and lower than v1.20.0
What happened?
In a cluster (cluster_A) running in KVStore identity-allocation-mode, we connect the Cilium agent to an etcd instance running in a different cluster (cluster_B). The FQDN for the etcd in cluster_B is something like etcd.cluster_A_k8s.svc.cluster_B_fqdn. On cluster_A, the Cilium agents try to connect to the etcd in cluster_B during initialization but get stuck while resolving the FQDN of the etcd endpoint.
We have narrowed the deadlock down to this for loop in the resolve function of the new lbServiceResolver:
Lines 157 to 170 in cc9cd28
    for !init {
        pending := sr.frontends.PendingInitializers(txn)
        if !slices.ContainsFunc(pending, func(s string) bool { return strings.HasPrefix(s, reflectors.K8sInitializerPrefix) }) {
            break
        }
        select {
        case <-ctx.Done():
            return host
        case <-waitInit:
            init = true
        case <-time.After(100 * time.Millisecond):
        }
        txn = sr.db.ReadTxn()
    }
It seems the initialization never completes, because this etcd connection is attempted too early in the agent's startup. One fix we have found is to add a deadline to the loop and fall back to returning the host unchanged. The host's DNS resolver then picks it up and resolves it properly. Ideally, this logic should not be used for resolving the etcd host at all.
How can we reproduce the issue?
This can be reproduced by setting up Cilium with identity-allocation-mode: kvstore and kvstore: etcd. The etcd endpoint can be in the format etcd.namespace.svc.<cluster_fqdn>.
Cilium Version
v1.19.1
Kernel Version
Linux 6.8.0-1044-aws #46~22.04.1-Ubuntu SMP Tue Dec 2 18:01:57 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
Kubernetes Version
v1.34.3
Regression
v1.18.7
Sysdump
No response
Relevant log output
"2026-02-25T12:41:18.161Z","cluster-cni","Establishing connection to kvstore"
"2026-02-25T12:41:18.161Z","cluster-cni","Creating etcd client"
"2026-02-25T12:41:18.162Z","cluster-cni","Connecting to etcd server..."
"2026-02-25T12:41:18.794Z","cluster-cni","Error while getting Cilium status"
"2026-02-25T12:41:24.797Z","cluster-cni","Error while getting Cilium status"
"2026-02-25T12:41:32.798Z","cluster-cni","Error while getting Cilium status"
"2026-02-25T12:41:39.794Z","cluster-cni","Error while getting Cilium status"
Anything else?
No response