Add etcd-operator seamless integration with TLS#4921
Conversation
|
test-me-please |
There was a problem hiding this comment.
exported function CreateDefaultClient should have comment or be unexported
|
test-me-please |
23c738f to
da2289c
Compare
|
test-me-please |
| } | ||
| } | ||
|
|
||
| k8sclient.Configure("", "/var/lib/cilium/cilium.kubeconfig") |
There was a problem hiding this comment.
This is a leftover, it needs to be fixed as when using cilium kvstore get the cilium CLI command does not have the necessary configuration to contact kubernetes, to retrieve the certificates to contact etcd.
There was a problem hiding this comment.
A few questions:
- Is the kubeconfig always guaranteed to be at this path? I presume yes because we are mounting it as such in the Cilium DS, but this is user-configurable, yes?
- This will always result in
k8sclient.IsEnabled()to return true:
Lines 60 to 66 in db67415
IsEnabled returns true if the KubeconfigPath is non-nil, and you unconditionally set KubeconfigPath above in your call to k8sclient.Configure:
Lines 48 to 58 in db67415
Won't this break non-K8s setups?
There was a problem hiding this comment.
I think the path is returned by /config in https://github.com/cilium/cilium/blob/master/api/v1/models/daemon_configuration_status.go#L33:2 (I noticed this being printed in https://github.com/cilium/cilium/blob/master/cilium/cmd/config.go#L80:59)
That might simplify things a bit?
| return nil, err | ||
| } | ||
| return &tlsData{ | ||
| certData: secret.Data[etcdutil.CliCertFile], |
There was a problem hiding this comment.
Let's define constants for this as this is a stable API. No need to add dependencies.
| - "--debug=$(CILIUM_DEBUG)" | ||
| - "-t=vxlan" | ||
| - "--kvstore=etcd" | ||
| - "--kvstore-opt=etcd.k8s.secret=kube-system/kube-system/cilium-etcd-client-tls" |
ianvernon
left a comment
There was a problem hiding this comment.
Still reviewing, have one important observation that I'd like clarification on.
|
|
||
| var ( | ||
| // log is the k8s package logger object. | ||
| log = logging.DefaultLogger.WithField(logfields.LogSubsys, subsysK8s) |
There was a problem hiding this comment.
Should LogSubsys be changed due to this refactoring?
| } | ||
| } | ||
|
|
||
| k8sclient.Configure("", "/var/lib/cilium/cilium.kubeconfig") |
There was a problem hiding this comment.
A few questions:
- Is the kubeconfig always guaranteed to be at this path? I presume yes because we are mounting it as such in the Cilium DS, but this is user-configurable, yes?
- This will always result in
k8sclient.IsEnabled()to return true:
Lines 60 to 66 in db67415
IsEnabled returns true if the KubeconfigPath is non-nil, and you unconditionally set KubeconfigPath above in your call to k8sclient.Configure:
Lines 48 to 58 in db67415
Won't this break non-K8s setups?
| cfgOption = "etcd.config" | ||
| k8sSecretNameOption = "etcd.k8s.secret" | ||
|
|
||
| k8sSecretNamespace = "kube-system" |
There was a problem hiding this comment.
Is this guaranteed to always be kube-system?
| return getOpts(e.opts) | ||
| } | ||
|
|
||
| func getNamespaceName(s string) (string, string, error) { |
There was a problem hiding this comment.
Add a unit test for this function.
| - apiGroups: | ||
| - extensions | ||
| resources: | ||
| - networkpolicies #FIXME remove this when we drop support for k8s NP-beta GH-1202 |
| the label `io.cilium.fixed-identity=kube-dns`. | ||
|
|
||
| ``` | ||
| kubectl label -n kube-system pod $(kubectl get pods -n kube-system | grep kube-dns | cut -f 1 -d " ") io.cilium.fixed-identity=kube-dns |
There was a problem hiding this comment.
Use JSON parsing instead of grep / cut; this has proven brittle for us in the past due to the K8s CLI formatting changes.
| $ kubectl get pods -n kube-system | ||
| NAME READY STATUS RESTARTS AGE | ||
| etcd-operator-69b5bfc669-fvm8c 1/1 Running 0 1d | ||
| kube-dns-7dcc557ddd-tsqjt 3/3 Running 12 1d |
There was a problem hiding this comment.
Should we change this to zero restarts? ;)
There was a problem hiding this comment.
@ianvernon this is actually expected because kube-dns will keep restarting until the user changes the label
There was a problem hiding this comment.
I'd make a note of this then in the "Deploy kube-dns..." step above. As a user who is following the guide, seeing restarts may be worrisome.
8cbe0de to
edded5c
Compare
|
test-me-please |
edded5c to
485726e
Compare
There was a problem hiding this comment.
How do the contents of this commit match the commit title ?
|
@aanm I think that the jsonpath that you ask can be in different ways: To retrieve only one: To retrieve all: |
There was a problem hiding this comment.
That undefined should be a const no?
There was a problem hiding this comment.
and also test for the string being empty?
There was a problem hiding this comment.
This is a copy from another part of the code.
There was a problem hiding this comment.
Should we add this into the contributing guide as a dependency?
There was a problem hiding this comment.
I'd rather we put it into a docker image and have some type of wrapper script which generates stuff within that image so that we don't have to worry about maintaining more dependencies outside of images.
There was a problem hiding this comment.
This is a dependency on the user system.
- We probably don't want to require being able to run containers on the system that is running kubectl.
- There may be some resistance to running a random container image to generate certificates.
We can definitely offer it as an alternative but I would not make it the default.
raybejjani
left a comment
There was a problem hiding this comment.
I couldn't get the readme instructions to work. I built cilium-agent locally, and modified the spec to use that file (mounted in at a specific path).
I could only get the cilium-etcd to start when I deleted cilium-development.yaml, but then cilium-development.yaml seemed to bounce between being ready and not. It also seemed like cilium status was blocked on something sometimes.
There was a problem hiding this comment.
Can source == FromAgentLocal be overwritten by any other source? We don't end up calling UpdateNode in updateK8sNodeTunneling but checking here makes the relationships between sources more clear.
There was a problem hiding this comment.
and also test for the string being empty?
|
Following up on my comment. I tested with a prebuilt image for this PR (in case my hacky way was the problem). I did find that, on minikube, one needs to edit |
There was a problem hiding this comment.
Let's drop this once we remove the support for 1.7
There was a problem hiding this comment.
I'd rather we put it into a docker image and have some type of wrapper script which generates stuff within that image so that we don't have to worry about maintaining more dependencies outside of images.
| $ kubectl get pods -n kube-system | ||
| NAME READY STATUS RESTARTS AGE | ||
| etcd-operator-69b5bfc669-fvm8c 1/1 Running 0 1d | ||
| kube-dns-7dcc557ddd-tsqjt 3/3 Running 12 1d |
There was a problem hiding this comment.
I'd make a note of this then in the "Deploy kube-dns..." step above. As a user who is following the guide, seeing restarts may be worrisome.
There was a problem hiding this comment.
Instead of FIXME create a GH issue and note it here in the format "GH-XXXX"
There was a problem hiding this comment.
This is a copy from another part of the code.
There was a problem hiding this comment.
Let's drop this once we remove the support for 1.7
There was a problem hiding this comment.
It won't help. The etcd-operator will take some time until it creates the CRD. I'll create a file with the CRD and add it to the directory.
There was a problem hiding this comment.
This is a copy from another part of the code.
There was a problem hiding this comment.
This is a copy from another part of the code.
…ator Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
6e90aa6 to
031c9c6
Compare
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: André Martins <andre@cilium.io>
031c9c6 to
022e7ee
Compare
|
test-me-please |
As the previous commit mentions, this does not occur on main, v1.16, and v1.15 due to cilium#29036. However, in v1.14 and v1.13, we need to take special care because the NameManager and SelectorCache lock can be taken while the Endpoint lock is held during Endpoint deletion. Here are the relevant stacktraces regarding the deadlock: ``` 1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*uint32(0x5), false, 43690) sync mutex.go:171 (*Mutex).lockSlow(*Mutex(cilium#1733)) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*RWMutex(0xb0)) fqdn name_manager.go:70 (*NameManager).Lock(0xffffffffffffffff) policy selectorcache.go:964 (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582}) policy l4.go:810 (*L4Filter).removeSelectors(cilium#32582, cilium#29992) policy l4.go:817 (*L4Filter).detach(cilium#719, cilium#29993) policy l4.go:988 L4PolicyMap.Detach(...) policy l4.go:1179 (*L4Policy).Detach(cilium#20318, cilium#1383) policy resolve.go:103 (*selectorPolicy).Detach(...) policy distillery.go:81 (*PolicyCache).delete(cilium#1354, cilium#19354) policy distillery.go:138 (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163) identitymanager manager.go:167 (*IdentityManager).remove(cilium#706, cilium#19354) identitymanager manager.go:147 (*IdentityManager).Remove(cilium#706, cilium#19354) identitymanager manager.go:52 Remove(...) endpoint endpoint.go:1146 (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0}) endpoint endpoint.go:2192 (*Endpoint).Delete(cilium#1883, {0x80, 0xaa}) endpointmanager manager.go:380 (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff}) endpointmanager manager.go:394 (*EndpointManager).RemoveEndpoint(...) cmd endpoint.go:684 (*Daemon).deleteEndpointQuiet(...) cmd endpoint.go:666 (*Daemon).deleteEndpoint(cilium#1155, cilium#1883) cmd endpoint.go:713 (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108}) cmd endpoint.go:770 (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}}) endpoint delete_endpoint_id.go:66 (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3}) middleware router.go:78 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59}) middleware redoc.go:72 Redoc.func1({cilium#666, cilium#2242}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45}) middleware spec.go:46 Spec.func1({cilium#666, cilium#2242}, cilium#4921) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015}) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922) http server.go:2938 serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6) http server.go:2009 (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554}) 8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*, 0x47, cilium#1154) sync mutex.go:171 (*Mutex).lockSlow(cilium#706) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*) identitymanager manager.go:99 (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154) identitymanager manager.go:123 RemoveOldAddNew(...) endpoint policy.go:852 (*Endpoint).SetIdentity(*, cilium#1154, 0) endpoint endpoint.go:1932 (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1) endpoint endpoint.go:1780 (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1) endpoint endpoint.go:1720 (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8) cmd endpoint.go:477 (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *) cmd endpoint.go:542 (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}}) endpoint put_endpoint_id.go:58 (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware router.go:78 NewRouter.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware redoc.go:72 Redoc.func1({cilium#666, *}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45) middleware spec.go:46 Spec.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *) http server.go:2938 serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6) http server.go:2009 (*conn).serve(*, {cilium#673, cilium#1554}) 5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] sync sema.go:77 runtime_SemacquireMutex(cilium#142, 0xe8, *) sync mutex.go:171 (*Mutex).lockSlow(cilium#1733) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(0x68) fqdn name_manager.go:70 (*NameManager).Lock(*) policy selectorcache.go:798 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, *}) api l4.go:284 PortRules.Iterate({*, 1, cilium#546}, *) policy rule.go:624 mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, *) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, *) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(*) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(*, *, *) endpoint bpf.go:589 (*Endpoint).regenerateBPF(*, *) endpoint policy.go:457 (*Endpoint).regenerate(*, *) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(*, *) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*, *) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(*) 1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] semaphore semaphore.go:60 (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766) lock semaphored_mutex.go:30 (*SemaphoredMutex).Lock(...) ipcache ipcache.go:140 (*IPCache).Lock(...) ipcache cidr.go:56 (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0) ipcache cidr.go:103 (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff) cmd identity.go:114 cachingIdentityAllocator.AllocateCIDRsForIPs(...) policy selectorcache.go:509 (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45) policy selectorcache.go:843 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(cilium#27811, {cilium#42936, 0x287, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, cilium#24510}) api l4.go:284 PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741) policy rule.go:624 mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(cilium#1748) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781) endpoint bpf.go:589 (*Endpoint).regenerateBPF(cilium#1748, cilium#27542) endpoint policy.go:457 (*Endpoint).regenerate(cilium#1748, cilium#27542) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*Once(cilium#45414), func(cilium#69)) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(cilium#6023) ``` Generated from pp tool: https://github.com/maruel/panicparse Signed-off-by: Chris Tarazi <chris@isovalent.com>
As the previous commit mentions, this does not occur on main, v1.16, and v1.15 due to cilium#29036. However, in v1.14 and v1.13, we need to take special care because the NameManager and SelectorCache lock can be taken while the Endpoint lock is held during Endpoint deletion. Here are the relevant stacktraces regarding the deadlock: ``` 1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*uint32(0x5), false, 43690) sync mutex.go:171 (*Mutex).lockSlow(*Mutex(cilium#1733)) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*RWMutex(0xb0)) fqdn name_manager.go:70 (*NameManager).Lock(0xffffffffffffffff) policy selectorcache.go:964 (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582}) policy l4.go:810 (*L4Filter).removeSelectors(cilium#32582, cilium#29992) policy l4.go:817 (*L4Filter).detach(cilium#719, cilium#29993) policy l4.go:988 L4PolicyMap.Detach(...) policy l4.go:1179 (*L4Policy).Detach(cilium#20318, cilium#1383) policy resolve.go:103 (*selectorPolicy).Detach(...) policy distillery.go:81 (*PolicyCache).delete(cilium#1354, cilium#19354) policy distillery.go:138 (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163) identitymanager manager.go:167 (*IdentityManager).remove(cilium#706, cilium#19354) identitymanager manager.go:147 (*IdentityManager).Remove(cilium#706, cilium#19354) identitymanager manager.go:52 Remove(...) endpoint endpoint.go:1146 (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0}) endpoint endpoint.go:2192 (*Endpoint).Delete(cilium#1883, {0x80, 0xaa}) endpointmanager manager.go:380 (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff}) endpointmanager manager.go:394 (*EndpointManager).RemoveEndpoint(...) cmd endpoint.go:684 (*Daemon).deleteEndpointQuiet(...) cmd endpoint.go:666 (*Daemon).deleteEndpoint(cilium#1155, cilium#1883) cmd endpoint.go:713 (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108}) cmd endpoint.go:770 (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}}) endpoint delete_endpoint_id.go:66 (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3}) middleware router.go:78 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59}) middleware redoc.go:72 Redoc.func1({cilium#666, cilium#2242}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45}) middleware spec.go:46 Spec.func1({cilium#666, cilium#2242}, cilium#4921) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015}) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922) http server.go:2938 serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6) http server.go:2009 (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554}) 8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*, 0x47, cilium#1154) sync mutex.go:171 (*Mutex).lockSlow(cilium#706) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*) identitymanager manager.go:99 (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154) identitymanager manager.go:123 RemoveOldAddNew(...) endpoint policy.go:852 (*Endpoint).SetIdentity(*, cilium#1154, 0) endpoint endpoint.go:1932 (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1) endpoint endpoint.go:1780 (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1) endpoint endpoint.go:1720 (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8) cmd endpoint.go:477 (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *) cmd endpoint.go:542 (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}}) endpoint put_endpoint_id.go:58 (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware router.go:78 NewRouter.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware redoc.go:72 Redoc.func1({cilium#666, *}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45) middleware spec.go:46 Spec.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *) http server.go:2938 serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6) http server.go:2009 (*conn).serve(*, {cilium#673, cilium#1554}) 5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] sync sema.go:77 runtime_SemacquireMutex(cilium#142, 0xe8, *) sync mutex.go:171 (*Mutex).lockSlow(cilium#1733) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(0x68) fqdn name_manager.go:70 (*NameManager).Lock(*) policy selectorcache.go:798 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, *}) api l4.go:284 PortRules.Iterate({*, 1, cilium#546}, *) policy rule.go:624 mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, *) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, *) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(*) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(*, *, *) endpoint bpf.go:589 (*Endpoint).regenerateBPF(*, *) endpoint policy.go:457 (*Endpoint).regenerate(*, *) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(*, *) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*, *) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(*) 1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] semaphore semaphore.go:60 (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766) lock semaphored_mutex.go:30 (*SemaphoredMutex).Lock(...) ipcache ipcache.go:140 (*IPCache).Lock(...) ipcache cidr.go:56 (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0) ipcache cidr.go:103 (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff) cmd identity.go:114 cachingIdentityAllocator.AllocateCIDRsForIPs(...) policy selectorcache.go:509 (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45) policy selectorcache.go:843 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(cilium#27811, {cilium#42936, 0x287, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, cilium#24510}) api l4.go:284 PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741) policy rule.go:624 mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(cilium#1748) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781) endpoint bpf.go:589 (*Endpoint).regenerateBPF(cilium#1748, cilium#27542) endpoint policy.go:457 (*Endpoint).regenerate(cilium#1748, cilium#27542) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*Once(cilium#45414), func(cilium#69)) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(cilium#6023) ``` Generated from pp tool: https://github.com/maruel/panicparse Signed-off-by: Chris Tarazi <chris@isovalent.com>
As the previous commit mentions, this does not occur on main, v1.16, and v1.15 due to cilium#29036. However, in v1.14 and v1.13, we need to take special care because the NameManager and SelectorCache lock can be taken while the Endpoint lock is held during Endpoint deletion. Here are the relevant stacktraces regarding the deadlock: ``` 1: sync.Mutex.Lock [75 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*uint32(0x5), false, 43690) sync mutex.go:171 (*Mutex).lockSlow(*Mutex(cilium#1733)) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*RWMutex(0xb0)) fqdn name_manager.go:70 (*NameManager).Lock(0xffffffffffffffff) policy selectorcache.go:964 (*SelectorCache).RemoveSelectors(cilium#1088, {cilium#28569, 0xb, 1}, {cilium#643, cilium#32582}) policy l4.go:810 (*L4Filter).removeSelectors(cilium#32582, cilium#29992) policy l4.go:817 (*L4Filter).detach(cilium#719, cilium#29993) policy l4.go:988 L4PolicyMap.Detach(...) policy l4.go:1179 (*L4Policy).Detach(cilium#20318, cilium#1383) policy resolve.go:103 (*selectorPolicy).Detach(...) policy distillery.go:81 (*PolicyCache).delete(cilium#1354, cilium#19354) policy distillery.go:138 (*PolicyCache).LocalEndpointIdentityRemoved(cilium#523, cilium#1163) identitymanager manager.go:167 (*IdentityManager).remove(cilium#706, cilium#19354) identitymanager manager.go:147 (*IdentityManager).Remove(cilium#706, cilium#19354) identitymanager manager.go:52 Remove(...) endpoint endpoint.go:1146 (*Endpoint).leaveLocked(cilium#1883, cilium#12221, {0x30, 0}) endpoint endpoint.go:2192 (*Endpoint).Delete(cilium#1883, {0x80, 0xaa}) endpointmanager manager.go:380 (*EndpointManager).removeEndpoint(cilium#1161, 0, {0xff, 0xff}) endpointmanager manager.go:394 (*EndpointManager).RemoveEndpoint(...) cmd endpoint.go:684 (*Daemon).deleteEndpointQuiet(...) cmd endpoint.go:666 (*Daemon).deleteEndpoint(cilium#1155, cilium#1883) cmd endpoint.go:713 (*Daemon).DeleteEndpoint(cilium#1155, {cilium#27399, cilium#8108}) cmd endpoint.go:770 (*deleteEndpointID).Handle(cilium#21193, {cilium#2452, {cilium#27399, 0x4d}}) endpoint delete_endpoint_id.go:66 (*DeleteEndpointID).ServeHTTP(cilium#5934, {cilium#666, cilium#2242}, cilium#2452) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, cilium#2242}, cilium#2452) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#130), func{cilium#2242, 0x3}) middleware router.go:78 NewRouter.func1({cilium#666, cilium#2242}, cilium#2451) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#718), func{cilium#2242, #59}) middleware redoc.go:72 Redoc.func1({cilium#666, cilium#2242}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#4920), func{cilium#2242, #45}) middleware spec.go:46 Spec.func1({cilium#666, cilium#2242}, cilium#4921) http server.go:2136 HandlerFunc.ServeHTTP(ReadCloser(cilium#10532), func{cilium#2242, cilium#23015}) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, cilium#10533}, cilium#2451) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(cilium#722, {cilium#668, cilium#10533}, cilium#4922) http server.go:2938 serverHandler.ServeHTTP(*Server(cilium#8105), cilium#668, cilium#10533, 0x6) http server.go:2009 (*conn).serve(*conn(cilium#16005), Context{cilium#673, cilium#1554}) 8: sync.Mutex.Lock [74 minutes] [Created by http.(*Server).Serve in goroutine 1699 @ server.go:3086] sync sema.go:77 runtime_SemacquireMutex(*, 0x47, cilium#1154) sync mutex.go:171 (*Mutex).lockSlow(cilium#706) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(*) identitymanager manager.go:99 (*IdentityManager).RemoveOldAddNew(cilium#706, 0, cilium#1154) identitymanager manager.go:123 RemoveOldAddNew(...) endpoint policy.go:852 (*Endpoint).SetIdentity(*, cilium#1154, 0) endpoint endpoint.go:1932 (*Endpoint).identityLabelsChanged(*, {cilium#674, *}, 1) endpoint endpoint.go:1780 (*Endpoint).runIdentityResolver(*, {cilium#674, *}, 1, 1) endpoint endpoint.go:1720 (*Endpoint).UpdateLabels(*, {cilium#674, *}, *, *, 8) cmd endpoint.go:477 (*Daemon).createEndpoint(cilium#1155, {cilium#673, *}, {cilium#683, cilium#1155}, *) cmd endpoint.go:542 (*putEndpointID).Handle(cilium#21192, {*, *, {*, 0xe}}) endpoint put_endpoint_id.go:58 (*PutEndpointID).ServeHTTP(cilium#3961, {cilium#666, *}, *) middleware operation.go:28 (*Context).RoutesHandler.NewOperationExecutor.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware router.go:78 NewRouter.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) middleware redoc.go:72 Redoc.func1({cilium#666, *}, cilium#1251) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, #45) middleware spec.go:46 Spec.func1({cilium#666, *}, *) http server.go:2136 HandlerFunc.ServeHTTP(*, {cilium#666, *}, *) metrics middleware.go:64 (*APIEventTSHelper).ServeHTTP(cilium#1459, {cilium#668, *}, *) api apipanic.go:42 (*APIPanicHandler).ServeHTTP(#49, {cilium#668, *}, *) http server.go:2938 serverHandler.ServeHTTP({cilium#653}, {cilium#668, *}, 6) http server.go:2009 (*conn).serve(*, {cilium#673, cilium#1554}) 5: sync.Mutex.Lock [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] sync sema.go:77 runtime_SemacquireMutex(cilium#142, 0xe8, *) sync mutex.go:171 (*Mutex).lockSlow(cilium#1733) sync mutex.go:90 (*Mutex).Lock(...) sync rwmutex.go:147 (*RWMutex).Lock(0x68) fqdn name_manager.go:70 (*NameManager).Lock(*) policy selectorcache.go:798 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, *}, {{*, 0x4d}, {0, 0}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(*, {*, 4, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, *}, {*, 1, 1}, 0, {cilium#660, *}, {{*, 4}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, *}, #44, {*, 0xa, 0}, *, {cilium#660, *}, {{*, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, *}) api l4.go:284 PortRules.Iterate({*, 1, cilium#546}, *) policy rule.go:624 mergeEgress({cilium#680, *}, *, {*, 1, 1}, 0, {cilium#661, *}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(*, {cilium#680, *}, *, *, *, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({*, *, *}, {cilium#680, *}, *) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, *) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, *) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(*) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(*, *, *) endpoint bpf.go:589 (*Endpoint).regenerateBPF(*, *) endpoint policy.go:457 (*Endpoint).regenerate(*, *) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(*, *) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*, *) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(*) 1: select [75 minutes] [Created by eventqueue.(*EventQueue).Run in goroutine 1482 @ eventqueue.go:229] semaphore semaphore.go:60 (*Weighted).Acquire(cilium#1092, {cilium#671, cilium#722}, cilium#766) lock semaphored_mutex.go:30 (*SemaphoredMutex).Lock(...) ipcache ipcache.go:140 (*IPCache).Lock(...) ipcache cidr.go:56 (*IPCache).AllocateCIDRs(cilium#1316, {0, 0, cilium#2037}, {0, 0, 0}, 0) ipcache cidr.go:103 (*IPCache).AllocateCIDRsForIPs(0, {0, cilium#697, 0}, 0xffffffffffffffff) cmd identity.go:114 cachingIdentityAllocator.AllocateCIDRsForIPs(...) policy selectorcache.go:509 (*SelectorCache).allocateIdentityMappings(cilium#1088, {{0, 0}, {cilium#5036, 0x2b}}, #45) policy selectorcache.go:843 (*SelectorCache).AddFQDNSelector(cilium#1088, {cilium#643, cilium#27811}, {{0, 0}, {cilium#5036, 0x2b}}) policy l4.go:628 (*L4Filter).cacheFQDNSelector(...) policy l4.go:623 (*L4Filter).cacheFQDNSelectors(cilium#27811, {cilium#42936, 0x287, cilium#193}, cilium#536) policy l4.go:725 createL4Filter({cilium#680, cilium#22826}, {cilium#22828, 1, 1}, 0, {cilium#660, cilium#24510}, {{cilium#4487, 3}, ...}, ...) policy l4.go:879 createL4EgressFilter(...) policy rule.go:717 mergeEgressPortProto({cilium#680, cilium#22826}, #44, {cilium#22828, 0xa, #78536}, #79633, {cilium#660, cilium#24510}, {{cilium#4487, ...}, ...}, ...) policy rule.go:672 mergeEgress.func1({cilium#660, cilium#24510}) api l4.go:284 PortRules.Iterate({cilium#24510, 1, cilium#546}, cilium#11741) policy rule.go:624 mergeEgress({cilium#680, cilium#22826}, cilium#18687, {cilium#22828, 1, 1}, 0, {cilium#661, cilium#5624}, {cilium#662, ...}, ...) policy rule.go:753 (*rule).resolveEgressPolicy(cilium#24575, {cilium#680, cilium#22826}, cilium#18687, cilium#29345, cilium#4782, {0, 0, 0}, {0, ...}) policy rules.go:103 ruleSlice.resolveL4EgressPolicy({cilium#10690, 0xb, 5}, {cilium#680, cilium#22826}, cilium#18687) policy repository.go:718 (*Repository).resolvePolicyLocked(cilium#1089, cilium#18461) policy distillery.go:119 (*PolicyCache).updateSelectorPolicy(cilium#1354, cilium#18461) policy distillery.go:153 (*PolicyCache).UpdatePolicy(...) endpoint policy.go:262 (*Endpoint).regeneratePolicy(cilium#1748) endpoint bpf.go:744 (*Endpoint).runPreCompilationSteps(cilium#1748, cilium#27542, cilium#4781) endpoint bpf.go:589 (*Endpoint).regenerateBPF(cilium#1748, cilium#27542) endpoint policy.go:457 (*Endpoint).regenerate(cilium#1748, cilium#27542) endpoint events.go:53 (*EndpointRegenerationEvent).Handle(cilium#18609, cilium#703) eventqueue eventqueue.go:245 (*EventQueue).run.func1() sync once.go:74 (*Once).doSlow(*Once(cilium#45414), func(cilium#69)) sync once.go:65 (*Once).Do(...) eventqueue eventqueue.go:233 (*EventQueue).run(cilium#6023) ``` Generated from pp tool: https://github.com/maruel/panicparse Signed-off-by: Chris Tarazi <chris@isovalent.com>
EDIT: It turns out there's no need to perform code-changes. I'll used our current etcd-configuration map and mount the
cilium-etcd-client-tlssecrete into the Cilium pod.Summary of changes:
Add seamless integration for etcd-operator running with TLS.
A developer cilium.yaml is available in
examples/kubernetes/addons/etcd-operator/cilium-development.yamlwhich contains the necessary options to run etcd-operator.This change is