Webhooks are not guaranteed to start before cache sync is started #1685
Description
While debugging kubernetes-sigs/cluster-api-provider-aws#2834, we spotted the following behaviour:
If resources are present in the API server/etcd at an older storage version AND --leader-election is set to false when controllers start with a newer CRD version, there is a high probability that the conversion webhooks will not start before cache syncs begin. Because the conversion webhooks are not yet serving and a mutex is held, the start of the webhook server is then delayed indefinitely. On some infrastructures, the probability is close to 90% (e.g. Amazon EC2 t3.large (2 CPU, 8GB RAM)).
When leader election is enabled, the probability drops to 20%. When health checks are configured for the webhooks, enough restarts by the kubelet eventually lead to the conversion webhooks running early enough for cache syncs to occur.
4e548fa was intended to ensure that webhooks start first, but this doesn't seem to have had the desired effect.
We believe the failure occurs less often with leader election enabled because the leader election process delays the call to cm.startLeaderElectionRunnables() until after the election has completed.
Example logs are here: capi-logs.txt
Controller-runtime-version: v0.10.2
Affected consumers: Cluster API v1.0.0