feat: syncer integ#6919
Conversation
fe8899f to
34f198f
Compare
3fcd5b9 to
c508008
Compare
1851051 to
6604330
Compare
9e2d975 to
34a1907
Compare
df99c9c to
c66b9fc
Compare
Signed-off-by: Alex Pana <8968914+acpana@users.noreply.github.com>
Signed-off-by: Alex Pana <8968914+acpana@users.noreply.github.com>
d7eec8a to
98c0d89
Compare
| @@ -153,7 +157,7 @@ func setUpMultiClusterLease(ctx context.Context, restConfig *rest.Config, scheme | |||
| // Create a new client for the lock to ensure it uses the correct configuration | |||
There was a problem hiding this comment.
AGENT: If Config Connector is running in namespace mode (multiple controller instances), they will all retrieve the exact same cluster-scoped ConfigConnector object (configconnector.core.cnrm.cloud.google.com) and attempt to acquire the exact same MultiClusterLease lock (LeaseName and Namespace).
This will cause all but one namespace controller across the entire cluster to be blocked as followers, effectively breaking namespace mode. Leader election leases must be distinct per KCC manager instance (e.g., by incorporating the manager's target namespace into the lease name).
TAKO: Ideally (at lest for V1) the operator would acquire/renew the lock and the individual managers would validate the lock was held by "them" (same ID). Having said that as long as all the managers in a cluster are using the same ID they should just be updating the lease time and renewing without a problem. One thought on this is given 10K namespaces, we get 10K manager. Renewing the lease every minute, means we are sending 170 QPS writes at the KAS, just for lease renewals...
We should also make sure our testing covers the multiple manager case. We should have a test for namespace mode, 2+ namespaces configured to reconcile and MCL and sync and ensure that 1 cluster is actually active and that the other cluster is seeing syncing in both namespaces.
There was a problem hiding this comment.
Renewing the lease every minute, means we are sending 170 QPS writes at the KAS, just for lease renewals...
I'm curious if this would be something we could live with for V1? I think having each manager do this call is not ideal but keeping the call in the manager layer let's us keep the door open for namespace sharding too but YAGNI!
Another trick I was thinking we could implement was to designate on of the kccmanagers as the "lease leader" (maybe first one to wake up and announce itself to the operator).
We should also make sure our testing covers the multiple manager case. We should have a test for namespace mode, 2+ namespaces configured to reconcile and MCL and sync and ensure that 1 cluster is actually active and that the other cluster is seeing syncing in both namespaces.
I think this is covered in #7053
There was a problem hiding this comment.
I'm curious if this would be something we could live with for V1?
Probably but I would suggest you should have a plan for monitoring and for mitigating. They should probably go in your feature doc. I would say mitigate is probably change the lease period. Ten minutes rather than one gets you down to 17 QPS. Do you have a way to tell?
There was a problem hiding this comment.
Do you have a way to tell?
I think we can leverage KAS audit logs and I can add some more metrics as a follow on PR to look specifically for the syncer_integration.
There was a problem hiding this comment.
Audit isn't always turned on. Also requires a fair amount of parsing code to determine. Metrics seem like a good option. Tricky part with metrics, is that from the client side you need to aggrege a lot of different processes metrics together to get the answer.
98c0d89 to
2869583
Compare
cheftako
left a comment
There was a problem hiding this comment.
Can we make on shutdown not block exit?
2869583 to
8358b9a
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cheftako The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Adds syncer integration in operator and kccmanager
BRIEF Change description
Fixes #
WHY do we need this change?
Special notes for your reviewer:
Does this PR add something which needs to be 'release noted'?
Additional documentation e.g., references, usage docs, etc.:
Intended Milestone
Please indicate the intended milestone.
Tests you have done
make ready-prto ensure this PR is ready for review.