kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'#136014
Conversation
|
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/hold |
|
cc @dlipovetsky |
| } else { | ||
| klog.V(1).Infof("[etcd] Successfully removed learner member %s", etcdPeerAddress) | ||
| } | ||
| return errors.Wrap(err, "The etcd pod did not start successfully") |
There was a problem hiding this comment.
Like line 183: failed to remove xxx?
The error may be something like: failed to start xxx?
the learner member was removed successfully
Overall LGTM. Some logs or errors are not aligned.
/approve
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pacoxu, SataQiu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
neolit123
left a comment
There was a problem hiding this comment.
the problems CAPI users are having is that they have old, big, non-defragmented etcd DBs and when they try to kubeadm join new nodes etcd doesn't have enough time to sync the member data and kubeadm timeouts at the Etcd Client timeout we have for member promotion.
this change adds another timeout in between ControlPlaneComponentHealthCheck, but there is no guarantee if the same users will hit this timeout first. for these big DBs they must configure the timeouts. kubeadm will not block forever, obviously.
the change overall seems OK to me, but kubeadm has a history of breaking something related to the etcd logic, when trying to improve something in the etcd logic.
in terms of backport, my vote is +0.
this is somewhere in between a bugfix and feature improvement.
| waiter := apiclient.NewKubeWaiter(client, kubeadmapi.GetActiveTimeouts().ControlPlaneComponentHealthCheck.Duration, os.Stdout) | ||
| if err := waiter.WaitForStaticPodRunning(nodeName, kubeadmconstants.Etcd); err != nil { | ||
| klog.V(1).Infof("[etcd] The etcd pod did not start successfully, removing learner member %s", etcdPeerAddress) | ||
| if removeErr := removeLearnerMember(etcdClient, etcdPeerAddress); removeErr != nil { |
There was a problem hiding this comment.
can this cause problems for users who were previously manually doing the remove?
There was a problem hiding this comment.
This is also something I'm unsure about. Given the doubts about this, do you agree that we should just remove removeLearnerMember call and instead leave it to the users to handle? @neolit123
Even if the unstarted etcd member is not removed from etcd cluster, it will not cause any harmful effects and will not affect the subsequent retry attempts to join the node.
There was a problem hiding this comment.
i think the member not being removed presents a problem for CAPI users, because CAPI might try to join a machine with a different IP and not that same machine that previously failed, so the previous member will be stale.
on the other hand users who were doing the manual cleanup might now get an error that no such member exists.
i am in favor of supporting the CAPI case, but just trying to understand if we have significant concerns about the manual case.
kubeadm: verify the etcd pod is running before promoting learner
we should add a sentence about that in the RN
There was a problem hiding this comment.
If the new joining node is a fresh machine with the same IP, will that be a problem if we did not remove it. (They have to remove it manually then).
Is there a side effect to remove the stale learner member? The etcd data folder should be clean up as well IIUC.
Leaving it to users to manually is also an option. I prefer to remove it if there is no special concern.
There was a problem hiding this comment.
i think i agree, the correct way seems to do this is to remove the data dir as well.
which pretty much means we need to call the code from app/cmd/phases/reset/removeetcdmember.go from the join phases, which is doable if it's exposed more like a backend, but a bit sketchy.
maybe, if that is not done correctly the partially synced state can no longer be valid. i don't know if etcd will be smart enough to re-sync the whole DB properly again or can that cause problems.
There was a problem hiding this comment.
in that case we only should be removing the member and not the data.
You can't do this. After you removed a member, you can't reuse the data anymore. We discussed this in etcd community. The data should can be reused from user perspective, but there is a technical restriction for now.
ok, so in that case my original proposal for case 2 is valid.
#136014 (comment)
effectively a failed promotion for various reasons like sync issues, should result in member + it's data removed.
There was a problem hiding this comment.
effectively a failed promotion for various reasons like sync issues, should result in member + it's data removed.
If sync doesn't finish before the etcd API call timeout for learner promotion, does it make more sense to keep retrying? Otherwise, you will have to start over (re-sync the data) again.
There was a problem hiding this comment.
#136014 (comment)
effectively a failed promotion for various reasons like sync issues, should result in member + it's data removed.If
sync doesn't finish before the etcd API call timeout for learner promotion, does it make more sense to keep retrying? Otherwise, you will have to start over (re-sync the data) again.
it's possible to set a custom timeout for the promotion retry, but kubeadm doesn't block indefinitely in any areas.
There was a problem hiding this comment.
So, what is the conclusion? Does Kubeadm still need to handle the cleanup of folders? Case 2 seems rather challenging. It's not easy for us to determine whether the etcd data can be safely deleted. Perhaps it should be left to the user to handle. For example, performing "kubeadm reset" and then rejoining.
There was a problem hiding this comment.
agreed, for case 2 we'd need to check for the sync error, and we don't know who it's going to break if we start removing the member data after the pod has started. i checked the CAPI KCP code and they at least make sure to only delete members that exist.
for case 1, i'm not convinced we need to remove the data dir. if the pod never started the data dir will be empty, right?
04bac7f to
bc72d12
Compare
There was a problem hiding this comment.
/lgtm
/hold
@pacoxu any concerns about the removal of members?
#136014 (comment)
|
/lgtm |
|
@HirazawaUi WDYT, about the above assessment? |
…ng during 'kubeadm join'
|
LGTM label has been added. DetailsGit tree hash: 09c55e2b2c63cf5afb83500e32a7b9404e3c38de |
|
/retest |
| func (c *Client) listMembersOnce() (*clientv3.MemberListResponse, error) { | ||
| cli, err := c.newEtcdClient(c.Endpoints) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| defer func() { _ = cli.Close() }() | ||
|
|
||
| ctx, cancel := context.WithTimeout(context.Background(), etcdTimeout) |
There was a problem hiding this comment.
Apologies, I just got off work....
There is still an issue here: we added a 2-second timeout for the request cli.MemberList(ctx) in the listMembersOnce method. However, we are still calling c.getMemberStatus(learnerID) -> c.listMembersOnce() on a 500ms cycle. This may cause requests to etcd to be blocked under certain network anomalies, and all responses could be returned together once the network recovers... I think this does not align with our expectations.
There was a problem hiding this comment.
Apologies, I just got off work....
if it's too late in CN there is no rush to reply.
There is still an issue here: we added a 2-second timeout for the request cli.MemberList(ctx) in the listMembersOnce method. However, we are still calling c.getMemberStatus(learnerID) -> c.listMembersOnce() on a 500ms cycle. This may cause requests to etcd to be blocked under certain network anomalies, and all responses could be returned together once the network recovers... I think this does not align with our expectations.
i believe in many places we have a poll with the 500ms retry interval (e.g. AddMember) but the context timeout for a etcd API call is 2 seconds. i'm not sure what should we do here. is that a refactor for a separate PR?
There was a problem hiding this comment.
I also noticed this "inversion" (outer interval duration is shorter than inner context deadline) during an earlier review, but I thought it would be out of scope for this PR, because it happens in various places.
There was a problem hiding this comment.
i'm not sure what should we do here. is that a refactor for a separate PR?
Alright, let's do this in a separate PR
|
/lgtm |
|
/retest |
…#136014-upstream-release-1.35 Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
…#136014-upstream-release-1.33 Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
…#136014-upstream-release-1.34 Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
…#136014-upstream-release-1.32 Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
What type of PR is this?
/kind bug
What this PR does / why we need it:
kubeadm: verify the etcd pod is running before promoting learner
Which issue(s) this PR is related to:
Fixes #kubernetes/kubeadm#3269
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: