Skip to content

kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'#136014

Merged
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
SataQiu:fix-20260104
Jan 18, 2026
Merged

kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'#136014
k8s-ci-robot merged 1 commit into
kubernetes:masterfrom
SataQiu:fix-20260104

Conversation

@SataQiu

@SataQiu SataQiu commented Jan 4, 2026

Copy link
Copy Markdown
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

kubeadm: verify the etcd pod is running before promoting learner

Which issue(s) this PR is related to:

Fixes #kubernetes/kubeadm#3269

Special notes for your reviewer:

Does this PR introduce a user-facing change?

kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 4, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jan 4, 2026
@k8s-ci-robot k8s-ci-robot added area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 4, 2026
@SataQiu

SataQiu commented Jan 4, 2026

Copy link
Copy Markdown
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 4, 2026
@pacoxu

pacoxu commented Jan 4, 2026

Copy link
Copy Markdown
Member

cc @dlipovetsky
/assign

Comment thread cmd/kubeadm/app/phases/etcd/local.go Outdated
} else {
klog.V(1).Infof("[etcd] Successfully removed learner member %s", etcdPeerAddress)
}
return errors.Wrap(err, "The etcd pod did not start successfully")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like line 183: failed to remove xxx?
The error may be something like: failed to start xxx?
the learner member was removed successfully

Overall LGTM. Some logs or errors are not aligned.

/approve

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pacoxu, SataQiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@neolit123 neolit123 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problems CAPI users are having is that they have old, big, non-defragmented etcd DBs and when they try to kubeadm join new nodes etcd doesn't have enough time to sync the member data and kubeadm timeouts at the Etcd Client timeout we have for member promotion.

this change adds another timeout in between ControlPlaneComponentHealthCheck, but there is no guarantee if the same users will hit this timeout first. for these big DBs they must configure the timeouts. kubeadm will not block forever, obviously.

the change overall seems OK to me, but kubeadm has a history of breaking something related to the etcd logic, when trying to improve something in the etcd logic.

in terms of backport, my vote is +0.
this is somewhere in between a bugfix and feature improvement.

Comment thread cmd/kubeadm/app/phases/etcd/local.go Outdated
Comment thread cmd/kubeadm/app/phases/etcd/local.go Outdated
Comment thread cmd/kubeadm/app/util/apiclient/wait.go Outdated
Comment thread cmd/kubeadm/app/util/apiclient/wait.go Outdated
Comment thread cmd/kubeadm/app/util/apiclient/wait.go Outdated
Comment thread cmd/kubeadm/app/phases/etcd/local.go Outdated
waiter := apiclient.NewKubeWaiter(client, kubeadmapi.GetActiveTimeouts().ControlPlaneComponentHealthCheck.Duration, os.Stdout)
if err := waiter.WaitForStaticPodRunning(nodeName, kubeadmconstants.Etcd); err != nil {
klog.V(1).Infof("[etcd] The etcd pod did not start successfully, removing learner member %s", etcdPeerAddress)
if removeErr := removeLearnerMember(etcdClient, etcdPeerAddress); removeErr != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this cause problems for users who were previously manually doing the remove?

@SataQiu SataQiu Jan 4, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also something I'm unsure about. Given the doubts about this, do you agree that we should just remove removeLearnerMember call and instead leave it to the users to handle? @neolit123

Even if the unstarted etcd member is not removed from etcd cluster, it will not cause any harmful effects and will not affect the subsequent retry attempts to join the node.

@neolit123 neolit123 Jan 4, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the member not being removed presents a problem for CAPI users, because CAPI might try to join a machine with a different IP and not that same machine that previously failed, so the previous member will be stale.

on the other hand users who were doing the manual cleanup might now get an error that no such member exists.

i am in favor of supporting the CAPI case, but just trying to understand if we have significant concerns about the manual case.

kubeadm: verify the etcd pod is running before promoting learner

we should add a sentence about that in the RN

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the new joining node is a fresh machine with the same IP, will that be a problem if we did not remove it. (They have to remove it manually then).

Is there a side effect to remove the stale learner member? The etcd data folder should be clean up as well IIUC.

Leaving it to users to manually is also an option. I prefer to remove it if there is no special concern.

@neolit123 neolit123 Jan 5, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think i agree, the correct way seems to do this is to remove the data dir as well.
which pretty much means we need to call the code from app/cmd/phases/reset/removeetcdmember.go from the join phases, which is doable if it's exposed more like a backend, but a bit sketchy.

maybe, if that is not done correctly the partially synced state can no longer be valid. i don't know if etcd will be smart enough to re-sync the whole DB properly again or can that cause problems.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case we only should be removing the member and not the data.

You can't do this. After you removed a member, you can't reuse the data anymore. We discussed this in etcd community. The data should can be reused from user perspective, but there is a technical restriction for now.

ok, so in that case my original proposal for case 2 is valid.
#136014 (comment)

effectively a failed promotion for various reasons like sync issues, should result in member + it's data removed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#136014 (comment)

effectively a failed promotion for various reasons like sync issues, should result in member + it's data removed.

If sync doesn't finish before the etcd API call timeout for learner promotion, does it make more sense to keep retrying? Otherwise, you will have to start over (re-sync the data) again.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#136014 (comment)
effectively a failed promotion for various reasons like sync issues, should result in member + it's data removed.

If sync doesn't finish before the etcd API call timeout for learner promotion, does it make more sense to keep retrying? Otherwise, you will have to start over (re-sync the data) again.

it's possible to set a custom timeout for the promotion retry, but kubeadm doesn't block indefinitely in any areas.

@SataQiu SataQiu Jan 6, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what is the conclusion? Does Kubeadm still need to handle the cleanup of folders? Case 2 seems rather challenging. It's not easy for us to determine whether the etcd data can be safely deleted. Perhaps it should be left to the user to handle. For example, performing "kubeadm reset" and then rejoining.

@neolit123 neolit123 Jan 6, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, for case 2 we'd need to check for the sync error, and we don't know who it's going to break if we start removing the member data after the pod has started. i checked the CAPI KCP code and they at least make sure to only delete members that exist.

for case 1, i'm not convinced we need to remove the data dir. if the pod never started the data dir will be empty, right?

@SataQiu SataQiu force-pushed the fix-20260104 branch 4 times, most recently from 04bac7f to bc72d12 Compare January 4, 2026 12:46

@neolit123 neolit123 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/hold

@pacoxu any concerns about the removal of members?
#136014 (comment)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 4, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 13, 2026
@HirazawaUi

Copy link
Copy Markdown
Contributor

/lgtm

@neolit123

Copy link
Copy Markdown
Member

#136014 (comment)

@HirazawaUi WDYT, about the above assessment?

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2026
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: 09c55e2b2c63cf5afb83500e32a7b9404e3c38de

@neolit123

Copy link
Copy Markdown
Member

/retest

Comment on lines +287 to +294
func (c *Client) listMembersOnce() (*clientv3.MemberListResponse, error) {
cli, err := c.newEtcdClient(c.Endpoints)
if err != nil {
return nil, err
}
defer func() { _ = cli.Close() }()

ctx, cancel := context.WithTimeout(context.Background(), etcdTimeout)

@HirazawaUi HirazawaUi Jan 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, I just got off work....
There is still an issue here: we added a 2-second timeout for the request cli.MemberList(ctx) in the listMembersOnce method. However, we are still calling c.getMemberStatus(learnerID) -> c.listMembersOnce() on a 500ms cycle. This may cause requests to etcd to be blocked under certain network anomalies, and all responses could be returned together once the network recovers... I think this does not align with our expectations.

@neolit123 neolit123 Jan 15, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, I just got off work....

if it's too late in CN there is no rush to reply.

There is still an issue here: we added a 2-second timeout for the request cli.MemberList(ctx) in the listMembersOnce method. However, we are still calling c.getMemberStatus(learnerID) -> c.listMembersOnce() on a 500ms cycle. This may cause requests to etcd to be blocked under certain network anomalies, and all responses could be returned together once the network recovers... I think this does not align with our expectations.

i believe in many places we have a poll with the 500ms retry interval (e.g. AddMember) but the context timeout for a etcd API call is 2 seconds. i'm not sure what should we do here. is that a refactor for a separate PR?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also noticed this "inversion" (outer interval duration is shorter than inner context deadline) during an earlier review, but I thought it would be out of scope for this PR, because it happens in various places.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure what should we do here. is that a refactor for a separate PR?

Alright, let's do this in a separate PR

@HirazawaUi

Copy link
Copy Markdown
Contributor

/lgtm
/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 18, 2026
@HirazawaUi

Copy link
Copy Markdown
Contributor

/retest

@k8s-ci-robot k8s-ci-robot merged commit 02e6747 into kubernetes:master Jan 18, 2026
13 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Jan 18, 2026
k8s-ci-robot added a commit that referenced this pull request Feb 5, 2026
…#136014-upstream-release-1.35

Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
k8s-ci-robot added a commit that referenced this pull request Feb 5, 2026
…#136014-upstream-release-1.33

Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
k8s-ci-robot added a commit that referenced this pull request Feb 5, 2026
…#136014-upstream-release-1.34

Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
k8s-ci-robot added a commit that referenced this pull request Feb 5, 2026
…#136014-upstream-release-1.32

Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants