local-up-cluster kube-proxy terminated error#82413

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from zhlhahaha:kube-proxy-error
Oct 12, 2019

Conversation

@zhlhahaha
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:
When using hack/local-up-cluster.sh to deploy a local cluster, it
failed with the message "kube-proxy terminated unexpectedly", and
kube-proxy.log showed "Failed to retrieve node info: nodes "127.0.0.1" not found".

The root cause is the wrong boot order of the Kubernetes services in
local-up-cluster.sh, namely kube-proxy and the kubelet daemon.

When kube-proxy starts, it looks up its node's information, and that
information is registered by the kubelet daemon. However, in the
shell script, the kube-proxy service started before the kubelet daemon.

This patch changes the boot order so the kubelet daemon starts before
kube-proxy, and checks that the node status is ready before starting
kube-proxy.
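The readiness check described above can be sketched roughly as follows. This is an illustrative sketch only, not the exact patch: the function name `wait_node_ready` and the polling parameters (2-second interval, 30-second total, matching the numbers discussed later in this thread) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll every 2 seconds, up to 30 seconds total,
# until the kubelet has registered the node, then let kube-proxy start.
wait_node_ready() {
  local node="${1:-127.0.0.1}"
  local tries=15 delay=2
  for _ in $(seq 1 "${tries}"); do
    # kubectl exits non-zero until the kubelet registers the node
    if kubectl get node "${node}" >/dev/null 2>&1; then
      echo "node ${node} registered; safe to start kube-proxy"
      return 0
    fi
    sleep "${delay}"
  done
  echo "timed out waiting for node ${node}" >&2
  return 1
}
```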

Which issue(s) this PR fixes:
Fixes #81879

Special notes for your reviewer:
no

Does this PR introduce a user-facing change?:
no

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
no

@k8s-ci-robot
Contributor

@zhlhahaha: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 6, 2019
@k8s-ci-robot
Contributor

Hi @zhlhahaha. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 6, 2019
@zhlhahaha
Contributor Author

/assign @vishh

@BenTheElder
Member

/assign @dims

@lubinszARM

/ok-to-test

@k8s-ci-robot
Contributor

@lubinszARM: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.


In response to this:

/ok-to-test


@dims
Member

dims commented Sep 9, 2019

/ok-to-test
/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 9, 2019
@zhlhahaha
Contributor Author

/retest

1 similar comment
@zhlhahaha
Contributor Author

/retest

@MikeSpreitzer
Member

/test pull-kubernetes-local-e2e

@zhlhahaha
Contributor Author

/retest

@zhlhahaha
Contributor Author

/test pull-kubernetes-local-e2e

@YuikoTakada
Contributor

Thank you for the PR. I'm facing the same issue. Previously this error didn't occur, so there may be some other underlying problem, but this PR seems useful as a workaround :)

@MikeSpreitzer
Member

/retest

@zhlhahaha
Contributor Author

@MikeSpreitzer Hi, I get your idea.
Yes, I tried just swapping the boot order, but that requires a sleep for the kubelet service to start; otherwise kube-proxy may fail to start.
For the pull-kubernetes-local-e2e test, I can only see a "time out" leading to test failure in the log file. I also checked other PRs that changed local-up-cluster.sh, and this test failed there too with the same error message, e.g. #81268.
Those PRs just skipped the pull-kubernetes-local-e2e test. Is there any way to get more information about the test?

@MikeSpreitzer
Member

@zhlhahaha : what exactly do you mean by "It needs to set sleep time for kubelet service start, otherwise kubeproxy may fail to start"? If the only change is to swap the order then there is no sleep. I wonder whether by "may" you mean there is a general worry, or an observed problem. Did you test a change that only swaps the order, does not introduce a wait, and observe that the kube-proxy still erred due to lack of finding its Node?

Regarding debugging the e2e failure, you can find leads in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/testing.md

@zhlhahaha
Contributor Author

> …does not introduce a wait, and observe that the kube-proxy still erred due to lack of finding its Node?

Hi Mike,
Of course, we need to wait for the kubelet daemon to collect node information, and that is why I added a wait function in this PR. A similar function can be found in start_apiserver in local-up-cluster.sh, where we also need to wait for the apiserver to come up.

I do understand your idea, and I agree #81880
is a good solution, as it can help kube-proxy bridge the gap while the kubelet daemon is collecting information.

I do not think the two solutions conflict with each other.
My PR has two purposes:

  1. Establish the correct boot order for the kubelet and kube-proxy.
  2. Give users more insight into the local cluster boot process: the user gets a clear hint when the kube-proxy service fails to start because it cannot get node info.

Thanks for your suggestion on the e2e test failure; it may take some time to learn how to debug it.
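The apiserver wait mentioned above can be sketched as a generic URL poller. This is a hedged sketch: the helper name `wait_for_url` and its parameters are illustrative, not necessarily the script's actual code.

```shell
# Illustrative URL-polling helper, analogous to how local-up-cluster.sh
# waits for the apiserver health endpoint before proceeding.
wait_for_url() {
  local url="$1" tries="${2:-15}" delay="${3:-2}"
  for _ in $(seq 1 "${tries}"); do
    # -k: tolerate the local cluster's self-signed cert; -s/-f: quiet, fail on HTTP errors
    if curl -ksf "${url}" >/dev/null 2>&1; then
      return 0
    fi
    sleep "${delay}"
  done
  return 1
}
```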

@MikeSpreitzer
Member

MikeSpreitzer commented Sep 19, 2019

Yes, I understand the virtue of adding a wait in the local-up-cluster script. But I am concerned by the consistent failures of pull-kubernetes-local-e2e. Could the added wait somehow be causing the failures of pull-kubernetes-local-e2e?

Since kube-proxy has a wait inside itself (note that #81880 only increases the duration of an existing wait), it is not obvious to me that adding an additional wait is necessary. I understand why waiting and reporting on the outcome in the local-up-cluster script is helpful to users. My question is, if we only swap the order and rely on the 5-try wait already in kube-proxy, is that sufficient to make local-up-cluster succeed?

@zhlhahaha
Contributor Author

> Yes, I understand the virtue of adding a wait in the local-up-cluster script. But I am concerned by the consistent failures of pull-kubernetes-local-e2e. Could the added wait somehow be causing the failures of pull-kubernetes-local-e2e?
>
> Since kube-proxy has a wait inside itself (note that #81880 only increases the duration of an existing wait), it is not obvious to me that adding an additional wait is necessary. I understand why waiting and reporting on the outcome in the local-up-cluster script is helpful to users. My question is, if we only swap the order and rely on the 5-try wait already in kube-proxy, is that sufficient to make local-up-cluster succeed?

Hi Mike,
The 5-try wait is not long enough for the kubelet to collect node information, so kube-proxy fails to start; only swapping the order does not work either.
For the wait function, to avoid unnecessary wait time, I check the node status every 2 seconds and continue as soon as the check passes; the total wait time is 30 seconds. I am not sure why the pull-kubernetes-local-e2e test always fails. I need to build a test environment and give it a try.

@zhlhahaha
Contributor Author

/retest

@BenTheElder
Member

#83792
/retest

@BenTheElder
Member

/uncc
/cc @dims

@k8s-ci-robot k8s-ci-robot requested a review from dims October 11, 2019 22:42
@zhlhahaha
Contributor Author

zhlhahaha commented Oct 12, 2019

#83792
/retest

Thanks Ben, I have been stuck here for a long time.

@lubinsz

lubinsz commented Oct 12, 2019

/kind cleanup
/sig testing

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Oct 12, 2019
@dims
Member

dims commented Oct 12, 2019

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 12, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, zhlhahaha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2019
@zhlhahaha
Contributor Author

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 457fa6b into kubernetes:master Oct 12, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 12, 2019
ramineni added a commit to ramineni/openlab-zuul-jobs that referenced this pull request Oct 15, 2019
Recent changes in k8s, kubernetes/kubernetes#82413, check for
KUBELET_HOST when getting node info, which results in an error.
This commit updates the same.
ohsewon pushed a commit to ohsewon/kubernetes that referenced this pull request Oct 16, 2019
local-up-cluster kube-proxy terminated error

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kube-proxy gives up too soon waiting for node registration in hack/local-up-cluster.sh

10 participants