local-up-cluster kube-proxy terminated error#82413

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from zhlhahaha:kube-proxy-error
Oct 12, 2019

Conversation

@zhlhahaha
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:
When using hack/local-up-cluster.sh to deploy a local cluster, it
failed with the message "kube-proxy terminated unexpectedly", and
kube-proxy.log showed "Failed to retrieve node info: nodes "127.0.0.1" not found".

The root cause is the wrong boot order of the Kubernetes services in
local-up-cluster.sh, namely kube-proxy and the kubelet daemon.

When kube-proxy starts, it looks up its node's information, and that
information is registered by the kubelet daemon. However, in the
shell script, the kube-proxy service started before the kubelet daemon.

This patch changes the boot order so the kubelet daemon starts before
kube-proxy, and checks that the node status is ready before starting
kube-proxy.
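The readiness check described above can be sketched roughly as follows. This is an illustrative sketch only, not the exact patch: the function name `wait_node_ready` and the polling parameters (2-second interval, 30-second total, matching the numbers discussed later in this thread) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: poll every 2 seconds, up to 30 seconds total,
# until the kubelet has registered the node, then let kube-proxy start.
wait_node_ready() {
  local node="${1:-127.0.0.1}"
  local tries=15 delay=2
  for _ in $(seq 1 "${tries}"); do
    # kubectl exits non-zero until the kubelet registers the node
    if kubectl get node "${node}" >/dev/null 2>&1; then
      echo "node ${node} registered; safe to start kube-proxy"
      return 0
    fi
    sleep "${delay}"
  done
  echo "timed out waiting for node ${node}" >&2
  return 1
}
```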

Which issue(s) this PR fixes:
Fixes #81879

Special notes for your reviewer:
no

Does this PR introduce a user-facing change?:
no

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
no

@k8s-ci-robot
Contributor

@zhlhahaha: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Sep 6, 2019
@k8s-ci-robot
Contributor

Hi @zhlhahaha. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 6, 2019
@zhlhahaha
Contributor Author

/assign @vishh

@BenTheElder
Member

/assign @dims

@lubinszARM

/ok-to-test

@k8s-ci-robot
Contributor

@lubinszARM: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.


In response to this:

/ok-to-test


@dims
Member

dims commented Sep 9, 2019

/ok-to-test
/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 9, 2019
@zhlhahaha
Contributor Author

/retest

1 similar comment
@zhlhahaha
Contributor Author

/retest

@MikeSpreitzer
Member

/test pull-kubernetes-local-e2e

@zhlhahaha
Contributor Author

/retest

@zhlhahaha
Contributor Author

/test pull-kubernetes-local-e2e

@YuikoTakada
Contributor

Thank you for the PR. I'm facing the same issue. Previously this error didn't occur, so there may be some other underlying problem, but this PR seems useful as a workaround :)

@MikeSpreitzer
Member

/retest

@zhlhahaha
Contributor Author

@MikeSpreitzer Hi, I get your idea.
Yes, I tried just swapping the boot order, but that requires a sleep for the kubelet service to start; otherwise kube-proxy may fail to start.
For the pull-kubernetes-local-e2e test, I can only see a "time out" leading to test failure in the log file. I also checked other PRs that changed local-up-cluster.sh, and this test failed there too with the same error message, e.g. #81268.
Those PRs just skipped the pull-kubernetes-local-e2e test. Is there any way to get more information about the test?

@MikeSpreitzer
Member

@zhlhahaha : what exactly do you mean by "It needs to set sleep time for kubelet service start, otherwise kubeproxy may fail to start"? If the only change is to swap the order then there is no sleep. I wonder whether by "may" you mean there is a general worry, or an observed problem. Did you test a change that only swaps the order, does not introduce a wait, and observe that the kube-proxy still erred due to lack of finding its Node?

Regarding debugging the e2e failure, you can find leads in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/testing.md

@zhlhahaha
Contributor Author

> …does not introduce a wait, and observe that the kube-proxy still erred due to lack of finding its Node?

Hi Mike,
Of course, we need to wait for the kubelet daemon to collect node information, and that is why I added a wait function in this PR. A similar function can be found in start_apiserver in local-up-cluster.sh, where we also need to wait for the apiserver to come up.

I do understand your idea, and I agree #81880
is a good solution, as it can help kube-proxy bridge the gap while the kubelet daemon is collecting information.

I do not think the two solutions conflict with each other.
My PR has two purposes:

  1. Establish the correct boot order for the kubelet and kube-proxy.
  2. Give users more insight into the local cluster boot process: the user gets a clear hint when the kube-proxy service fails to start because it cannot get node info.

Thanks for your suggestion on the e2e test failure; it may take some time to learn how to debug it.
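The apiserver wait mentioned above can be sketched as a generic URL poller. This is a hedged sketch: the helper name `wait_for_url` and its parameters are illustrative, not necessarily the script's actual code.

```shell
# Illustrative URL-polling helper, analogous to how local-up-cluster.sh
# waits for the apiserver health endpoint before proceeding.
wait_for_url() {
  local url="$1" tries="${2:-15}" delay="${3:-2}"
  for _ in $(seq 1 "${tries}"); do
    # -k: tolerate the local cluster's self-signed cert; -s/-f: quiet, fail on HTTP errors
    if curl -ksf "${url}" >/dev/null 2>&1; then
      return 0
    fi
    sleep "${delay}"
  done
  return 1
}
```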

@MikeSpreitzer
Member

MikeSpreitzer commented Sep 19, 2019

Yes, I understand the virtue of adding a wait in the local-up-cluster script. But I am concerned by the consistent failures of pull-kubernetes-local-e2e. Could the added wait somehow be causing the failures of pull-kubernetes-local-e2e?

Since kube-proxy has a wait inside itself (note that #81880 only increases the duration of an existing wait), it is not obvious to me that adding an additional wait is necessary. I understand why waiting and reporting on the outcome in the local-up-cluster script is helpful to users. My question is, if we only swap the order and rely on the 5-try wait already in kube-proxy, is that sufficient to make local-up-cluster succeed?

@zhlhahaha
Contributor Author

> Yes, I understand the virtue of adding a wait in the local-up-cluster script. But I am concerned by the consistent failures of pull-kubernetes-local-e2e. Could the added wait somehow be causing the failures of pull-kubernetes-local-e2e?
>
> Since kube-proxy has a wait inside itself (note that #81880 only increases the duration of an existing wait), it is not obvious to me that adding an additional wait is necessary. I understand why waiting and reporting on the outcome in the local-up-cluster script is helpful to users. My question is, if we only swap the order and rely on the 5-try wait already in kube-proxy, is that sufficient to make local-up-cluster succeed?

Hi Mike,
The 5-try wait is not long enough for the kubelet to collect node information, so kube-proxy fails to start; only swapping the order does not work either.
For the wait function, to avoid unnecessary wait time, I check the node status every 2 seconds and continue as soon as the check passes; the total wait time is 30 seconds. I am not sure why the pull-kubernetes-local-e2e test always fails. I need to build a test environment and give it a try.

@zhlhahaha
Contributor Author

/retest

@BenTheElder
Member

#83792
/retest

@BenTheElder
Member

/uncc
/cc @dims

@k8s-ci-robot k8s-ci-robot requested a review from dims October 11, 2019 22:42
@zhlhahaha
Contributor Author

zhlhahaha commented Oct 12, 2019

#83792
/retest

Thanks Ben, I have been stuck here for a long time.

@lubinsz

lubinsz commented Oct 12, 2019

/kind cleanup
/sig testing

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Oct 12, 2019
@dims
Member

dims commented Oct 12, 2019

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 12, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, zhlhahaha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2019
@zhlhahaha
Contributor Author

/retest

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

1 similar comment
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 457fa6b into kubernetes:master Oct 12, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 12, 2019
ramineni added a commit to ramineni/openlab-zuul-jobs that referenced this pull request Oct 15, 2019
Recent changes in k8s, kubernetes/kubernetes#82413, check for
KUBELET_HOST when getting node info, which results in an error.
This commit updates the same.
ohsewon pushed a commit to ohsewon/kubernetes that referenced this pull request Oct 16, 2019
local-up-cluster kube-proxy terminated error

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kube-proxy gives up too soon waiting for node registration in hack/local-up-cluster.sh

10 participants