Bumped the number of times a node tries to lookup itself #81880
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
Increased the number of tries in pkg/util/node/node.go::GetNodeIP by 1, because the kube-proxy was giving up too early. This is meant to address kubernetes#81879
@kubernetes/sig-node-bugs
Thanks for your PR here :)
I'm not entirely opposed to this solution. However, it does feel like we're implementing a bit of a stop-gap. I wonder, did you explore how difficult it would be to make this value a parameter that the user can specify? I think a parameter is justifiable here; acceptable wait time does seem like it would vary based on different use cases and the different environments in which the node is running.
No, I did not consider adding a parameter. Before asking about the difficulty of implementing it I wonder about the advisability of doing it. Do we really want to increase the operational complexity of the relevant process(es)? Would it be sufficient to observe that the revised fixed time limit causes no problems?
On Sat, Aug 24, 2019 at 07:54:51AM -0700, Mike Spreitzer wrote:
No, I did not consider adding a parameter. Before asking about the difficulty of implementing it I wonder about the advisability of doing it. I am not enthusiastic about increasing the complexity of invoking the relevant processes. As long as the revised fixed time limit does not conflict with other time limits, I would rather do that and keep it simpler for operators.
Definitely see your point and also don't want to increase complexity :)
At the same time, my concern is that currently updating this value requires a new k8s release. So if someone was in a similar situation to yourself, but needed to use 7 instead of 6, they'd need to wait 3 months for that to be released.
I wonder if a nice middle ground is a parameter with a sensible default of 6? For users who are happy with 6, nothing has to change. But for users who need a different value, they don't need to wait for a k8s release.
What do you think?
@mattjmcnaughton : that's what I thought you meant in the first place. But I see your point that if a problem is discovered then it takes a long time to fix. I will sadly draft a revision that adds a parameter.
Also an option to wait for someone else to chime in with their thoughts - I'm just one opinion :)
2386327 to 9dfe281
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/test pull-kubernetes-bazel-build
/cc @yliaog
9dfe281 to 3bb3db1
OK, I changed the branch in this PR back to pointing at simply bumping the count.
/remove-kind api-change
/priority important-soon
I am +1 on a simple fix. Why 6 and not 10, though? What is the practical downside of just waiting longer? It looks like this is just used by kube-proxy, so adding a retries param to GetNodeIP() seems viable, too (if we want to keep this function really generic), but I don't see why we NEED to do that absent a real use case.
@thockin : I was nervous about waiting longer because, when I first started working on a related issue, I heard about higher-level management logic with timeouts that are longer than the current timeouts but not on the order of 500 seconds, and I did not want to change the way this logic interacts with that.
"6" (which translates to a total wait of 31 seconds) seems to work in practice. If we really needed to bump beyond that then we should probably rework the way that |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MikeSpreitzer, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
Thanks for popping this back onto my radar. Not sure why this never completed. @BenTheElder can you enlighten? We'll need to cherry-pick this to 1.17
I think this PR was opened before this presubmit was running or required, and it has had no code changes since then, so the job was never started. I did enable it to run and report without blocking merges for O(months) before we made it blocking. Prow handles this poorly. IIRC there actually is support for triggering newly required jobs, but that isn't enabled for some reason.
/retest
/retest
Review the full test history for this PR. Silence the bot with an
…pstream-release-1.17 Automated cherry pick of #81880: Bumped the number of times a node tries to lookup itself
What type of PR is this?
/kind bug
What this PR does / why we need it:
Increased the number of tries in pkg/util/node/node.go::GetNodeIP by 1, because the kube-proxy was giving up too early.
Based on early feedback, this PR went further (but no longer does):
Which issue(s) this PR fixes:
Fixes #81879
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: