Bumped the number of times a node tries to lookup itself #81880
k8s-ci-robot merged 1 commit into kubernetes:master
Conversation
Increased the number of tries in pkg/util/node/node.go::GetNodeIP by 1, because the kube-proxy was giving up too early. This is meant to address kubernetes#81879
@kubernetes/sig-node-bugs
Thanks for your PR here :)
I'm not entirely opposed to this solution. However, it does feel like we're implementing a bit of a stop-gap. I wonder, did you explore how difficult it would be to make this value a parameter that the user can specify? I think a parameter is justifiable here; acceptable wait time does seem like it would vary based on different use cases and the different environments in which the node is running.
No, I did not consider adding a parameter. Before asking about the difficulty of implementing it I wonder about the advisability of doing it. Do we really want to increase the operational complexity of the relevant process(es)? Would it be sufficient to observe that the revised fixed time limit causes no problems?
On Sat, Aug 24, 2019 at 07:54:51AM -0700, Mike Spreitzer wrote:
No, I did not consider adding a parameter. Before asking about the difficulty of implementing it I wonder about the advisability of doing it. I am not enthusiastic about increasing the complexity of invoking the relevant processes. As long as the revised fixed time limit does not conflict with other time limits, I would rather do that and keep it simpler for operators.
Definitely see your point and also don't want to increase complexity :)
At the same time, my concern is that currently updating this value requires a new k8s release. So if someone was in a similar situation to yourself, but needed to use 7 instead of 6, they'd need to wait 3 months for that to be released.
I wonder if a nice middle ground is a parameter with a sensible default of 6? For users who are happy with 6, nothing has to change. But for users who need a different value, they don't need to wait for a k8s release.
What do you think?
@mattjmcnaughton : that's what I thought you meant in the first place. But I see your point that if a problem is discovered then it takes a long time to fix. I will sadly draft a revision that adds a parameter.
Also an option to wait for someone else to chime in with their thoughts - I'm just one opinion :)
2386327 to 9dfe281
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/test pull-kubernetes-bazel-build
/cc @yliaog
9dfe281 to 3bb3db1
OK, I changed the branch in this PR back to pointing at simply bumping the count.
/remove-kind api-change
/priority important-soon
I am +1 on a simple fix. Why 6 and not 10, though? What is the practical downside of just waiting longer? It looks like this is just used by kube-proxy, so adding a retries param to GetNodeIP() seems viable, too (if we want to keep this function really generic), but I don't see why we NEED to do that absent a real use case.
@thockin : I was nervous about waiting longer because, when I first started working on a related issue, I heard about higher-level management logic with timeouts that are longer than the current timeouts but not on the order of 500 seconds, and I did not want to change the way this logic interacts with that.
"6" (which translates to a total wait of 31 seconds) seems to work in practice. If we really needed to bump beyond that then we should probably rework the way that |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: MikeSpreitzer, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest
Thanks for popping this back onto my radar. Not sure why this never completed. @BenTheElder can you enlighten? We'll need to cherry-pick this to 1.17
I think this PR was opened before this presubmit was running or required, and it has had no code changes since then, so the job was never started. I did enable it to run and report without blocking merges for O(months) before we made it blocking. Prow handles this poorly. IIRC there actually is support for triggering newly required jobs, but that isn't enabled for some reason.
/retest
/retest
Review the full test history for this PR. Silence the bot with an
…pstream-release-1.17 Automated cherry pick of #81880: Bumped the number of times a node tries to lookup itself
What type of PR is this?
/kind bug
What this PR does / why we need it:
Increased the number of tries in pkg/util/node/node.go::GetNodeIP by 1, because the kube-proxy was giving up too early.
Based on early feedback, this PR went further (but no longer does):
Which issue(s) this PR fixes:
Fixes #81879
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: