Give UDP connections a chance to close gracefully#60074
Give UDP connections a chance to close gracefully#60074jsravn wants to merge 1 commit intokubernetes:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jsravn Assign the PR to them by writing The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
I'll be testing this out in our dev clusters to see if it helps w/ DNS timeouts. |
|
@jsravn could we delay the udp connection flush by some other param like terminationGracePeriodSeconds rather than hard coding it ? May be some users want that, immediate flush ? |
|
@krmayankk I added it as a flag to kube-proxy so it's configurable - setting to 0 disables, same as prior behavior. It is also similar to TCP connections in netfilter, they have a single timeout. I did think about using terminationGracePeriodSeconds as described in the linked issue. The problem is kube-proxy only knows about endpoints - and terminationGracePeriodSeconds is a pod field. It's doable but would require an API change probably - adding a new field to endpoint that is filled in by the endpoints controller. I think this is a good idea actually, just a bit more work and API change (which I'm not that familiar with doing). |
|
Using terminationGracePeriod also makes sense for the IPVS proxier (#57841). |
pkg/proxy/iptables/proxier.go
Outdated
pkg/proxy/iptables/proxier.go
Outdated
There was a problem hiding this comment.
You need to pass epSvcPair to the goroutine. See this for why.
4ac9734 to
5ac386c
Compare
Delay UDP connection tracking flush on endpoint removal. This gives termination grace period a chance to work, such as when kube-dns is restarted or redeployed. Otherwise in-flight requests will always time out whenever an endpoint removal occurs.
5ac386c to
369acae
Compare
|
@jsravn how did your testing go? We are facing some cantankerous issues with DNS timeouts in our cluster |
|
@bburket Not great I'm afraid. I still am getting dns errors on restarts, with this change cherry picked to 1.8. Either my cherry pick is broken or there is something else going on I don't quite understand yet. |
|
/ok-to-test |
|
@jsravn: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
We are seeing similar issues with kube-dns on cluster-autoscaler downscales. NodeJS seems to have particular pathological behaviour on DNS timeouts, so this is causing us some considerable pain. |
|
Sorry I got pulled off on other work, so haven't had time to work on this further. |
|
/assign Probably IPVS side needs a similar change. |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
@m1093782566 seems like we probably still want this...? |
|
Thanks but I would prefer #66012 which is up-to-date and fix both TCP and UDP issues. |
|
It looks like #66012 only contains IPVS changes though, while this one has iptables? |
|
oops, seems you are right. This PR covers iptables while #66012 covers IPVS - they do the different thing. |
|
xref: #71514 |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
/remove-lifecycle stale |
|
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
|
Going to close this - as a lot of behavior currently relies on the UDP connections being dropped immediately. IPVS proxier has also done the same recently. |
What this PR does / why we need it:
Delay UDP connection tracking flush on endpoint removal.
This gives termination grace period a chance to work, such as when
kube-dns is restarted or redeployed.
Otherwise in-flight requests will always time out whenever an endpoint
removal occurs.
We get DNS timeouts occasionally when kube-dns redeploys. I believe the immediate UDP flush is what causes it.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)format, will close the issue(s) when PR gets merged):For #45976
Special notes for your reviewer:
Release note: