Give UDP connections a chance to close gracefully by jsravn · Pull Request #60074 · kubernetes/kubernetes

jsravn · 2018-02-20T14:08:04Z

What this PR does / why we need it:

Delay UDP connection tracking flush on endpoint removal.

This gives the termination grace period a chance to work, such as when
kube-dns is restarted or redeployed.

Otherwise in-flight requests will always time out whenever an endpoint
removal occurs.

We get DNS timeouts occasionally when kube-dns redeploys. I believe the immediate UDP flush is what causes it.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
For #45976

Special notes for your reviewer:

Release note:

Delay UDP connection flush to give connections a chance to gracefully terminate.

k8s-ci-robot · 2018-02-21T11:14:05Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsravn
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: thockin

Assign the PR to them by writing /assign @thockin in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jsravn · 2018-02-21T11:22:25Z

I'll be testing this out in our dev clusters to see if it helps with DNS timeouts.

krmayankk · 2018-02-22T08:34:40Z

@jsravn could we drive the UDP connection flush delay from another parameter, such as terminationGracePeriodSeconds, rather than hard-coding it? Some users may still want the immediate flush.

jsravn · 2018-02-22T09:29:05Z

@krmayankk I added it as a flag to kube-proxy so it's configurable; setting it to 0 disables the delay, which is the same as the prior behavior. It is also similar to how netfilter treats TCP connections, which have a single timeout.

I did think about using terminationGracePeriodSeconds as described in the linked issue. The problem is that kube-proxy only knows about endpoints, and terminationGracePeriodSeconds is a pod field. It's doable but would probably require an API change: adding a new field to the endpoint that is filled in by the endpoints controller. I think this is actually a good idea, just a bit more work plus an API change (which I'm not that familiar with doing).

jsravn · 2018-02-22T09:31:59Z

Using terminationGracePeriod also makes sense for the IPVS proxier (#57841).

jhorwit2 · 2018-02-21T03:26:16Z

pkg/proxy/iptables/proxier.go

This looks like it'll fail because of this

jhorwit2 · 2018-02-26T02:21:03Z

pkg/proxy/iptables/proxier.go

You need to pass epSvcPair to the goroutine. See this for why.


bburket · 2018-03-01T08:44:43Z

@jsravn how did your testing go? We are facing some cantankerous issues with DNS timeouts in our cluster

jsravn · 2018-03-01T16:56:36Z

@bburket Not great, I'm afraid. I am still getting DNS errors on restarts with this change cherry-picked to 1.8. Either my cherry-pick is broken or there is something else going on that I don't quite understand yet.

BenTheElder · 2018-03-02T06:45:30Z

/ok-to-test

k8s-ci-robot · 2018-03-02T07:02:14Z

@jsravn: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-bazel-test	`369acae`	link	`/test pull-kubernetes-bazel-test`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

tcolgate · 2018-04-17T09:02:57Z

We are seeing similar issues with kube-dns on cluster-autoscaler downscales. NodeJS seems to have particularly pathological behaviour on DNS timeouts, so this is causing us considerable pain.

jsravn · 2018-04-17T09:30:54Z

Sorry, I got pulled onto other work, so I haven't had time to work on this further.

m1093782566 · 2018-06-06T05:39:00Z

/assign

The IPVS side probably needs a similar change.

fejta-bot · 2018-09-04T05:49:47Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

BenTheElder · 2018-09-04T06:26:37Z

@m1093782566 seems like we probably still want this...?
/remove-lifecycle stale

m1093782566 · 2018-09-04T06:34:30Z

@BenTheElder

Thanks, but I would prefer #66012, which is up to date and fixes both TCP and UDP issues.

BenTheElder · 2018-09-04T06:41:21Z

It looks like #66012 only contains IPVS changes though, while this one has iptables?

m1093782566 · 2018-09-04T06:44:12Z

Oops, it seems you are right. This PR covers iptables while #66012 covers IPVS, so they do different things.

m1093782566 · 2018-11-29T04:22:43Z

xref: #71514

fejta-bot · 2019-02-27T04:54:12Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

jsravn · 2019-03-01T09:58:44Z

/remove-lifecycle stale

fejta-bot · 2019-05-30T10:45:17Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

jsravn · 2019-05-31T12:37:10Z

Going to close this, as a lot of behavior currently relies on UDP connections being dropped immediately. The IPVS proxier has also done the same recently.

k8s-ci-robot requested review from dcbw and justinsb February 20, 2018 14:08

jsravn mentioned this pull request Feb 20, 2018

kube-dns: dnsmasq intermittent connection refused #45976

Closed

k8s-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) Feb 21, 2018

k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Feb 26, 2018

jhorwit2 suggested changes Feb 26, 2018

jsravn force-pushed the improve-udp-endpoint-handling branch from 4ac9734 to 5ac386c February 28, 2018 10:42

k8s-ci-robot removed the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Feb 28, 2018

jsravn force-pushed the improve-udp-endpoint-handling branch from 5ac386c to 369acae February 28, 2018 10:44

k8s-ci-robot removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Mar 2, 2018

rramkumar1 mentioned this pull request May 2, 2018

ipvs proxier doesn't respect graceful termination #57841

Closed

k8s-ci-robot assigned m1093782566 Jun 6, 2018

sergeylanzman mentioned this pull request Jun 9, 2018

[WIP] add grace period delay to ipvs proxier #64947

Closed

k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) Sep 4, 2018

k8s-ci-robot removed the lifecycle/stale label Sep 4, 2018

k8s-ci-robot added the lifecycle/stale label Feb 27, 2019

k8s-ci-robot removed the lifecycle/stale label Mar 1, 2019

k8s-ci-robot added the lifecycle/stale label May 30, 2019

jsravn closed this May 31, 2019


9 participants
