Skip to content

Fetch MTU from the nodeIP interface#10635

Merged
joestringer merged 1 commit intocilium:masterfrom
manuelbuil:mtuSolution
Jun 3, 2020
Merged

Fetch MTU from the nodeIP interface#10635
joestringer merged 1 commit intocilium:masterfrom
manuelbuil:mtuSolution

Conversation

@manuelbuil
Copy link
Copy Markdown
Contributor

@manuelbuil manuelbuil commented Mar 19, 2020

The detected MTU is always the one belonging to the interface that acts
as gateway. That interface does not necessarily need to be the one that
holds nodeIP and thus the one that Cilium will use to send traffic
inside the k8s cluster. This patch fixes this and detects the mtu of the
interface used for cluster communications

Fixes: #10309
Signed-off-by: Manuel Buil mbuil@suse.com

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Thanks for contributing!

The detected MTU is always the one belonging to the interface that acts
as gateway. That interface does not necessarily need to be the one that
holds nodeIP and thus the one that Cilium will use to send traffic
inside the k8s cluster. This patch fixes this and detects the mtu of the
interface used for cluster communications

Fixes: #10309

Autodetection of the mtu correctly detects the mtu of the interface used for the kubernetes cluster communication. The mtu was incorrectly detected in cases where multiple interfaces were present and the gateway interface was not the one used for kubernetes cluster communication

@manuelbuil manuelbuil requested review from a team March 19, 2020 15:44
@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@joestringer
Copy link
Copy Markdown
Member

Travis failure seems relevant:

# github.com/cilium/cilium/daemon/cmd
vet: daemon/cmd/status_test.go:83:96: too few arguments in call to mtu.NewConfiguration
make[1]: *** [govet] Error 2
make[1]: Leaving directory `/home/travis/gopath/src/github.com/cilium/cilium'
make: *** [unit-tests] Error 2

@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

@coveralls
Copy link
Copy Markdown

coveralls commented Mar 19, 2020

Coverage Status

Coverage increased (+0.03%) to 36.931% when pulling 9d985f1 on manuelbuil:mtuSolution into 601852b on cilium:master.

Copy link
Copy Markdown
Member

@vadorovsky vadorovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit. Otherwise looks good.

Comment thread pkg/mtu/detect_linux.go Outdated
Copy link
Copy Markdown
Member

@aanm aanm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrfastab PTAL as well

Comment thread pkg/mtu/detect_linux.go Outdated
@aanm aanm requested a review from jrfastab March 20, 2020 09:31
@manuelbuil
Copy link
Copy Markdown
Contributor Author

test-me-please

2 similar comments
@aanm
Copy link
Copy Markdown
Member

aanm commented Mar 20, 2020

test-me-please

@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

@vadorovsky
Copy link
Copy Markdown
Member

Travis hit the flake #10615. we can ignore it

@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

Comment thread daemon/cmd/daemon.go Outdated
Comment thread daemon/cmd/daemon.go Outdated
@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

@vadorovsky
Copy link
Copy Markdown
Member

Hitting the flake #10882

@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

Copy link
Copy Markdown
Member

@vadorovsky vadorovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@vadorovsky
Copy link
Copy Markdown
Member

@tgraf PTAL

@vadorovsky
Copy link
Copy Markdown
Member

restart-ginkgo

@joestringer
Copy link
Copy Markdown
Member

@mrostecki Be aware there are at least three different types of issues with some of the services /nodeports tests , make sure you check the exit code and ensure it matches the corresponding issue. I can't tell if the failure seen before was #10882 because there's no link any more, but that one saw exit code 7 and was only seen in PRs that updated the cilium-runtime image so I suspect this saw a different issue, such as #10942 .

@joestringer
Copy link
Copy Markdown
Member

Yes, this looks to be the case:
https://jenkins.cilium.io/job/Cilium-PR-K8s-oldest-net-next/295/

Timeout has been exceeded

I'm aware there has been some provider issues combined with this particular job having more tests than the others which can lead to this situation. Given the code changes here I think it's fairly unlikely to actually introduce a kernel-specific regression on net-next; it's also passed all the other jobs so as long as we're fine with the latest code changes then this should be good to merge.

@joestringer
Copy link
Copy Markdown
Member

@manuelbuil are you still making any changes? If not, please mark "ready to review".

@manuelbuil
Copy link
Copy Markdown
Contributor Author

@manuelbuil are you still making any changes? If not, please mark "ready to review".

No, patch is ready to review. I can't add or remove any label, so not sure how to mark it as ready to review or change the draft status. Can you help me?

@manuelbuil manuelbuil marked this pull request as ready for review May 25, 2020 09:16
@maintainer-s-little-helper
Copy link
Copy Markdown

Commit ac4bb90ab113c26e642764890e7d8522d75c70a1 does not contain "Signed-off-by".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper Bot added dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. and removed dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. labels May 25, 2020
@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

@vadorovsky
Copy link
Copy Markdown
Member

https://jenkins.cilium.io/job/Cilium-PR-K8s-newest-kernel-4.9/358 failed with

12:38:10  • Failure [298.763 seconds]
12:38:10  K8sDatapathConfig
12:38:10  /home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:436
12:38:10    Etcd
12:38:10    /home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:436
12:38:10      Check connectivity [It]
12:38:10      /home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/ginkgo-ext/scopes.go:471
12:38:10  
12:38:10      cilium pre-flight checks failed
[2020-05-25T10:38:10.986Z]     Expected
[2020-05-25T10:38:10.986Z]         <*errors.errorString | 0xc00025e130>: {
[2020-05-25T10:38:10.986Z]             s: "Cilium validation failed: 4m0s timeout expired: Last polled error: connectivity health is failing: Cluster connectivity is unhealthy on 'cilium-s7rdp': Exitcode: 255 \nStdout:\n \t \nStderr:\n \t Error: Cannot get status/probe: Put \"http://%2Fvar%2Frun%2Fcilium%2Fhealth.sock/v1beta/status/probe\": context deadline exceeded\n\t \n\t command terminated with exit code 255\n\t \n",
[2020-05-25T10:38:10.986Z]         }
[2020-05-25T10:38:10.986Z]     to be nil
12:38:10  
12:38:10      /home/jenkins/workspace/Cilium-PR-K8s-newest-kernel-4.9/k8s-1.18-gopath/src/github.com/cilium/cilium/test/helpers/manifest.go:313

Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall the idea of taking the externalIP from k8s and using that to determine which device provides 'external' connectivity (and hence governs MTU) seems like a reasonable approach. I wish there were a better way to figure this out than iterating all IPs associated with all devices, but I'm not sure the kernel gives us any better APIs.

I have a bunch of minor nits that should be easy to address below.

Comment thread pkg/mtu/detect_linux.go Outdated
Comment thread pkg/mtu/detect_linux.go Outdated
Comment thread pkg/mtu/detect_linux.go Outdated
Comment thread pkg/mtu/mtu.go Outdated
Comment thread pkg/mtu/detect_linux.go Outdated
Comment thread daemon/cmd/daemon.go Outdated
Copy link
Copy Markdown
Contributor Author

@manuelbuil manuelbuil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @joestringer

Comment thread pkg/mtu/mtu.go Outdated
Comment thread daemon/cmd/daemon.go Outdated
Comment thread pkg/mtu/detect_linux.go Outdated
@manuelbuil manuelbuil requested a review from a team as a code owner May 29, 2020 08:13
@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

@aanm aanm requested a review from joestringer May 29, 2020 13:59
@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

The detected MTU is always the one belonging to the interface that acts
as gateway. That interface does not necessarily need to be the one that
holds nodeIP and thus the one that Cilium will use to send traffic
inside the k8s cluster. This patch fixes this and detects the mtu of the
interface used for cluster communications

The extra calculation of the correct MTU must be done before the daemon
is created and thus part of the K8s initialization code must be moved
before the daemon creation. The reason is that the Kubernetes IP for the
node is calculated during the K8s initialization.

Fixes: #10309

Signed-off-by: Manuel Buil <mbuil@suse.com>
@vadorovsky
Copy link
Copy Markdown
Member

test-me-please

@vadorovsky vadorovsky added kind/bug This is a bug in the Cilium logic. area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. labels Jun 3, 2020
@vadorovsky
Copy link
Copy Markdown
Member

I guess that the GKE test failure can be ignored.

@maintainer-s-little-helper maintainer-s-little-helper Bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 3, 2020
@joestringer joestringer merged commit 507d9b8 into cilium:master Jun 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MTU autoDetect() fetches always the MTU from the gateway interface

10 participants