Skip to content

Consolidate network namespace handling#29993

Merged
ti-mo merged 3 commits intocilium:mainfrom
bleggett:bleggett/consolidate-netns-27449
Feb 29, 2024
Merged

Consolidate network namespace handling#29993
ti-mo merged 3 commits intocilium:mainfrom
bleggett:bleggett/consolidate-netns-27449

Conversation

@bleggett
Copy link
Copy Markdown
Contributor

@bleggett bleggett commented Dec 19, 2023

Fixes: #27449

This is the first step towards fixing #27449 and #27450 - Removing dependence on ip netns from the internal netns lib, and consolidating netns creation/handling into one lib/spot, while not requiring mount pins for every netns creation, the way we currently do - most things (incl. tests) rarely need this.

This PR removes a few netns libs from the repo:

  • containernetworking/plugins/pkg/ns
  • vishvananda/netns from all but one spot (pkg/datapath/linux/devices_controller.go, where a sister lib wants a typecast)

in favor of an internal lib "github.com/cilium/cilium/pkg/netns" which avoids pinning netnses, and only has a dep on the unix package.

General goals of the github.com/cilium/cilium/pkg/netns package:

  • To be used in all Cilium code that needs to create new netnses.
  • Supports creating unpinned netnses owned by Cilium code.
  • Supports running closures in netnses previously pinned outside of Cilium code, by other processes (needed for CNI)
  • No support for pinning netnses owned by Cilium code, as we never actually need to do that.
  • Stick to stdlib sys/unix

Looking for:

  • Feedback on pkg/netns API surface/impl/general approach

Generally the intent is to make sure that code that creates netnses is solely responsible for closing them.

Code that is running out-of-process from the process that is the netns owner sometimes does need to run some closure in the netns context, but should not be burdened with netns lifecycle concerns, or be responsible for closing/cleanup, in virtually all cases. The API surface is designed to discourage the current overreliance on named-pin-based disposal/cleanup.

cc @ti-mo to make sure I interpreted his intent correctly.

@bleggett bleggett requested review from a team as code owners December 19, 2023 23:05
@maintainer-s-little-helper maintainer-s-little-helper Bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 19, 2023
@github-actions github-actions Bot added the kind/community-contribution This was a contribution made by a community member. label Dec 19, 2023
@ti-mo ti-mo marked this pull request as draft December 20, 2023 12:32
@ti-mo ti-mo self-requested a review December 20, 2023 12:33
@ti-mo ti-mo changed the title [WIP] Initial step towards fixing #27449 Consolidate network namespace handling Dec 20, 2023
Copy link
Copy Markdown
Contributor

@ti-mo ti-mo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up! I've renamed the PR to something more descriptive and left a few preliminary comments.

Comment thread pkg/netns/netns.go Outdated
Comment thread pkg/netns/netns.go Outdated
Comment thread pkg/netns/netns.go Outdated
Comment thread pkg/netns/netns.go Outdated
Comment thread pkg/netns/netns.go Outdated
Comment thread pkg/netns/netns.go Outdated
Comment thread pkg/netns/netns.go Outdated
Comment thread pkg/netns/netns.go Outdated
Comment thread cilium-dbg/cmd/cleanup.go Outdated
Comment thread pkg/datapath/linux/node_linux_test.go Outdated
@ti-mo ti-mo added kind/enhancement This would improve or streamline existing functionality. release-note/misc This PR makes changes that have no direct user impact. labels Dec 20, 2023
@maintainer-s-little-helper maintainer-s-little-helper Bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 20, 2023
@bleggett bleggett force-pushed the bleggett/consolidate-netns-27449 branch 3 times, most recently from b0b40c3 to ea1d9eb Compare January 9, 2024 23:29
@bleggett bleggett marked this pull request as ready for review January 9, 2024 23:33
@bleggett bleggett requested a review from a team as a code owner January 9, 2024 23:33
@bleggett bleggett requested review from sayboras and ti-mo January 9, 2024 23:33
@bleggett
Copy link
Copy Markdown
Contributor Author

bleggett commented Jan 9, 2024

Back from break, should be good for review.

Still working on test failures.

@bleggett bleggett force-pushed the bleggett/consolidate-netns-27449 branch from 64309b4 to a6c76f2 Compare January 10, 2024 22:15
@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 5, 2024

/test

@bleggett
Copy link
Copy Markdown
Contributor Author

bleggett commented Feb 8, 2024

Resynced w/main - I'm good if you all are.

Comment thread cilium-dbg/cmd/post_uninstall_cleanup.go
Comment thread cilium-health/launch/endpoint.go Outdated
Comment thread plugins/cilium-cni/cmd/cmd.go
squeed
squeed previously requested changes Feb 13, 2024
Copy link
Copy Markdown
Contributor

@squeed squeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks basically good; I'd like to see it split in to a few more commits. One for the new API, one to move all test code, one for trivial changes, non-trivial changes in individual commits, and lastly one to remove old code. It would make reviewing a lot easier. Thanks!

Comment thread cilium-health/launch/endpoint.go
Comment thread cilium-health/launch/endpoint.go
@maintainer-s-little-helper
Copy link
Copy Markdown

Commits cad2332, 0d0b0b9, e979d13 do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@bleggett
Copy link
Copy Markdown
Contributor Author

Looks basically good; I'd like to see it split in to a few more commits. One for the new API, one to move all test code, one for trivial changes, non-trivial changes in individual commits, and lastly one to remove old code. It would make reviewing a lot easier. Thanks!

@squeed I at least split into 3, that helps a bit:

  • Impl + tests
  • Non-test changes
  • Deleting old

(mergability check will fail but HTH, ty)

@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 20, 2024

@bleggett Thanks for picking this back up, I fell sick a few weeks ago and didn't manage to continue.

I'm taking another look at this now, there's an issue with makeNetlinkFuncs() that's causing unit tests to fail and presumably also impacts the agent. The netlinkFuncs.*Subscribe members are called lazily, so makeNetlinkFuncs needs to take out a reference to the current netns and stuff it into each of the closures. This needs to be an actual NetNS reference so it doesn't get GC'd.

Also, the Build Commits workflow requires make build works for all commits individually, so I think the middle 2 commits need to be squashed together.

I'll push this up soon.

Edit: I've added the netns.Current() function to expose getCurrent() for use in makeNetlinkFuncs().

@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 20, 2024

/test

@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 20, 2024

The Build Commits workflow was still failing with the containernetworking removal split off as a separate commit, since it left two implementations of GetNetNSCookie in the implementation commit. I've squashed it back together and left the cilium-dbg cleanup code removal as a separate commit. That's the only thing that stood out as being separable from the other changes.

@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 20, 2024

/test

1 similar comment
@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 20, 2024

/test

@bleggett
Copy link
Copy Markdown
Contributor Author

@bleggett Thanks for picking this back up, I fell sick a few weeks ago and didn't manage to continue.

I'm taking another look at this now, there's an issue with makeNetlinkFuncs() that's causing unit tests to fail and presumably also impacts the agent. The netlinkFuncs.*Subscribe members are called lazily, so makeNetlinkFuncs needs to take out a reference to the current netns and stuff it into each of the closures. This needs to be an actual NetNS reference so it doesn't get GC'd.

Edit: I've added the netns.Current() function to expose getCurrent() for use in makeNetlinkFuncs().

Ah yep, this was an overzealous attempt to fully drop the final explicit "github.com/vishvananda/netns" dep. I'd initially left it like it is currently (explicitly taking the current and wrapping it in a github.com/vishvananda/netns.NetNS handle, as I wasn't sure it would take the right netns), then removed it and didn't see any local failures. I'm fine putting it back like this, it's how it was originally anyway.

@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 21, 2024

@bleggett Already done, take a look at the current pkg/datapath/linux/devices_controller.go.

Looks like CI is still failing even after fixing makeNetlinkFuncs in ci-runtime. Do you have any cycles to investigate the remaining failures?

@bleggett
Copy link
Copy Markdown
Contributor Author

bleggett commented Feb 21, 2024

@bleggett Already done, take a look at the current pkg/datapath/linux/devices_controller.go.

Looks like CI is still failing even after fixing makeNetlinkFuncs in ci-runtime. Do you have any cycles to investigate the remaining failures?

👍 Feel free to beat me to it if you want, otherwise I'll get to it in a day or two.

Copy link
Copy Markdown
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ipam changes lgtm

@bleggett
Copy link
Copy Markdown
Contributor Author

bleggett commented Feb 27, 2024

Can I get someone to /test?

I can't repro any unit test failures vis a vis main but need to check the conformance suites.

@youngnick
Copy link
Copy Markdown
Contributor

/test

@bleggett
Copy link
Copy Markdown
Contributor Author

@ti-mo et al - tests seem good to me locally, and the conformance tests are passing now.

I note that one each of the gateway API and ginkgo subtests are failing, I do not think that's related to this, but am happy to be corrected.

This should be good to merge if everyone is happy.

@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 29, 2024

@bleggett Thanks for taking a look! What did you end up changing to address the failures?

Rebasing onto main to pick up #31017 for https://github.com/cilium/cilium/actions/runs/8072988990/job/22055814369.

ti-mo and others added 3 commits February 29, 2024 13:07
This allows surfacing unexpected errors, but ignores interfaces that
are already absent.

Signed-off-by: Timo Beckers <timo@isovalent.com>
This was already redundant when running Cilium inside a container, since
the nsfs instance inside the container is bound to the container's
lifecycle. Running Cilium outside of a container is currently rather
involved and not officially supported. Remove the netns cleanup code.

When the container exits, cilium-health and its enclosing namespace also
disappears.

Follow-up commits will remove the code in package netns being called here.

Signed-off-by: Timo Beckers <timo@isovalent.com>
The previous netns package had a few problems. It shelled out to iproute2,
and it depended on both containernetworking/plugins/pkg/ns and vishvananda/netns,
which lead to some idiosyncratic API.

This commit addresses these issues and takes care of some much-needed API cleanup:

- Create a new netns with New()
- Open an existing pinned netns with OpenPinned()
- Execute code within the netns with ns.Do()
- Close with Close()

Pinning network namespaces is not supported, as there is currently little reason
for doing so. In case the requirement pops up again later, it can always be added.

All tests now use anonymous (non-pinned) network namespaces, and the netns created
for cilium-health also no longer leaves an entry in /var/run/netns.

Signed-off-by: Benjamin Leggett <benjamin.leggett@solo.io>
Co-authored-by: Timo Beckers <timo@isovalent.com>
@ti-mo
Copy link
Copy Markdown
Contributor

ti-mo commented Feb 29, 2024

/test

Copy link
Copy Markdown
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from vendor team

@bleggett
Copy link
Copy Markdown
Contributor Author

@bleggett Thanks for taking a look! What did you end up changing to address the failures?

Rebasing onto main to pick up #31017 for https://github.com/cilium/cilium/actions/runs/8072988990/job/22055814369.

Absolutely nothing, just resynced to main 😅

@bleggett
Copy link
Copy Markdown
Contributor Author

TY all - apologies this took a bit of time 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/community-contribution This was a contribution made by a community member. kind/enhancement This would improve or streamline existing functionality. release-note/misc This PR makes changes that have no direct user impact.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Consolidate code for network namespace interactions

10 participants