bpf: pid1: Pin reference to BPF programs for post-coldplug #17495
Conversation
Since I'm unfamiliar with this area of cgroups+bpf, I'm not going to comment/review beyond obvious programming errors, which seem to be addressed AFAICT.
keszybz
left a comment
Looks OK.
This doesn't look right. While it addresses daemon-reload, it doesn't fix things on daemon-reexec. This needs to be solved differently: the bpf fds need to be serialized and deserialized, i.e. the serializing functions have the "FDSet" parameter that fds can be added to, which shall be passed across the daemon-reexec/daemon-reload. The BPF attachment fds need to be added there, and their numbers serialized. Then, after coming back, the fds must be pulled out of the FDSet the deserializing functions get handed in.
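The FDSet mechanism described here boils down to keeping file descriptors open across an exec() of the daemon and handing their numbers to the new instance. A minimal sketch of that principle, with a hypothetical helper name rather than systemd's actual FDSet API:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not the real systemd API): clear FD_CLOEXEC so the
 * descriptor survives a re-exec of the daemon. systemd's FDSet rests on
 * this principle: fds added to the set during serialization stay open
 * across daemon-reexec, their numbers are written to the serialization
 * stream, and the deserializer pulls them back out on the other side. */
static int fd_keep_across_exec(int fd) {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0)
                return -1;
        return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

Without this step, a BPF program or cgroup attachment fd opened with O_CLOEXEC would be closed by the kernel at re-exec time, releasing the last reference and detaching the program.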
yuwata
left a comment
I am not sure about the implementation details. Several nitpicky comments.
Hey @keszybz, in case it got lost, could you clarify this? I have this running on all FB servers and BPF program pinning for reexec works just fine with the FD passing as far as I can see:
During `daemon-reload` and `daemon-reexec`, we detach and reattach all
BPF programs attached to cgroups. This, however, poses a real practical
problem for DevicePolicy (and some other settings using BPF): it
presents a period of time where the old device filtering BPF program has
been unloaded, but the new one has not been loaded yet.
Since the filtering is at open() time, it has become apparent that
there's a non-trivial period where applications inside that ostensibly
filtered cgroup can grab any device -- and often do so -- and then
retain access to that device even after the reload is over. Due to the
file continuing to be available after the initial open(), this issue is
particularly visible for DevicePolicy={strict,closed}, however it also
applies to other BPF programs we install.
In particular, for BPF ingress/egress filtering this may have more
concerning implications: network traffic which is supposed to be
filtered will -- for a very brief period of time -- not be filtered or
subject to any restrictions imposed by BPF.
These BPF programs are fundamentally attached to a cgroup lifetime, not
our unit lifetime, so it's enough to pin these programs by taking a
reference to affected BPF programs before reload/reexec. We can then
serialise the program's kernel-facing FD and cgroup attachment FD for
the new daemon, and have the daemon on the other side unpin the programs
after it's finished with coldplug.
That means that, for example, the BPF program lifecycle during
daemon-reload or daemon-reexec changes from this:
manager_clear_jobs_and_units
│
╔══════╪═════════╤═══════╗
║ prog │ no prog │ prog' ║
╚══════╧═════════╪═══════╝
│
manager_coldplug
to this:
manager_clear_jobs_and_units manager_dispatch_cgroup_realize_queue
│ │
╔══════╪═══════════════╤═══════════════════════╪═══════╗
║ prog │ prog (orphan) │ prog (orphan) + prog' │ prog' ║
╚══════╧═══════════════╪═══════════════════════╧═══════╝
│
manager_coldplug
For daemon-reexec the semantics are mostly the same, but the point at
which the program becomes orphan is tied to the process lifecycle
instead.
None of the BPF programs we install require exclusive access, so having
multiple instances of them running at the same time is fine. Custom
programs, of course, are unknown, but it's hard to imagine legitimate
cases which should be affected, whereas the benefits of this "overlap"
approach with reference pinning are immediately tangible.
[keszybz: use _cleanup_ for unpin, use FOREACH_POINTER]
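The pin/unpin lifecycle the commit message describes can be sketched with a plain reference count. The names below are hypothetical stand-ins, not the real BPFProgram API; in systemd the object additionally carries the kernel-facing program fd and the cgroup attachment fd, and detachment happens when the last reference goes away:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for BPFProgram: the program stays attached while n_ref > 0. */
typedef struct Prog {
        unsigned n_ref;
} Prog;

static Prog *prog_new(void) {
        Prog *p = calloc(1, sizeof(Prog));
        if (p)
                p->n_ref = 1;
        return p;
}

/* "Pinning" before reload/reexec is just taking an extra reference. */
static Prog *prog_ref(Prog *p) {
        if (p)
                p->n_ref++;
        return p;
}

/* Returns NULL once the last reference is dropped; in the real code this
 * is the point where the program would be detached from the cgroup. */
static Prog *prog_unref(Prog *p) {
        if (!p)
                return NULL;
        if (--p->n_ref > 0)
                return p;
        free(p);
        return NULL;
}
```

The old manager takes the extra reference before clearing units, and the new manager drops it only after coldplug has attached the replacement program, which is what produces the "prog (orphan) + prog'" overlap in the diagram above.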
@mrc0mmand Hey! My only guess is the Arch failure is due to this: I don't know why the leak sanitiser should fire in dbus-daemon of all places, or why that should cause us to kill init, since (as far as I can tell...) this doesn't touch the code it inherits from libsystemd at all, but maybe the new behaviour tickles something in just the right way. I'm still unable to repro, so any chance of running
I also notice we already ignore dbus-daemon memory leaks in some other parts of the CI infra, although I don't know the history around that.

You can safely ignore the dbus message (it happens in all runs when dbus is shutting down and it doesn't affect the job in any way, at least from what I've seen). However, I'll look into it anyway to get rid of the error. The main culprit here is the kernel panic at the end: Unfortunately, I have no idea how to debug this in any meaningful way, since neither the journal nor the console log shows anything useful (but that may be just my inexperience with kernel debugging).

Thanks. So far I've not been able to repro this panic under ASAN/UBSAN, although I assume systemd is dying from some sanitizer failure. I also don't see anything obvious in the code, although maybe it's more insidious. I'll keep trying though.

I played around a bit and disabled LSan completely for the dbus service. However, even with dbus.service shutting down correctly, the kernel panic is still there:

Hmpf, it looks like the kernel panic is caused by a watchdog timing out:
Manager* manager_pin_all_cgroup_bpf_programs(Manager *m);
void manager_unpin_all_cgroup_bpf_programs(Manager *m);
DEFINE_TRIVIAL_CLEANUP_FUNC(Manager*, manager_unpin_all_cgroup_bpf_programs);
I'm wondering if *_pin_* is the best naming here. BPF has a pinning concept that is different from the one used here.
I got confused at the beginning because I thought manager_pin_all_cgroup_bpf_programs was pinning BPF objects to the BPF File System but it's doing a completely different thing.
Werkov
left a comment
Detaching the programs on reload really looks like a bug (assuming the programs are meant for restrictions). The proposed approach looks better, although it'd be tilted a bit to the other side -- there'll be a short period of time when the cgroup is more restricted, until the old prog is detached. If only there was a way to swap a prog atomically, BPF_F_REPLACE maybe? (Note this suggestion isn't part of the requested changes; those are for the inline comments.)
        continue;
}

bpf_program_ref(p);
The reference should only be taken on actual insertion into the set, shouldn't it?
(I guess this works so far, because one BPFProgram can only be attached to a single unit, but I wouldn't rely on it here.)
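The pattern the review asks for is easy to get right when the set's insert call reports whether an insertion actually happened. Below is a toy set sketching that contract (a positive return on insertion, 0 when the pointer was already present); this mirrors how systemd's real set_put() behaves, but the implementation here is purely illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define SET_MAX 16

/* Toy pointer set, fixed capacity, no hashing: enough to show the contract. */
typedef struct Set {
        void *items[SET_MAX];
        size_t n;
} Set;

/* Returns 1 when the pointer was inserted, 0 when already present,
 * -1 when full. The caller takes a reference only on a 1 return. */
static int set_put(Set *s, void *p) {
        for (size_t i = 0; i < s->n; i++)
                if (s->items[i] == p)
                        return 0;        /* already present: no new ref */
        if (s->n >= SET_MAX)
                return -1;
        s->items[s->n++] = p;
        return 1;                        /* inserted: ref it now */
}
```

With that contract the call site becomes `r = set_put(s, p); if (r < 0) return r; if (r > 0) bpf_program_ref(p);`, so a program reachable from two places can never be double-referenced by the pinning pass.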
        return 0;
}

void bpf_program_skeletonize(BPFProgram *p) {
(LOL, the name :-)
IIUC, this is a subset of bpf_program_free. Wouldn't it be more maintainable to call this function from bpf_program_free instead of repeating the code?
flags = (supported == BPF_FIREWALL_SUPPORTED_WITH_MULTI &&
         (u->type == UNIT_SLICE || unit_cgroup_delegate(u))) ? BPF_F_ALLOW_MULTI : 0;
flags = (supported == BPF_FIREWALL_SUPPORTED_WITH_MULTI) ? BPF_F_ALLOW_MULTI : 0;
I think this deserves a comment like "go with BPF_F_ALLOW_MULTI because we may encounter a program from before reload".
BTW I think this should also note in NEWS (and the commit log) that the minimum kernel version is 4.15, otherwise daemon-reload wouldn't refresh a once-assigned program.
Actually, when I think about it further, it makes BPF_FIREWALL_SUPPORTED_WITH_MULTI useless (pick one: daemon-reload or attached up-to-date BPF progs). Or am I too pessimistic here?
Thanks @mauriciovasquezbernal and @Werkov for the feedback -- I'm a bit stuck working on some other stuff at the moment but will rebase and come back to this in due course :-)

Pushing forward to the v249 milestone, given we are already at the -rc3 stage, so a bit late for full features. We should really sort the BPF story out in v249 though :-)

Closing, as #19851 was merged.