Conversation

@cdown (Member) commented Oct 30, 2020

During daemon-reload and daemon-reexec, we detach and reattach all BPF programs attached to cgroups. This, however, poses a real practical problem for DevicePolicy (and some other settings using BPF): it presents a period of time where the old device filtering BPF program has been unloaded, but the new one has not been loaded yet.

Since the filtering is at open() time, it has become apparent that there's a non-trivial period where applications inside that ostensibly filtered cgroup can grab any device -- and often do so -- and then retain access to that device even after the reload is over. Due to the file continuing to be available after the initial open(), this issue is particularly visible for DevicePolicy={strict,closed}, however it also applies to other BPF programs we install.
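To make the window concrete, a process inside such a unit could exploit it roughly like this (illustrative sketch only; the device path is an arbitrary example of something DevicePolicy would normally block):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
            int fd = -1;

            /* Keep retrying while the device filter is being torn down and
             * set back up during daemon-reload. */
            while (fd < 0) {
                    fd = open("/dev/sda", O_RDONLY);
                    if (fd < 0)
                            usleep(1000);
            }

            /* The cgroup device program only mediates open(): reads on an
             * already-open fd keep working even after the new filter is
             * attached. */
            char buf[512];
            (void) read(fd, buf, sizeof(buf));
            printf("still have access via fd %d\n", fd);
            return 0;
    }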

In particular, for BPF ingress/egress filtering this may have more concerning implications: network traffic which is supposed to be filtered will -- for a very brief period of time -- not be filtered or subject to any restrictions imposed by BPF.

These BPF programs are fundamentally tied to the cgroup lifetime, not our unit lifetime, so it's enough to pin them by taking a reference to the affected BPF programs before reload/reexec. We can then serialise each program's kernel-facing FD and cgroup attachment FD for the new daemon, and have the daemon on the other side unpin the programs after it's finished with coldplug.

That means that, for example, the BPF program lifecycle during daemon-reload or daemon-reexec changes from this:

manager_clear_jobs_and_units
             │
      ╔══════╪═════════╤═══════╗
      ║ prog │ no prog │ prog' ║
      ╚══════╧═════════╪═══════╝
                       │
                manager_coldplug

to this:

manager_clear_jobs_and_units         manager_dispatch_cgroup_realize_queue
             │                                       │
      ╔══════╪═══════════════╤═══════════════════════╪═══════╗
      ║ prog │ prog (orphan) │ prog (orphan) + prog' │ prog' ║
      ╚══════╧═══════════════╪═══════════════════════╧═══════╝
                             │
                          manager_coldplug

For daemon-reexec the semantics are mostly the same, but the point at which the program becomes an orphan is tied to the process lifecycle instead.

None of the BPF programs we install require exclusive access, so having multiple instances of them running at the same time is fine. Custom programs, of course, are unknown, but it's hard to imagine legitimate cases which should be affected, whereas the benefits of this "overlap" approach with reference pinning are immediately tangible.
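For illustration, the reload path is then expected to look roughly like this (simplified sketch with error handling and the serialization steps elided; only the pin/unpin function names match the patch, the rest is schematic):

    int manager_reload(Manager *m) {
            /* Keep the currently attached cgroup BPF programs referenced while
             * we tear everything down, and drop the extra references
             * automatically when we leave this scope, i.e. after coldplug has
             * attached the replacement programs. */
            _cleanup_(manager_unpin_all_cgroup_bpf_programsp) Manager *pinned = NULL;
            int r;

            pinned = manager_pin_all_cgroup_bpf_programs(m);

            manager_clear_jobs_and_units(m);
            /* ... serialize, reload unit files, deserialize ... */

            r = manager_coldplug(m);
            if (r < 0)
                    return r;

            return 0;
    }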

@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch from 7486807 to f6d89a0 on October 30, 2020 19:05
@vcaputo (Member) commented Oct 30, 2020

Since I'm unfamiliar with this area of cgroups+bpf, I'm not going to comment/review beyond obvious programming errors, which seem to be addressed AFAICT.

@systemd systemd deleted a comment Oct 30, 2020
@keszybz keszybz added this to the v247 milestone Nov 3, 2020
@keszybz (Member) left a comment

Looks OK.

@poettering (Member) commented:

This doesn't look right. While it addresses "daemon-reload", it doesn't fix things on "daemon-reexec".

This needs to be solved differently: the bpf fds need to be serialized and deserialized. I.e. the serializing functions have the "FDSet" parameter that fds can be added to, which shall be passed across daemon-reexec/daemon-reload. The BPF attachment fds need to be added there, and their numbers serialized. Then, after coming back, the fds must be pulled out of the FDSet that the deserializing function gets handed.
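I.e. very roughly something like this (just a sketch reusing the existing serialize/fdset helpers; the serialization key name is made up):

    static int serialize_bpf_program_fd(FILE *f, FDSet *fds, const BPFProgram *p) {
            int copy;

            /* Duplicate the kernel fd into the FDSet that survives reexec... */
            copy = fdset_put_dup(fds, p->kernel_fd);
            if (copy < 0)
                    return copy;

            /* ...and record its number in the serialization stream. */
            return serialize_item_format(f, "cgroup-bpf-program-fd", "%i", copy);
    }

    static int deserialize_bpf_program_fd(FDSet *fds, const char *value) {
            int fd;

            if (safe_atoi(value, &fd) < 0 || fd < 0 || !fdset_contains(fds, fd))
                    return -EBADF;

            /* Take the fd out of the set so it isn't closed behind our back. */
            return fdset_remove(fds, fd);
    }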

@poettering poettering added the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Nov 4, 2020
@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch 3 times, most recently from 08cd2de to 5d208f9, on November 12, 2020 13:49
@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch 3 times, most recently from 9a7ddd4 to 0447b6c, on November 12, 2020 14:00
@cdown cdown requested a review from yuwata November 12, 2020 14:00
@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch from 0447b6c to 980dd3f on November 12, 2020 14:06
@cdown cdown removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Nov 12, 2020
@cdown cdown linked an issue Nov 12, 2020 that may be closed by this pull request
@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch 3 times, most recently from 4d0f02d to 22142e7, on November 13, 2020 01:57
@yuwata (Member) left a comment

I am not sure about the implementation details. Several nitpicky comments.

@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch from 22142e7 to 48e43ce on November 13, 2020 12:01
@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch from 4e3e475 to 1b1a2c9 on December 11, 2020 21:57
@cdown (Member, Author) commented Jan 7, 2021

> I agree that it should. But right now the pinning is only done in the path for reload, not reexec.

Hey @keszybz, in case it got lost, could you clarify this? I have this running on all FB servers, and BPF program pinning for reexec works just fine with the FD passing as far as I can see:

[root@ktst ~]# strace -Tfe trace=bpf -p 1  # during daemon-reexec
strace: Process 1 attached
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_DEVICE, insn_cnt=62, insns=0x557bb8092d70, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 25 <0.004330>
bpf(BPF_PROG_ATTACH, {target_fd=27, attach_bpf_fd=25, attach_type=BPF_CGROUP_DEVICE, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000067>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SKB, insn_cnt=8, insns=0x557bb8067f40, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 27 <0.000461>
bpf(BPF_PROG_ATTACH, {target_fd=49, attach_bpf_fd=27, attach_type=BPF_CGROUP_INET_EGRESS, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000060>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SKB, insn_cnt=8, insns=0x557bb80496c0, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 49 <0.000407>
bpf(BPF_PROG_ATTACH, {target_fd=50, attach_bpf_fd=49, attach_type=BPF_CGROUP_INET_INGRESS, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000060>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_DEVICE, insn_cnt=63, insns=0x557bb8093220, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 50 <0.005119>
bpf(BPF_PROG_ATTACH, {target_fd=54, attach_bpf_fd=50, attach_type=BPF_CGROUP_DEVICE, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000101>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SKB, insn_cnt=8, insns=0x557bb8003620, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 54 <0.000505>
bpf(BPF_PROG_ATTACH, {target_fd=55, attach_bpf_fd=54, attach_type=BPF_CGROUP_INET_EGRESS, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000065>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SKB, insn_cnt=8, insns=0x557bb8061fe0, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 55 <0.000430>
bpf(BPF_PROG_ATTACH, {target_fd=56, attach_bpf_fd=55, attach_type=BPF_CGROUP_INET_INGRESS, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000061>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_DEVICE, insn_cnt=87, insns=0x557bb80937a0, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 56 <0.008111>
bpf(BPF_PROG_ATTACH, {target_fd=57, attach_bpf_fd=56, attach_type=BPF_CGROUP_DEVICE, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000063>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SKB, insn_cnt=8, insns=0x557bb804b2f0, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 57 <0.000433>
bpf(BPF_PROG_ATTACH, {target_fd=58, attach_bpf_fd=57, attach_type=BPF_CGROUP_INET_EGRESS, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000082>
bpf(BPF_PROG_LOAD, {prog_type=BPF_PROG_TYPE_CGROUP_SKB, insn_cnt=8, insns=0x557bb804b1a0, license="GPL", log_level=0, log_size=0, log_buf=NULL, kern_version=KERNEL_VERSION(0, 0, 0), prog_flags=0, prog_name="", prog_ifindex=0, expected_attach_type=BPF_CGROUP_INET_INGRESS, prog_btf_fd=0, func_info_rec_size=0, func_info=NULL, func_info_cnt=0, line_info_rec_size=0, line_info=NULL, line_info_cnt=0, attach_btf_id=0, attach_prog_fd=0}, 120) = 58 <0.000466>
bpf(BPF_PROG_ATTACH, {target_fd=59, attach_bpf_fd=58, attach_type=BPF_CGROUP_INET_INGRESS, attach_flags=BPF_F_ALLOW_MULTI, replace_bpf_fd=0}, 120) = 0 <0.000070>
bpf(BPF_PROG_DETACH, {target_fd=59, attach_type=BPF_CGROUP_DEVICE}, 120) = 0 <0.000063>
bpf(BPF_PROG_DETACH, {target_fd=28, attach_type=BPF_CGROUP_INET_EGRESS}, 120) = 0 <0.000060>
bpf(BPF_PROG_DETACH, {target_fd=28, attach_type=BPF_CGROUP_INET_EGRESS}, 120) = 0 <0.000060>
bpf(BPF_PROG_DETACH, {target_fd=28, attach_type=BPF_CGROUP_INET_EGRESS}, 120) = 0 <0.000086>
bpf(BPF_PROG_DETACH, {target_fd=28, attach_type=BPF_CGROUP_INET_INGRESS}, 120) = 0 <0.000059>
bpf(BPF_PROG_DETACH, {target_fd=28, attach_type=BPF_CGROUP_INET_INGRESS}, 120) = 0 <0.000060>
bpf(BPF_PROG_DETACH, {target_fd=28, attach_type=BPF_CGROUP_DEVICE}, 120) = 0 <0.000062>
bpf(BPF_PROG_DETACH, {target_fd=26, attach_type=BPF_CGROUP_DEVICE}, 120) = 0 <0.000059>
bpf(BPF_PROG_DETACH, {target_fd=26, attach_type=BPF_CGROUP_INET_INGRESS}, 120) = 0 <0.000066>

@anitazha anitazha removed the reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks label Jan 14, 2021
@cdown cdown force-pushed the cdown/bpf_pin/2020-10-30 branch from 1b1a2c9 to 4e42210 on January 18, 2021 15:20
@cdown (Member, Author) commented Jan 19, 2021

@mrc0mmand Hey! My only guess is the Arch failure is due to this:

Jan 18 15:41:40 systemd-testsuite dbus-daemon[270]: ==270==LeakSanitizer has encountered a fatal error.
Jan 18 15:41:40 systemd-testsuite dbus-daemon[270]: ==270==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
Jan 18 15:41:40 systemd-testsuite dbus-daemon[270]: ==270==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

I don't know why the leak sanitiser should fire in dbus-daemon of all places, or why that should cause us to kill init, since (as far as I can tell...) this doesn't touch the code it inherits from libsystemd at all, but maybe the new behaviour tickles something in just the right way.

I'm still unable to repro, so any chance of running dbus.service with LSAN_OPTIONS=verbosity=1:log_threads=1 to see what's going on here? Thanks!

I also notice we already ignore dbus-daemon memory leaks in some other parts of the CI infra, although I don't know the history around that.
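
(Concretely, the kind of override I have in mind would be a drop-in along these lines -- path and filename just as an example:)

    # e.g. /etc/systemd/system/dbus.service.d/lsan.conf
    [Service]
    Environment=LSAN_OPTIONS=verbosity=1:log_threads=1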

@mrc0mmand (Member) commented:

> @mrc0mmand Hey! My only guess is the Arch failure is due to this:
>
> Jan 18 15:41:40 systemd-testsuite dbus-daemon[270]: ==270==LeakSanitizer has encountered a fatal error.
> Jan 18 15:41:40 systemd-testsuite dbus-daemon[270]: ==270==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
> Jan 18 15:41:40 systemd-testsuite dbus-daemon[270]: ==270==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)
>
> I don't know why the leak sanitiser should fire in dbus-daemon of all places, or why that should cause us to kill init, since (as far as I can tell...) this doesn't touch the code it inherits from libsystemd at all, but maybe the new behaviour tickles something in just the right way.
>
> I'm still unable to repro, so any chance of running dbus.service with LSAN_OPTIONS=verbosity=1:log_threads=1 to see what's going on here? Thanks!

You can safely ignore the dbus message (it happens in all runs when dbus is shutting down and it doesn't affect the job in any way, at least from what I've seen). However, I'll look into it anyway to get rid of the error.

The main culprit here is the kernel panic at the end:

[  OK  ] Reached target Power-Off.
[  457.730019] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[  457.732696] CPU: 0 PID: 1 Comm: systemd Not tainted 5.10.7-arch1-1 #1
[  457.732696] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[  457.732696] Call Trace:
[  457.732696]  dump_stack+0x6b/0x83
[  457.732696]  panic+0x112/0x2e8
[  457.732696]  do_exit.cold+0x2c/0xb3
[  457.732696]  do_group_exit+0x33/0xa0
[  457.732696]  __x64_sys_exit_group+0x14/0x20
[  457.732696]  do_syscall_64+0x33/0x40
[  457.732696]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  457.732696] RIP: 0033:0x7f6b4803e5ee
[  457.732696] Code: 00 0f 05 c3 0f 1f 84 00 00 00 00 00 b8 18 00 00 00 0f 05 c3 0f 1f 84 00 00 00 00 00 48 83 ec 08 48 63 ff b8 e7 00 00 00 0f 05 <67> e8 1c 71 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 c7 44 24
[  457.732696] RSP: 002b:00007ffdc5e56d90 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[  457.732696] RAX: ffffffffffffffda RBX: 00007f6b481d86b8 RCX: 00007f6b4803e5ee
[  457.732696] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
[  457.732696] RBP: 00007f6b481d86b8 R08: 00007f6b4151e56b R09: 00007ffdc5e56a00
[  457.732696] R10: 0000000000000043 R11: 0000000000000206 R12: 00000fed685fa048
[  457.732696] R13: 00007f6b42fd01f0 R14: 00007f6b42fd0240 R15: 00007f6b42fd01f0
[  457.732696] Kernel Offset: 0x10600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  457.732696] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---

Unfortunately, I have no idea how to debug this in some meaningful way, since neither the journal nor the console log show anything useful (but that may be just my inexperience with kernel debugging).

@cdown (Member, Author) commented Jan 19, 2021

Thanks. So far I've not been able to repro this panic under ASAN/UBSAN, although I assume systemd is dying from some sanitizer failure. I also don't see anything obvious from the code, although maybe it's more insidious. I'll keep trying though.

@mrc0mmand (Member) commented Jan 20, 2021

> Thanks. So far I've not been able to repro this panic under ASAN/UBSAN, although I assume systemd is dying from some sanitizer failure. I also don't see anything obvious from the code, although maybe it's more insidious. I'll keep trying though.

I played around a bit and disabled LSan completely for the dbus service. However, even with dbus.service shutting down correctly, the kernel panic is still there:

[  OK  ] Finished Load/Save Random Seed.
[    7.006899] sh[282]: + systemctl daemon-reload
[    7.838122] sh[282]: + echo OK
[  OK  ] Finished TEST-01-BASIC.
[  OK  ] Reached target Testsuite target.
         Starting End the test...
[    7.871130] sh[315]: + systemctl poweroff --no-block
[  OK  ] Stopped End the test.
[  OK  ] Removed slice system-modprobe.slice.
[  OK  ] Stopped target Testsuite target.
[  OK  ] Stopped target Multi-User System.
[  OK  ] Stopped target Login Prompts.
[  OK  ] Stopped target Timers.
[  OK  ] Stopped Daily Cleanup of Temporary Directories.
         Stopping D-Bus System Message Bus...
         Stopping Serial Getty on ttyS0...
         Stopping User Login Management...
         Stopping Load/Save Random Seed...
[  OK  ] Stopped D-Bus System Message Bus.
[  OK  ] Stopped Serial Getty on ttyS0.
[  OK  ] Removed slice system-serial\x2dgetty.slice.
         Stopping Permit User Sessions...
[  OK  ] Stopped User Login Management.
[  OK  ] Stopped Permit User Sessions.
[  OK  ] Stopped target Basic System.
[  OK  ] Stopped target Paths.
[  OK  ] Stopped target Slices.
[  OK  ] Removed slice User and Session Slice.
[  OK  ] Stopped target Sockets.
[  OK  ] Closed D-Bus System Message Bus Socket.
[  OK  ] Stopped target System Initialization.
[  OK  ] Stopped target Local Encrypted Volumes.
[  OK  ] Stopped Dispatch Password …ts to Console Directory Watch.
[  OK  ] Stopped Forward Password R…uests to Wall Directory Watch.
[  OK  ] Stopped target Local Verity Integrity Protected Volumes.
[  OK  ] Stopped Apply Kernel Variables.
[  OK  ] Stopped Update is Completed.
[  OK  ] Stopped Rebuild Dynamic Linker Cache.
         Stopping Update UTMP about System Boot/Shutdown...
[  OK  ] Stopped Load/Save Random Seed.
[  OK  ] Stopped Update UTMP about System Boot/Shutdown.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
[  OK  ] Stopped target Local File Systems (Pre).
         Unmounting Temporary Directory (/tmp)...
[  OK  ] Stopped Create Static Device Nodes in /dev.
[  OK  ] Stopped Create System Users.
[  OK  ] Stopped Remount Root and Kernel File Systems.
[  OK  ] Unmounted Temporary Directory (/tmp).
[  OK  ] Stopped target Swap.
[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Reached target Final Step.
[  OK  ] Finished Power-Off.
[  OK  ] Reached target Power-Off.
[  510.746284] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[  510.746960] CPU: 3 PID: 1 Comm: systemd Not tainted 5.10.7-arch1-1 #1
[  510.746960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[  510.746960] Call Trace:
[  510.746960]  dump_stack+0x6b/0x83
[  510.746960]  panic+0x112/0x2e8
[  510.746960]  do_exit.cold+0x2c/0xb3
[  510.746960]  do_group_exit+0x33/0xa0
[  510.746960]  __x64_sys_exit_group+0x14/0x20
[  510.746960]  do_syscall_64+0x33/0x40
[  510.746960]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  510.746960] RIP: 0033:0x7fb3aaea25ee
[  510.746960] Code: 00 0f 05 c3 0f 1f 84 00 00 00 00 00 b8 18 00 00 00 0f 05 c3 0f 1f 84 00 00 00 00 00 48 83 ec 08 48 63 ff b8 e7 00 00 00 0f 05 <67> e8 1c 71 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 c7 44 24
[  510.746960] RSP: 002b:00007ffd6f99ef90 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[  510.746960] RAX: ffffffffffffffda RBX: 00007fb3ab03c6b8 RCX: 00007fb3aaea25ee
[  510.746960] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
[  510.746960] RBP: 00007fb3ab03c6b8 R08: 00007fb3a435256b R09: 00007ffd6f99ec00
[  510.746960] R10: 0000000000000043 R11: 0000000000000206 R12: 00000ff674bcda48
[  510.746960] R13: 00007fb3a5e6d1f0 R14: 00007fb3a5e6d240 R15: 00007fb3a5e6d1f0
[  510.746960] Kernel Offset: 0xe00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  510.746960] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---
qemu-system-x86_64: terminating on signal 15 from pid 79410 (timeout)
E: test timed out after 1000 s

@mrc0mmand (Member) commented Jan 20, 2021

Hmpf, it looks like the kernel panic is caused by a watchdog timing out:

[  OK  ] Reached target Unmount All Filesystems.
[  OK  ] Reached target Final Step.
[  OK  ] Finished Power-Off.
[  OK  ] Reached target Power-Off.
[  146.978652] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Connection refused
[  216.978127] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  316.978167] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  396.978375] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  462.645989] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[  462.648636] CPU: 0 PID: 1 Comm: systemd Not tainted 5.10.7-arch1-1 #1
[  462.648636] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[  462.648636] Call Trace:
[  462.648636]  dump_stack+0x6b/0x83
[  462.648636]  panic+0x112/0x2e8
[  462.648636]  do_exit.cold+0x2c/0xb3
[  462.648636]  do_group_exit+0x33/0xa0
[  462.648636]  __x64_sys_exit_group+0x14/0x20
[  462.648636]  do_syscall_64+0x33/0x40
[  462.648636]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  462.648636] RIP: 0033:0x7f0ac3e265ee
[  462.648636] Code: 00 0f 05 c3 0f 1f 84 00 00 00 00 00 b8 18 00 00 00 0f 05 c3 0f 1f 84 00 00 00 00 00 48 83 ec 08 48 63 ff b8 e7 00 00 00 0f 05 <67> e8 1c 71 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 c7 44 24
[  462.648636] RSP: 002b:00007fffaabee950 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[  462.648636] RAX: ffffffffffffffda RBX: 00007f0ac3fc06b8 RCX: 00007f0ac3e265ee
[  462.648636] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
[  462.648636] RBP: 00007f0ac3fc06b8 R08: 00007f0abe28156b R09: 00007fffaabee5c0
[  462.648636] R10: 0000000000000043 R11: 0000000000000206 R12: 00000fe157db1048
[  462.648636] R13: 00007f0abed881f0 R14: 00007f0abed88240 R15: 00007f0abed881f0
[  462.648636] Kernel Offset: 0x16a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  462.648636] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---
[  OK  ] Reached target Shutdown.
[   95.187592] systemd[1]: final.target: starting held back, waiting for: umount.target
[   95.200709] systemd[1]: umount.target changed dead -> active
[   95.210242] systemd[1]: umount.target: Job 217 umount.target/start finished, result=done
[   95.224371] systemd[1]: Reached target Unmount All Filesystems.
[  OK  ] Reached target Unmount All Filesystems.
[   95.241724] systemd[1]: final.target changed dead -> active
[   95.249911] systemd[1]: final.target: Job 216 final.target/start finished, result=done
[   95.261941] systemd[1]: Reached target Final Step.
[  OK  ] Reached target Final Step.
[   95.275956] systemd[1]: systemd-poweroff.service: Changed dead -> start
[   95.289575] systemd[1]: systemd-poweroff.service: Succeeded.
[   95.299939] systemd[1]: systemd-poweroff.service: Service will not restart (restart setting)
[   95.314146] systemd[1]: systemd-poweroff.service: Changed start -> dead
[   95.327216] systemd[1]: systemd-poweroff.service: Job 136 systemd-poweroff.service/start finished, result=done
[   95.344233] systemd[1]: Finished Power-Off.
[  OK  ] Finished Power-Off.
[   95.358169] systemd[1]: Forcibly powering off: unit systemd-poweroff.service succeeded
[   95.358538] audit: type=1130 audit(1611431828.120:92): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-poweroff comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   95.372553] systemd[1]: poweroff.target changed dead -> active
[   95.399245] audit: type=1131 audit(1611431828.120:93): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-poweroff comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[   95.407186] systemd[1]: poweroff.target: Job 135 poweroff.target/start finished, result=done
[   95.447893] systemd[1]: Reached target Power-Off.
[  OK  ] Reached target Power-Off.
[   95.462562] systemd[1]: Shutting down.
[   95.493940] audit: type=1334 audit(1611431828.256:94): prog-id=20 op=UNLOAD
[   95.504388] audit: type=1334 audit(1611431828.256:95): prog-id=19 op=UNLOAD
[   95.555764] audit: type=1334 audit(1611431828.320:96): prog-id=22 op=UNLOAD
[   95.563093] systemd[1]: systemd-journald.service: Releasing resources.
[   95.567240] audit: type=1334 audit(1611431828.320:97): prog-id=21 op=UNLOAD
[   95.577412] systemd[1]: systemd-journald.service: Releasing all stored fds
[  154.937590] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Connection refused
[  154.952996] systemd-journald[227]: Event source n/a (type io) returned error, disabling: Connection refused
[  224.937099] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  224.955917] systemd-journald[227]: Event source n/a (type io) returned error, disabling: Transport endpoint is not connected
[  324.937143] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  324.962773] systemd-journald[227]: Event source n/a (type io) returned error, disabling: Transport endpoint is not connected
[  404.937281] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  404.956226] systemd-journald[227]: Event source n/a (type io) returned error, disabling: Transport endpoint is not connected
[  514.937139] systemd-journald[227]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
[  514.955556] systemd-journald[227]: Event source n/a (type io) returned error, disabling: Transport endpoint is not connected
[  522.679258] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
[  522.681922] CPU: 0 PID: 1 Comm: systemd Not tainted 5.10.9-arch1-1 #1
[  522.681922] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[  522.681922] Call Trace:
[  522.681922]  dump_stack+0x6b/0x83
[  522.681922]  panic+0x112/0x2e8
[  522.681922]  do_exit.cold+0x2c/0xb3
[  522.681922]  do_group_exit+0x33/0xa0
[  522.681922]  __x64_sys_exit_group+0x14/0x20
[  522.681922]  do_syscall_64+0x33/0x40
[  522.681922]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  522.681922] RIP: 0033:0x7f8126ad65ee
[  522.681922] Code: 00 0f 05 c3 0f 1f 84 00 00 00 00 00 b8 18 00 00 00 0f 05 c3 0f 1f 84 00 00 00 00 00 48 83 ec 08 48 63 ff b8 e7 00 00 00 0f 05 <67> e8 1c 71 00 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 48 c7 44 24
[  522.681922] RSP: 002b:00007ffecc6d2290 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[  522.681922] RAX: ffffffffffffffda RBX: 00007f8126c706b8 RCX: 00007f8126ad65ee
[  522.681922] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000001
[  522.681922] RBP: 00007f8126c706b8 R08: 00007f811ebee56b R09: 00007ffecc6d1f00
[  522.681922] R10: 0000000000000043 R11: 0000000000000206 R12: 00000ff024346048
[  522.681922] R13: 00007f8121a301f0 R14: 00007f8121a30240 R15: 00007f8121a301f0
[  522.681922] Kernel Offset: 0x34400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  522.681922] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100 ]---
qemu-system-x86_64: terminating on signal 15 from pid 92294 (timeout)
E: test timed out after 600 s

Base automatically changed from master to main January 21, 2021 11:54
Comment on lines +484 to +487
Manager* manager_pin_all_cgroup_bpf_programs(Manager *m);
void manager_unpin_all_cgroup_bpf_programs(Manager *m);
DEFINE_TRIVIAL_CLEANUP_FUNC(Manager*, manager_unpin_all_cgroup_bpf_programs);

Contributor commented:

I'm wondering if *_pin_* is the best naming here. BPF has a pinning concept that is different from the one used here.

I got confused at the beginning because I thought manager_pin_all_cgroup_bpf_programs was pinning BPF objects to the BPF filesystem (bpffs), but it's doing a completely different thing.
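
For contrast, "pinning" in the BPF sense attaches a program to a path on bpffs so it keeps existing even after every fd referring to it is closed -- roughly like this libbpf sketch (path is arbitrary) -- whereas the helper here only takes an in-memory reference so the fd stays open across the reload:

    #include <bpf/bpf.h>

    /* Illustrative only: BPF-style pinning gives the program a name on bpffs. */
    static int pin_prog_to_bpffs(int prog_fd) {
            return bpf_obj_pin(prog_fd, "/sys/fs/bpf/example_prog");
    }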

@Werkov (Contributor) left a comment

Detaching the programs on reload really looks like a bug (assuming the programs are meant for restrictions). The proposed approach looks better, although it'd be tilted a bit to the other side -- there'll be a short period of time when the cgroup is more restricted, until the old prog is detached. If only there were a way to swap a prog -- BPF_F_REPLACE maybe? (Note this suggestion isn't part of the requested changes; those are in the inline comments.)
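
For reference, the atomic swap would look roughly like this at the syscall level (sketch only; BPF_F_REPLACE needs kernel >= 5.6 and is only valid together with BPF_F_ALLOW_MULTI):

    #include <linux/bpf.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int swap_cgroup_device_prog(int cgroup_fd, int old_prog_fd, int new_prog_fd) {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.target_fd = cgroup_fd;
            attr.attach_bpf_fd = new_prog_fd;
            attr.attach_type = BPF_CGROUP_DEVICE;
            attr.attach_flags = BPF_F_ALLOW_MULTI | BPF_F_REPLACE;
            attr.replace_bpf_fd = old_prog_fd;

            /* Replaces old_prog_fd with new_prog_fd in one step, so there is
             * no window with either no program or both programs attached. */
            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
    }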

continue;
}

bpf_program_ref(p);
Contributor commented:

The reference should only be taken on actual insertion into the set, shouldn't it?
(I guess this works for now, because one BPFProgram can only be attached to a single unit, but I wouldn't rely on it here.)
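
I.e. something along these lines (sketch; the set name is a placeholder, and it assumes a set_ensure_put()-style helper that returns > 0 only when the entry was actually added):

    r = set_ensure_put(&m->pinned_bpf_programs, NULL, p);
    if (r < 0)
            return r;
    if (r > 0)
            bpf_program_ref(p);   /* take the ref only the first time we see it */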

continue;
}

bpf_program_ref(p);
Contributor commented:

The reference should only be taken on actual insertion into the set, shouldn't it?

return 0;
}

void bpf_program_skeletonize(BPFProgram *p) {
@Werkov (Contributor) commented Mar 3, 2021

(LOL, the name :-)
IIUC, this is a subset of bpf_program_free. Wouldn't it be more maintainable to call this function from bpf_program_free instead of repeating the code?
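
I.e. roughly (sketch only; the remaining cleanup is elided and the body is illustrative, not the code in this branch):

    BPFProgram *bpf_program_free(BPFProgram *p) {
            if (!p)
                    return NULL;

            /* Shared part: whatever bpf_program_skeletonize() releases. */
            bpf_program_skeletonize(p);

            /* ... free the fields that only the full destructor owns ... */

            return mfree(p);
    }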


-flags = (supported == BPF_FIREWALL_SUPPORTED_WITH_MULTI &&
-        (u->type == UNIT_SLICE || unit_cgroup_delegate(u))) ? BPF_F_ALLOW_MULTI : 0;
+flags = (supported == BPF_FIREWALL_SUPPORTED_WITH_MULTI) ? BPF_F_ALLOW_MULTI : 0;
Contributor commented:

I think this deserves a comment like "go with BPF_F_ALLOW_MULTI because we may encounter a program from before the reload".
BTW, I think this should also be mentioned in NEWS (and the commit log): the minimum kernel version becomes 4.15, otherwise daemon-reload wouldn't refresh a once-assigned program.

Actually, when I think about it further, it makes BPF_FIREWALL_SUPPORTED_WITH_MULTI useless (pick one: daemon-reload or attached up-to-date BPF progs). Or am I too pessimistic here?
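
Something like this is what I mean (sketch of the suggested comment on the new line):

    /* Attach with BPF_F_ALLOW_MULTI whenever the kernel supports it (4.15+):
     * during daemon-reload/daemon-reexec the pinned program from before the
     * reload may still be attached to the same cgroup, and ALLOW_MULTI is
     * what lets the old and new program coexist for that short window. */
    flags = (supported == BPF_FIREWALL_SUPPORTED_WITH_MULTI) ? BPF_F_ALLOW_MULTI : 0;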

@cdown (Member, Author) commented Mar 3, 2021

Thanks @mauriciovasquezbernal and @Werkov for the feedback -- I'm a bit stuck working on some other stuff at the moment but will rebase and come back to this in due course :-)

@bluca bluca modified the milestones: v248, v249 Mar 12, 2021
@bluca (Member) commented Mar 12, 2021

Pushing forward to the v249 milestone, given we are already at the -rc3 stage, so it's a bit late for full features. We should really sort out the BPF story in v249 though :-)

@poettering poettering modified the milestones: v249, v250 Jun 1, 2021
poettering added a commit to poettering/systemd that referenced this pull request Jun 8, 2021
poettering added a commit to poettering/systemd that referenced this pull request Jun 8, 2021
poettering added a commit to poettering/systemd that referenced this pull request Jun 8, 2021
@poettering (Member) commented:

Closing, as #19851 was merged.

@poettering poettering closed this Jun 9, 2021
dakr pushed a commit to dakr/systemd that referenced this pull request Jun 14, 2021

Labels

bpf, cgroups, ci-fails/needs-rework 🔥 (Please rework this, the CI noticed an issue with the PR), pid1

Development

Successfully merging this pull request may close these issues.

DevicePolicy violated on cgroup v2 during daemon-reload