bpf: pid1: Pin reference to BPF programs for post-coldplug #17495
Conversation
Since I'm unfamiliar with this area of cgroups+bpf, I'm not going to comment/review beyond obvious programming errors, which seem to be addressed AFAICT.
keszybz
left a comment
Looks OK.
This doesn't look right. While it addresses daemon-reload, it doesn't fix things on daemon-reexec. This needs to be solved differently: the bpf fds need to be serialized and deserialized, i.e. the serializing functions have the "FDSet" parameter that fds can be added to, which shall be passed across the daemon-reexec/daemon-reload. The BPF attachment fds need to be added there, and their numbers serialized. Then, after coming back, the fds must be pulled out of the FDSet the deserializing functions get handed in.
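The FDSet mechanism described here boils down to keeping file descriptors open across an exec() of the daemon and handing their numbers to the new instance. A minimal sketch of that principle, with a hypothetical helper name rather than systemd's actual FDSet API:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not the real systemd API): clear FD_CLOEXEC so the
 * descriptor survives a re-exec of the daemon. systemd's FDSet rests on
 * this principle: fds added to the set during serialization stay open
 * across daemon-reexec, their numbers are written to the serialization
 * stream, and the deserializer pulls them back out on the other side. */
static int fd_keep_across_exec(int fd) {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0)
                return -1;
        return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

Without this step, a BPF program or cgroup attachment fd opened with O_CLOEXEC would be closed by the kernel at re-exec time, releasing the last reference and detaching the program.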
yuwata
left a comment
I am not sure about the implementation details. Several nitpicky comments.
Hey @keszybz, in case it got lost, could you clarify this? I have this running on all FB servers and BPF program pinning for reexec works just fine with the FD passing as far as I can see:
During `daemon-reload` and `daemon-reexec`, we detach and reattach all
BPF programs attached to cgroups. This, however, poses a real practical
problem for DevicePolicy (and some other settings using BPF): it
presents a period of time where the old device filtering BPF program has
been unloaded, but the new one has not been loaded yet.
Since the filtering is at open() time, it has become apparent that
there's a non-trivial period where applications inside that ostensibly
filtered cgroup can grab any device -- and often do so -- and then
retain access to that device even after the reload is over. Due to the
file continuing to be available after the initial open(), this issue is
particularly visible for DevicePolicy={strict,closed}, however it also
applies to other BPF programs we install.
In particular, for BPF ingress/egress filtering this may have more
concerning implications: network traffic which is supposed to be
filtered will -- for a very brief period of time -- not be filtered or
subject to any restrictions imposed by BPF.
These BPF programs are fundamentally attached to a cgroup lifetime, not
our unit lifetime, so it's enough to pin these programs by taking a
reference to affected BPF programs before reload/reexec. We can then
serialise the program's kernel-facing FD and cgroup attachment FD for
the new daemon, and have the daemon on the other side unpin the programs
after it's finished with coldplug.
That means that, for example, the BPF program lifecycle during
daemon-reload or daemon-reexec changes from this:
manager_clear_jobs_and_units
│
╔══════╪═════════╤═══════╗
║ prog │ no prog │ prog' ║
╚══════╧═════════╪═══════╝
│
manager_coldplug
to this:
manager_clear_jobs_and_units manager_dispatch_cgroup_realize_queue
│ │
╔══════╪═══════════════╤═══════════════════════╪═══════╗
║ prog │ prog (orphan) │ prog (orphan) + prog' │ prog' ║
╚══════╧═══════════════╪═══════════════════════╧═══════╝
│
manager_coldplug
For daemon-reexec the semantics are mostly the same, but the point at
which the program becomes orphan is tied to the process lifecycle
instead.
None of the BPF programs we install require exclusive access, so having
multiple instances of them running at the same time is fine. Custom
programs, of course, are unknown, but it's hard to imagine legitimate
cases which should be affected, whereas the benefits of this "overlap"
approach with reference pinning are immediately tangible.
[keszybz: use _cleanup_ for unpin, use FOREACH_POINTER]
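The pin/unpin lifecycle the commit message describes can be sketched with a plain reference count. The names below are hypothetical stand-ins, not the real BPFProgram API; in systemd the object additionally carries the kernel-facing program fd and the cgroup attachment fd, and detachment happens when the last reference goes away:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for BPFProgram: the program stays attached while n_ref > 0. */
typedef struct Prog {
        unsigned n_ref;
} Prog;

static Prog *prog_new(void) {
        Prog *p = calloc(1, sizeof(Prog));
        if (p)
                p->n_ref = 1;
        return p;
}

/* "Pinning" before reload/reexec is just taking an extra reference. */
static Prog *prog_ref(Prog *p) {
        if (p)
                p->n_ref++;
        return p;
}

/* Returns NULL once the last reference is dropped; in the real code this
 * is the point where the program would be detached from the cgroup. */
static Prog *prog_unref(Prog *p) {
        if (!p)
                return NULL;
        if (--p->n_ref > 0)
                return p;
        free(p);
        return NULL;
}
```

The old manager takes the extra reference before clearing units, and the new manager drops it only after coldplug has attached the replacement program, which is what produces the "prog (orphan) + prog'" overlap in the diagram above.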
@mrc0mmand Hey! My only guess is the Arch failure is due to this: I don't know why the leak sanitiser should fire in dbus-daemon of all places, or why that should cause us to kill init, since (as far as I can tell...) this doesn't touch the code it inherits from libsystemd at all, but maybe the new behaviour tickles something in just the right way. I'm still unable to repro, so any chance of running
I also notice we already ignore dbus-daemon memory leaks in some other parts of the CI infra, although I don't know the history around that.

You can safely ignore the dbus message (it happens in all runs when dbus is shutting down and it doesn't affect the job in any way, at least from what I've seen). However, I'll look into it anyway to get rid of the error. The main culprit here is the kernel panic at the end: Unfortunately, I have no idea how to debug this in any meaningful way, since neither the journal nor the console log shows anything useful (but that may be just my inexperience with kernel debugging).

Thanks. So far I've not been able to repro this panic under ASAN/UBSAN, although I assume systemd is dying from some sanitizer failure. I also don't see anything obvious in the code, although maybe it's more insidious. I'll keep trying though.

I played around a bit and disabled LSan completely for the dbus service. However, even with dbus.service shutting down correctly, the kernel panic is still there:

Hmpf, it looks like the kernel panic is caused by a watchdog timing out:
Manager* manager_pin_all_cgroup_bpf_programs(Manager *m);
void manager_unpin_all_cgroup_bpf_programs(Manager *m);
DEFINE_TRIVIAL_CLEANUP_FUNC(Manager*, manager_unpin_all_cgroup_bpf_programs);
I'm wondering if *_pin_* is the best naming here. BPF has a pinning concept that is different from the one used here.
I got confused at the beginning because I thought manager_pin_all_cgroup_bpf_programs was pinning BPF objects to the BPF File System but it's doing a completely different thing.
Werkov
left a comment
Detaching the programs on reload really looks like a bug (assuming the programs are meant for restrictions). The proposed approach looks better, although it'd be tilted a bit to the other side -- there'll be a short period of time when the cgroup is more restricted, until the old prog is detached. If only there was a way to swap a prog atomically, BPF_F_REPLACE maybe? (Note this suggestion isn't part of the requested changes; those are for the inline comments.)
        continue;
}

bpf_program_ref(p);
The reference should only be taken on actual insertion into the set, shouldn't it?
(I guess this works so far, because one BPFProgram can only be attached to a single unit, but I wouldn't rely on it here.)
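The pattern the review asks for is easy to get right when the set's insert call reports whether an insertion actually happened. Below is a toy set sketching that contract (a positive return on insertion, 0 when the pointer was already present); this mirrors how systemd's real set_put() behaves, but the implementation here is purely illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define SET_MAX 16

/* Toy pointer set, fixed capacity, no hashing: enough to show the contract. */
typedef struct Set {
        void *items[SET_MAX];
        size_t n;
} Set;

/* Returns 1 when the pointer was inserted, 0 when already present,
 * -1 when full. The caller takes a reference only on a 1 return. */
static int set_put(Set *s, void *p) {
        for (size_t i = 0; i < s->n; i++)
                if (s->items[i] == p)
                        return 0;        /* already present: no new ref */
        if (s->n >= SET_MAX)
                return -1;
        s->items[s->n++] = p;
        return 1;                        /* inserted: ref it now */
}
```

With that contract the call site becomes `r = set_put(s, p); if (r < 0) return r; if (r > 0) bpf_program_ref(p);`, so a program reachable from two places can never be double-referenced by the pinning pass.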
        return 0;
}

void bpf_program_skeletonize(BPFProgram *p) {
(LOL, the name :-)
IIUC, this is a subset of bpf_program_free. Wouldn't it be more maintainable to call this function from bpf_program_free instead of repeating the code?
flags = (supported == BPF_FIREWALL_SUPPORTED_WITH_MULTI &&
         (u->type == UNIT_SLICE || unit_cgroup_delegate(u))) ? BPF_F_ALLOW_MULTI : 0;
flags = (supported == BPF_FIREWALL_SUPPORTED_WITH_MULTI) ? BPF_F_ALLOW_MULTI : 0;
I think this deserves a comment like "go with BPF_F_ALLOW_MULTI because we may encounter a program from before reload".
BTW I think this should also note in NEWS (and the commit log) that the minimum kernel version is 4.15, otherwise daemon-reload wouldn't refresh a once-assigned program.
Actually, when I think about it further, it makes BPF_FIREWALL_SUPPORTED_WITH_MULTI useless (pick one: daemon-reload or attached up-to-date BPF progs). Or am I too pessimistic here?
Thanks @mauriciovasquezbernal and @Werkov for the feedback -- I'm a bit stuck working on some other stuff at the moment but will rebase and come back to this in due course :-)

Pushing forward to the v249 milestone, given we are already at the -rc3 stage, so a bit late for full features. We should really sort the BPF story out in v249 though :-)

Closing, as #19851 was merged.