add simple eBPF-based per-unit IP access lists and accounting #6764

poettering · 2017-09-07T18:19:21Z

This adds simple eBPF-based based per-unit IP access lists and accounting. It makes use of the ebpf/cgroup feature of the unified cgroup hierarchy in new kernels.

This is mostly a heavily extended version of @zonque's original userspace work, updated to the kernel APIs ultimately merged into the Linux kernel.

Here's the gist how you use it. Let's start an in interactive shell as transient unit, with IP accounting enabled:

# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u1665.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4#

Now, let's have a look at the unit we just started, from inside of it:

sh-4.4# systemctl status run-u1665.service
● run-u1665.service - /bin/sh
   Loaded: loaded (/run/systemd/transient/run-u1665.service; transient; vendor preset: disabled)
Transient: yes
   Active: active (running) since Thu 2017-09-07 20:11:58 CEST; 1min 33s ago
 Main PID: 15859 (sh)
       IP: 0B in, 0B out
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/run-u1665.service
           ├─15859 /bin/sh
           ├─15878 systemctl status run-u1665.service
           └─15879 less

Sep 07 20:11:58 sigma systemd[1]: Started /bin/sh.
sh-4.4#

Note the new status lines IP:, showing that so far no traffic was generated. Let's change that, by generating some traffic from inside the unit:

sh-4.4# ping localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=2 ttl=64 time=0.062 ms
^C
--- localhost.localdomain ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1012ms
rtt min/avg/max/mdev = 0.042/0.052/0.062/0.010 ms

sh-4.4# systemctl status run-u1665.service
● run-u1665.service - /bin/sh
   Loaded: loaded (/run/systemd/transient/run-u1665.service; transient; vendor preset: disabled)
Transient: yes
   Active: active (running) since Thu 2017-09-07 20:11:58 CEST; 3min 22s ago
 Main PID: 15859 (sh)
       IP: 168B in, 168B out
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/run-u1665.service
           ├─15859 /bin/sh
           ├─15932 systemctl status run-u1665.service
           └─15933 less

Sep 07 20:11:58 sigma systemd[1]: Started /bin/sh.
sh-4.4#

As expected, the IP: status line shows we generated some traffic. Now, let's see how the access list stuff works. Let's start from scratch:

# systemd-run -p IPAddressDeny=127.0.0.2 -t /bin/sh
Running as unit: run-u1720.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4#

This shell now can access all IP servers, except for 127.0.0.2 (which is on the local loopback). Let's try:

sh-4.4# ping -c 1 -W 5 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.091 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.091/0.091/0.091/0.000 ms
sh-4.4# ping -c 1 -W 5 127.0.0.2
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 127.0.0.2 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping -c 1 -W 5 127.0.0.3
PING 127.0.0.3 (127.0.0.3) 56(84) bytes of data.
64 bytes from 127.0.0.3: icmp_seq=1 ttl=64 time=0.073 ms

--- 127.0.0.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.073/0.073/0.073/0.000 ms
sh-4.4#

As expected, traffic to 127.0.0.1 and 127.0.0.3 still works, but traffic to 127.0.0.2 doesn't! Yay!

keszybz

So... I'm boarding a plane to Berlin now, so I won't be able to finish the review right now. But it's clear that significant reworking will be needed anyway. I'll look at the rest of the code in the afternoon.

keszybz · 2017-09-09T08:41:32Z

src/basic/bpf-program.c

+
+        assert(p);
+
+        if (!GREEDY_REALLOC(p->instructions, p->allocated, (p->n_instructions + count) * sizeof(struct bpf_insn)))


GREEDY_REALLOC already multiplies by the size of each item.

keszybz · 2017-09-09T08:53:56Z

src/basic/bpf-program.h

+
+DEFINE_TRIVIAL_CLEANUP_FUNC(BPFProgram*, bpf_program_unref);
+
+/* The following is stolen from the kernel tree, samples/bpf/libbpf.h, which is under GPL2.0 */


I'm sorry, but everything below in this file will need to be dropped, and independently coded.

well, it's kinda an API, we define it in a .h header, and this is normally thought not to infect things license-wise... it doesn't encode algorithms, is too trivial for that... I mean, specifying the origin and its license should be fine

keszybz · 2017-09-09T08:57:19Z

src/shared/linux/bpf.h

+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */


Likewise, this file cannot be present in the systemd sources. If we use a public header provided by the system, that's different. But including it in our sources is not possible w/o bumping our own license to full GPL.

naah... we don't have to bump anything. As mentioned above I don#t think this is a real problem at all. But if indeed the header's copyright would matter, then we could still license our stuff as LGPL, though effectively it would be downgraded to GPL when you actually run it. But GPL vs LGPL only matters during runtime really, not in what we include in our tarball.

Or in other words, as long as we tell people that if they turn on the BPF stuff PID 1 will be effectively GPL it doesn't really matter at all...

I mean, quite frankly I wouldn't even have a problem at all with systemd being GPL rather than LGPL. The only reason we care so far, is that libsystemd should be linkable by non-free code, and we want the ability to copy/paste around our own code between libsystemd and PID 1 and other components freely without constantly thinking about the license...

IANAL, of course, but i don't think this is really a problem, at all. This is not code after all, just APIs...

(and btw, some components of systemd are GPL rather than LPGL anyway, src/basic/ioprio.h, src/basic/securebits.h, and some stuff of udev iirc. That's why we ship LICENSE.GPL2 in the repo after all)

topimiettinen

Looks good to me, but perhaps this could be committed separately. Is this needed by other patches?
edit: this applies to 8e0e6ec, sorry.

topimiettinen

Also this could be useful regardless of the rest of patches.
edit: this applies to df1888c.

topimiettinen · 2017-09-09T13:03:40Z

src/core/dbus-cgroup.c

+                                return r;
+                        if (r == 0)
+                                log_warning("Transient unit %s configures an IP firewall, but the local system does not support BPF/cgroup firewalling. "
+                                            "Proceeding WITHOUT firewalling in effect!", u->id);


Perhaps this should cause the unit to fail instead? For example, a service which should only accept requests from localnet would be exposed to more hostile networks. Though it would mean that the unit file would need editing when used on hosts without BPF/cgroup firewalling.

Well, the reason I explicitly chose not to make this fatal is that I'd like to ship all our services which do not require internet access (which are effectively all except for the journal remoting stuff), with either IPAddressDeny=any alone, or IPAddressDeny=any + IPAddressAllow=localhost, so that no IP communication is possible at all, or only with 127.0.0.1. Now, if we do set this, then we don't want to break the services completely on kernels that don't support eBPF/cgroup stuff, and thus I just added the loud warning.

Note that IPAddressDeny=any is in many ways a better option that PrivateNetwork= even though it has very similar effects. That's because PrivateNetwork= affects AF_NETLINK and AF_UNIX (at least the abstract namespace), and that's pretty awful, as it breaks libudev and such. IPAddressDeny= is hence a great option for everything that needs libudev but still wants to prohibit networking.

topimiettinen · 2017-09-09T13:04:40Z

src/core/ip-address-access.c

+        if (r == 0)
+                log_warning("File %s:%u configures an IP firewall (%s=%s), but the local system does not support BPF/cgroup based firewalling. "
+                            "Proceeding WITHOUT firewalling in effect!", filename, line, lvalue, rvalue);
+


topimiettinen · 2017-09-09T13:26:59Z

src/core/bpf-firewall.c

+                struct bpf_insn insn[] = {
+                        /*
+                         * If R0 == 0, the packet will be denied; skip the accounting instructions in this case.
+                         * The jump label will be fixed up later.


Would be nice to keep a separate counter of the denied packets too.

Sure, but let's maybe leave that for a later PR.

topimiettinen · 2017-09-09T13:35:13Z

mkosi.build

 [ -f "$BUILDDIR"/build.ninja ] || meson "$BUILDDIR"
 ninja -C "$BUILDDIR" all
-[ "$WITH_TESTS" = 0 ] || ninja -C "$BUILDDIR" test
+[ "$WITH_TESTS" = 0 ] || ninja -C "$BUILDDIR" test || ( RET="$?" ; cat "$BUILDDIR"/meson-logs/testlog.txt ; exit "$RET" )


Maybe commit separately?

topimiettinen · 2017-09-09T14:05:37Z

Overall very interesting feature. The filtering is at pretty basic level as only addresses can be used, not ports, packet flags, SELinux labels, connection states, interfaces etc. but perhaps this can be extended in the future (in moderation).

For example, systemd-timesyncd should be able to send UDP packets to NTP servers' ports but not connect with TCP to IRC botnets. The IP addresses of the servers may not be known in advance (and the current filter syntax does not allow name resolution to be used), so it's not possible to specify an address based filter either.

topimiettinen · 2017-09-09T14:11:12Z

man/systemd.resource-control.xml

+
+          <para>In order to implement a whitelisting IP firewall, it is recommended to use a system-wide
+          <varname>IPAddressDeny=</varname><constant>any</constant> setting, plus individual per-unit
+          <varname>IPAddressAllow=</varname> lines permitting network access to relevant services, and only them.</para>


In other places the user is warned about kernel feature availability:
Note that the implementation of this setting might be impossible (for example if user namespaces are not available), and the unit should be written in a way that does not solely rely on this setting for security.

will add such a comment

poettering · 2017-09-09T15:27:44Z

Looks good to me, but perhaps this could be committed separately. Is this needed by other patches?
edit: this applies to 8e0e6ec, sorry.

yes, test-bpf.c needs that to work

poettering · 2017-09-09T15:29:11Z

Also this could be useful regardless of the rest of patches.
edit: this applies to df1888c.

Well, sure, but it also is necessary to make test-bpf.c

That said, even if it wasn't: we don't really insist on having strictly one separate PR for each topic, we permit some room to include not super-strictly related stuff in other PRs...)

poettering · 2017-09-09T15:40:40Z

Overall very interesting feature. The filtering is at pretty basic level as only addresses can be used, not ports, packet flags, SELinux labels, connection states, interfaces etc. but perhaps this can be extended in the future (in moderation).

For example, systemd-timesyncd should be able to send UDP packets to NTP servers' ports but not connect with TCP to IRC botnets. The IP addresses of the servers may not be known in advance (and the current filter syntax does not allow name resolution to be used), so it's not possible to specify an address based filter either.

I really don't want to create a second iptables here, hence we should be really careful with adding too many finegrained controls. One of the major benefits of doing this kind of service-level firewalling is that we know more about the context of the traffic than iptables ever knows: we know that nginx.service is about HTTP and ntpd.service about NTP, hence we don't have to derive application context from the port number necessarily. This is particular great for protocols such as bittorrent where the port number is effectively random.

But yeah, it certainly is in the cards to extend this a bit later, but we should keep a focus on keeping this simple, and not make it a fully generic iptables-reeimplementation.

I am very much against doing name lookups for IP ACL configuration. DNS is relatively dynamic, and requires networking to work (except in exceptional cases), and doing networking in order to restrict networking is kinda contradictory...

keszybz

The patches look great. I found a few nitpicks here and there, mostly in docs, but nothing important. I looked over the BPF code, but it'd be good if somebody who knows BPF well looked at it.

With the headers — let me think a bit / read up on that.

keszybz · 2017-09-09T18:00:06Z

src/basic/cgroup-util.c

+                const char *controller,
+                const char *path,
+                const char *event,
+                char **val){


Missing space before brace.

keszybz · 2017-09-09T18:29:18Z

src/test/test-bpf.c

+
+        while (!IN_SET(SERVICE(u)->state, SERVICE_DEAD, SERVICE_FAILED)) {
+                assert_se(sd_event_run(m->event, UINT64_MAX) >= 0);
+        }


Unnecessary braces.

keszybz · 2017-09-09T19:14:31Z

man/systemd-system.conf.xml

+        <listitem><para>Configure system-wide access lists of IPv4 or IPv6 addresses (each with optional prefix lengths) that
+        is applied to all units on the system. These settings operate similar to the per-unit options of the same name,
+        see
+        <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>. Note


Please drop the comma before and. (Alternatively, the comma can stay, but then another one must be added before "are". But I think it's better to drop it.)

I also think that duplicating the semantics of Deny/Allow here is unnecessary. It'd be better to just say:

Configure system-wide access lists of IPv4 or IPv6 addresses (each with optional prefix lengths) that is applied to all units in the system. These settings are similar to the per-unit options of the same name, see <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>. The lists configured with the global options and the lists configured per-unit are combined when applied, so the latter extend the former. By default global lists are empty, allowing access from all hosts. For details on the expected parameter syntax and how the allow/deny lists are applied, see the per-unit settings documentation linked above.

The other description is better, and repeated stuff tends to get out of sync...

keszybz · 2017-09-09T19:25:18Z

man/systemd.resource-control.xml

+        <listitem>
+          <para>Takes a boolean argument. If true, turns on IPv4 and IPv6 network traffic accounting for packets sent
+          or received by the unit. When this option is turned on, all IPv4 and IPv6 sockets created by any process of
+          the unit are accounted for. When this option is used in socket units, it applies to all IPv4 and IPv6


IPv4 and IPv6 traffic

keszybz · 2017-09-09T19:26:07Z

man/systemd.resource-control.xml

+          <para>Takes a boolean argument. If true, turns on IPv4 and IPv6 network traffic accounting for packets sent
+          or received by the unit. When this option is turned on, all IPv4 and IPv6 sockets created by any process of
+          the unit are accounted for. When this option is used in socket units, it applies to all IPv4 and IPv6
+          associated with it (including both listening and connection sockets where this applies). Note that for


~~collected~~

keszybz · 2017-09-09T19:37:06Z

src/core/dbus-cgroup.c

+                        if (r < 0)
+                                return r;
+
+                        if (an != FAMILY_ADDRESS_SIZE(family))


It'd be nice to include the size/family in the error message... Just a suggestion.

keszybz · 2017-09-09T19:39:50Z

src/core/dbus-cgroup.c

+                        size_t size = 0;
+
+                        if (n == 0) {
+                                while (*list)


OK, so an empty list resets.

keszybz · 2017-09-09T19:43:35Z

src/core/ip-address-access.c

+        }
+
+        return first;
+}


Hmm, that's somewhat tricky code. It'd be nice to add a test for this also. (And if not a test, then maybe just a TODO entry ;)).

there's a test actually for this already, in test-bpf, though it's not obvious to see... we create a redundant ACL there which should be reduced to two entries... I'll add a comment clarifying that

keszybz · 2017-09-09T19:49:48Z

src/core/dbus-cgroup.c

+                        if (r < 0)
+                                return r;
+                        if (r == 0)
+                                log_warning("Transient unit %s configures an IP firewall, but the local system does not support BPF/cgroup firewalling. "


I'd put a newline between the sentences. We handle mulitline messages just fine and it's easier to read.

keszybz · 2017-09-09T19:50:29Z

src/core/ip-address-access.c

+                return r;
+        if (r == 0)
+                log_warning("File %s:%u configures an IP firewall (%s=%s), but the local system does not support BPF/cgroup based firewalling. "
+                            "Proceeding WITHOUT firewalling in effect!", filename, line, lvalue, rvalue);


topimiettinen · 2017-09-10T06:19:52Z

@poettering I don't think replicating iptables makes sense either. But IMHO the current level of functionality (address filters) is bit too simple to make much difference, I'd add protocol/port support to make it more useful.

Extending the functionality could take a different approach compared to iptables. For example, instead of adding more and more detailed filtering options, how about higher level switches like boolean IPDropUnusualPackets for filtering packets with exotic options?

keszybz · 2017-09-11T07:51:49Z

On licensing:

It is OK to use the (GPLv2-only) headers from the kernel in our code (LGPL). Ref.: http://lkml.iu.edu/hypermail/linux/kernel/0301.1/0362.html,

using structure definitions, typedefs, enumeration constants, macros with simple bodies, etc., is NOT enough to make a derivative work. It would take a substantial amount of code (coming from inline functions or macros with substantial bodies) to do that. [RMS]

Copying the full header file into our repo is also fine, as long as we keep the copyright information and the original license. In effect, that file is under it's original license, and because of 1., the code that uses it can be separately licensed. This patch series does that with linux/bpf.h. Of course we have to be careful not to move any part of that file elsewhere in the repository or create derived work through other means.
Copying parts of the header file and removing the original copyright information and license is OK only if the parts being copied are completely trivial. Using the definitions from samples/bpf/libbpf.h is OK, because of 1., since those are simple structure definitions without any code. But at the same time, the definition of those macros is not completely trivial, and (while it could be argued both ways) I think it's much better to stay on the safe side, and not strip it of the original copyright and license.

Please update the patch series to copy samples/bpf/libbpf.h in its entirety, similarly to bpf.h.

poettering · 2017-09-11T08:54:26Z

@poettering I don't think replicating iptables makes sense either. But IMHO the current level of functionality (address filters) is bit too simple to make much difference, I'd add protocol/port support to make it more useful.

Well, this PR actually adds useful functionality to programs that were previously not available: IP ACLs for everything. Some programs support something like this, but certainly not all, and with this we add to everything in the service, unconditionally. This is is very different from filtering per IP port, or per IP protocol, or by "unusual packets". All that kind of filtering is essentially unnecessary and doesn't bring much new to the table: if your kernel chokes on "unusual packets", then something is wrong anyway, and if your app does TCP on port 4711 but is not supposed to do that, then your service is borked. That's not reason not to add additional restrictions on what port numbers and which protocols services can bind to, but it's in a very different territory: the IPAddressAllow=/IPAddressDeny= stuff adds new concepts where previously there was nothing there at all, while filtering by IP port/protocol just adds a second line of protection on something that should regularly not happen anyway...

Hence, I think yes, the IP ACLs this PR adds does make a major difference. It adds something very new, and very useful.

boucman · 2017-09-11T14:14:05Z

I have to say that, for embedded-system, being able to restrict easily all services to a unique IP address is nice...

alban · 2017-09-11T16:18:31Z

man/systemd-system.conf.xml

-        settings. <varname>DefaultTasksAccounting=</varname> defaults
-        to on, the other three settings to off.</para></listitem>
+        for details on the per-unit settings. <varname>DefaultTasksAccounting=</varname> defaults to on, the other
+        four settings to off.</para></listitem>


As opposed to the other accounting options, IPAccounting= is not really hierarchical. Should a specific warning be added, for people who use systemd-nspawn or similar container tools?

So, according to Tejun there's work in progress to make things fully recursive and I am tempted to actually implement things that way already now, see further discussion below

alban · 2017-09-11T16:25:06Z

man/systemd-system.conf.xml

        <term><varname>DefaultBlockIOAccounting=</varname></term>
        <term><varname>DefaultMemoryAccounting=</varname></term>
        <term><varname>DefaultTasksAccounting=</varname></term>
+        <term><varname>DefaultIPAccouting</varname></term>


typo: IPAccounting
(same typo elsewhere)

alban · 2017-09-11T16:32:05Z

man/systemd.resource-control.xml

+          and the socket unit are kept separate, and displayed separately, no propagation of the setting and the
+          collected data is done, in neither direction. Moreover, any traffic sent or received on any of the socket
+          unit's sockets is accounted to the socket unit — and never to the service unit it might have activated, even
+          if the socket is used by it. The system default for this setting may be controlled with


s/DefaultIOAccounting/DefaultIPAccounting/

alban · 2017-09-11T16:32:53Z

man/systemd.resource-control.xml

+          collected data is done, in neither direction. Moreover, any traffic sent or received on any of the socket
+          unit's sockets is accounted to the socket unit — and never to the service unit it might have activated, even
+          if the socket is used by it. The system default for this setting may be controlled with
+          <varname>DefaultIOAccounting=</varname> in


TODO: an explanation about how Delegate= might interfere with it.

alban · 2017-09-11T17:03:32Z

We are working on a bpf program that firewall at the cgroup level and it would be nice if it could work well with systemd units. At the moment, we have the following race: our bpf program can only be installed after systemd has created the cgroup directory for the unit and if it is not fast enough, the application might already have sent/received network packets. The bpf filter needs to be installed between the fork and the exec.

So for my use-case, IPAccounting, IPAddressAllow and IPAddressDeny is not enough. I would like instead (or in addition) to have an option to specify an existing program:

IPIngressFilterBPF=/sys/fs/bpf/foo/myapp/my-filter
IPEgressFilterBPF=-/sys/fs/bpf/foo/myapp/my-filter2

So an external program could prepare a bpf program pinned on the bpf filesystem before the unit is started.

I would like to avoid Delegate= for the app because except the IP filter at the cgroup level, the app will not do other cgroup mangling.

Also, I would like to avoid using a global helper with ExecStartPre= because it would not work well with RootDirectory=, since the helper would be outside. And also, IPIngressFilterBPF would be more decoupled: if I replace my_firewall_cgroup application by my_other_firewall_cgroup I would not have to edit the ExecStartPre of every single unit file on the system as long as I keep the BPF programs pinned at the same location on the bpf filesystem.

With my solution, I don't need to have much BPF code in systemd: merely a couple of syscalls to retrieve the bpf program from the bpf filesystem and to install it on the cgroup. I would not need systemd to have any BPF assembly or to build a BPF program itself (but I understand it could be useful for other people's use cases).

/cc @iaguis @dongsupark @nhlfr

poettering · 2017-09-11T17:40:35Z

IPIngressFilterBPF=/sys/fs/bpf/foo/myapp/my-filter IPEgressFilterBPF=-/sys/fs/bpf/foo/myapp/my-filter2

makes sense to me. That said I think the common "/sys/fs/bpf" prefix should not be mentioned: we should use a relative path here, in order not to create the illusion any path outside of /sys/fs/bpf could be used here... And should we decide one day, that we want to reuse this setting for ELF filters, than that still be compatible conceptually, because for that we could then use absolute paths still.

poettering · 2017-09-11T17:42:03Z

Hmm, thinking more about that, maybe we should actually not permit configuration of any path at all actually, but just make these new options booleans and simply derive the path automatically from the unit name. i.e. if you want to install your own filter, then just name it properly in bpffs and set the bool and systemd would pick it up?

poettering · 2017-09-11T17:44:22Z

I would like to avoid Delegate= for the app because except the IP filter at the cgroup level, the app will not do other cgroup mangling.

We really should add a property that permits per-unit selection of what to delegate btw: "DelegateConcept=io bpf cpu" or something like that that permits listeing cgroup controllers but also bpf...

poettering · 2017-09-11T17:45:29Z

(anyway, the ability to enforce arbitrary foreign ebpf networking filters should probably be discussed in a separate issue and/or PR, let's leave this one for the IP accounting/filtering the suggested way)

topimiettinen · 2017-09-11T19:29:11Z

All that kind of filtering is essentially unnecessary and doesn't bring much new to the table: if your kernel chokes on "unusual packets", then something is wrong anyway, and if your app does TCP on port 4711 but is not supposed to do that, then your service is borked.

I was thinking of the special kind of borked, where the service (e.g. systemd-timesyncd) suddenly starts trying to contact its new masters in botnet control network. I don't think that this could be blocked with system wide rules (or maybe SELinux could do it), if connecting to HTTPS ports is legitimate at system level. An application level firewall can be made much tighter than system wide rules can.

…s_family() only once

…group first Let's make sure that a socket unit's IPAddressAllow=/IPAddressDeny= settings are in effect on all socket fds associated with it. In order to make this happen we need to make sure the cgroup the fds are associated with are the socket unit's cgroup. The only way to do that is invoking socket()+accept() in them. Since we really don't want to migrate PID 1 around we do this by forking off a helper process, which invokes socket()/accept() and sends the newly created fd to PID 1. Ugly, but works, and there's apparently no better way right now. This generalizes forking off per-unit helper processes in a new function unit_fork_helper_process(), which is then also used by the NSS chown() code of socket units.

Make sure the current IP accounting counters aren't lost during reload/reexec. Note that we destroy all BPF file objects during a reload: the BPF programs, the access and the accounting maps. The former two need to be regenerated anyway with the newly loaded configuration data, but the latter one needs to survive reloads/reexec. In this implementation I opted to only save/restore the accounting map content instead of the map itself. While this opens a (theoretic) window where IP traffic is still accounted to the old map after we read it out, and we thus miss a few bytes this has the benefit that we can alter the map layout between versions should the need arise.

With this change we'll invalidate all cgroup settings after coming back from a daemon reload/reexec, so that the new settings are instantly applied. This is useful for the BPF case, because we don't serialize/deserialize the BPF program fd, and hence have to install a new, updated BPF program when coming back from the reload/reexec. However, this is also useful for the rest of the cgroup settings, as it ensures that user configuration really takes effect wherever we can.

We used to be a bit sloppy on this, and handed out accounting data even for units where accounting wasn't explicitly enabled. Let's be stricter here, so that we know the accounting data is actually fully valid. This is necessary, as the accounting data is no longer stored exclusively in cgroupfs, but is partly maintained external of that, and flushed during unit starts. We should hence only expose accounting data we really know is fully current.

…red logging

…in ENVIRONMENT.md

…OB_RESULT= So, currently, some of the structured log messages we generated based on jobs carry the result in RESULT=, and others in JOB_RESULT=. Let's streamline this, as stick to JOB_RESULT= in one place. This is kind of an API break, but given that currently most software has to check both fields anyway, I think we can get away with it. Why unify on JOB_RESULT= rather than RESULT=? Well, we manage different types of result codes in systemd. Most importanlty besides job results there are also service results, and we should be explicit in what we mean here.

This adds IOVEC_INIT() and IOVEC_MAKE() for initializing iovec structures from a pointer and a size. On top of these IOVEC_INIT_STRING() and IOVEC_MAKE_STRING() are added which take a string and automatically determine the size of the string using strlen(). This patch removes the old IOVEC_SET_STRING() macro, given that IOVEC_MAKE_STRING() is now useful for similar purposes. Note that the old IOVEC_SET_STRING() invocations were two characters shorter than the new ones using IOVEC_MAKE_STRING(), but I think the new syntax is more readable and more generic as it simply resolves to a C99 literal structure initialization. Moreover, we can use very similar syntax now for initializing strings and pointer+size iovec entries. We canalso use the new macros to initialize function parameters on-the-fly or array definitions. And given that we shouldn't have so many ways to do the same stuff, let's just settle on the new macros. (This also converts some code to use _cleanup_ where dynamically allocated strings were using IOVEC_SET_STRING() before, to modernize things a bit)

keyring material should not leak into the container. So far we relied on seccomp to deny access to the keyring, but given that we now made the seccomp configurable, and access to keyctl() and friends may optionally be permitted to containers now let's make sure we disconnect the callers keyring from the keyring of PID 1 in the container.

…ournal This adds a new recognizable log message for each unit invocation that contains structured information about consumed resources of the unit as a whole after it terminated. This is particular useful for apps that want to figure out what the resource consumption of a unit given a specific invocation ID was. The log message is only generated for units that have at least one XyzAccounting= property turned on, and currently only covers IP traffic and CPU time metrics.

In times of seccomp it might very well fail, and given that we return failures from this function anyway, let's also propagate getrlimit() failures, just to be safe.

On current kernels BPF_MAP_TYPE_LPM_TRIE bpf maps are charged against RLIMIT_MEMLOCK even for privileged users that have CAP_IPC_LOCK. Given that mlock() generally ignores RLIMIT_MEMLOCK if CAP_IPC_LOCK is set this appears to be an oversight in the kernel. Either way, until that's fixed, let's just bump RLIMIT_MEMLOCK for the root user considerably, as the default is quite limiting, and doesn't permit us to create more than a few TRIE maps.

The  comment lines resulted in double newlines in the man page header, which looks quite ugly. Let's rearrange a bit so that these comments don't result in changes in the output.

…gate=yes is set Let's permit installing BPF programs in cgroup subtrees if Delegeate=yes. Let's not document this precise behaviour for now though, as most likely the logic here should become recursive, but that's only going to happen if the kernel starts supporting that. Until then, support this in a non-recursive fashion.

keszybz · 2017-09-26T12:37:33Z

src/core/ip-address-access.c

+                if (!a)
+                        return log_oom();
+
+                if (streq(word, "any")) {


Looks like some of this code should be refactored and integrated with bus_append_unit_property_assignment, but let's leave that for later.

keszybz · 2017-09-26T12:59:31Z

src/core/dbus-cgroup.c

+                        if (r < 0)
+                                return r;
+
+                        if (!IN_SET(family, AF_INET, AF_INET6))


Nitpick: IPAddressDeny= also.

keszybz · 2017-09-26T13:00:10Z

src/core/dbus-cgroup.c

+                        if (r < 0)
+                                return r;
+
+                        if (prefixlen > FAMILY_ADDRESS_SIZE(family)*8)


Nitpick: prefixlen should be included in the message.

keszybz · 2017-09-26T13:35:26Z

src/core/cgroup.c

        assert(u);
        assert(ret);

+        cc = unit_get_cgroup_context(u);


TODO: add unit_has_tasks_accounting() or so.

keszybz · 2017-09-26T13:57:59Z

src/basic/log.c

+                }
+
+                if (sendmsg(journal_fd, &mh, MSG_NOSIGNAL) >= 0)
+                        return -errno;


Hmm, there seems to be some confusion about the return code here. Should be -error I think.

@keszybz

As suggested by @keszybz in the review of systemd#6764

@keszybz

As suggested by @keszybz in the review of systemd#6764

@keszybz

As suggested by @keszybz in the review of systemd#6764 (cherry picked from commit 4fe66c8)

poettering added the pid1 label Sep 7, 2017

poettering force-pushed the bpf-firewall branch 5 times, most recently from a655e5f to f508179 Compare September 8, 2017 12:58

keszybz requested changes Sep 9, 2017

View reviewed changes

topimiettinen reviewed Sep 9, 2017

View reviewed changes

keszybz reviewed Sep 9, 2017

View reviewed changes

alban reviewed Sep 11, 2017

View reviewed changes

poettering and others added 20 commits September 22, 2017 15:24

core: warn loudly if IP firewalling is configured but not in effect

078ba55

socket-label: let's use IN_SET, so that we have to call socket_addres…

5ed272c

…s_family() only once

man: document the new ip accounting and filting directives

8d8631d

core: make sure to log invocation ID of units also when doing structu…

f1c50be

…red logging

documentation: document nss-systemd's internal environment variables …

dba1bd4

…in ENVIRONMENT.md

rlimit: don't assume getrlimit() always succeeds

c4ad3f4

In times of seccomp it might very well fail, and given that we return failures from this function anyway, let's also propagate getrlimit() failures, just to be safe.

man: drop misplaced "," before "-.slice"

ee85993

man: remove double newlines in systemd.special man page header

1180181

The  comment lines resulted in double newlines in the man page header, which looks quite ugly. Let's rearrange a bit so that these comments don't result in changes in the output.

man: document two more special units

1c38277

update TODO

22c8321

poettering force-pushed the bpf-firewall branch from 65864f5 to 22c8321 Compare September 22, 2017 13:28

poettering removed the ci-failure-appears-unrelated label Sep 22, 2017

poettering mentioned this pull request Sep 25, 2017

make execute.c logging a bit less special #6915

Merged

keszybz approved these changes Sep 26, 2017

View reviewed changes

keszybz merged commit 22c8321 into systemd:master Sep 26, 2017

poettering mentioned this pull request Sep 26, 2017

Some minor follow-ups for the ebpf/cgroup PR #6919

Merged

poettering added a commit to poettering/systemd that referenced this pull request Sep 26, 2017

core: improve dbus-cgroup error message

51d5c42

As suggested by @keszybz in the review of systemd#6764

poettering added a commit to poettering/systemd that referenced this pull request Sep 26, 2017

core: improve dbus-cgroup error message

4fe66c8

As suggested by @keszybz in the review of systemd#6764

alban mentioned this pull request Nov 29, 2018

RFE: BPF programs on cgroups #10227

Closed

Werkov pushed a commit to Werkov/systemd that referenced this pull request Jun 3, 2020

core: improve dbus-cgroup error message

3ff2299

As suggested by @keszybz in the review of systemd#6764 (cherry picked from commit 4fe66c8)


		assert(p);

		if (!GREEDY_REALLOC(p->instructions, p->allocated, (p->n_instructions + count) * sizeof(struct bpf_insn)))


		DEFINE_TRIVIAL_CLEANUP_FUNC(BPFProgram*, bpf_program_unref);

		/* The following is stolen from the kernel tree, samples/bpf/libbpf.h, which is under GPL2.0 */

Uh oh!

add simple eBPF-based per-unit IP access lists and accounting #6764

add simple eBPF-based per-unit IP access lists and accounting #6764

Uh oh!

Conversation

poettering commented Sep 7, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

keszybz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

topimiettinen left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

topimiettinen left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

topimiettinen commented Sep 9, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

poettering commented Sep 9, 2017

Uh oh!

poettering commented Sep 9, 2017

Uh oh!

poettering commented Sep 9, 2017

Uh oh!

keszybz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

poettering commented Sep 7, 2017 •

edited

Loading

topimiettinen left a comment •

edited

Loading

topimiettinen left a comment •

edited

Loading

keszybz commented Sep 11, 2017 •

edited

Loading