Skip to content

Conversation

@poettering
Copy link
Member

@poettering poettering commented Sep 7, 2017

This adds simple eBPF-based based per-unit IP access lists and accounting. It makes use of the ebpf/cgroup feature of the unified cgroup hierarchy in new kernels.

This is mostly a heavily extended version of @zonque's original userspace work, updated to the kernel APIs ultimately merged into the Linux kernel.

Here's the gist how you use it. Let's start an in interactive shell as transient unit, with IP accounting enabled:

# systemd-run -p IPAccounting=1 -t /bin/sh
Running as unit: run-u1665.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4#

Now, let's have a look at the unit we just started, from inside of it:

sh-4.4# systemctl status run-u1665.service
● run-u1665.service - /bin/sh
   Loaded: loaded (/run/systemd/transient/run-u1665.service; transient; vendor preset: disabled)
Transient: yes
   Active: active (running) since Thu 2017-09-07 20:11:58 CEST; 1min 33s ago
 Main PID: 15859 (sh)
       IP: 0B in, 0B out
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/run-u1665.service
           ├─15859 /bin/sh
           ├─15878 systemctl status run-u1665.service
           └─15879 less

Sep 07 20:11:58 sigma systemd[1]: Started /bin/sh.
sh-4.4# 

Note the new status lines IP:, showing that so far no traffic was generated. Let's change that, by generating some traffic from inside the unit:

sh-4.4# ping localhost
PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=1 ttl=64 time=0.042 ms
64 bytes from localhost.localdomain (127.0.0.1): icmp_seq=2 ttl=64 time=0.062 ms
^C
--- localhost.localdomain ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1012ms
rtt min/avg/max/mdev = 0.042/0.052/0.062/0.010 ms

sh-4.4# systemctl status run-u1665.service
● run-u1665.service - /bin/sh
   Loaded: loaded (/run/systemd/transient/run-u1665.service; transient; vendor preset: disabled)
Transient: yes
   Active: active (running) since Thu 2017-09-07 20:11:58 CEST; 3min 22s ago
 Main PID: 15859 (sh)
       IP: 168B in, 168B out
    Tasks: 3 (limit: 4915)
   CGroup: /system.slice/run-u1665.service
           ├─15859 /bin/sh
           ├─15932 systemctl status run-u1665.service
           └─15933 less

Sep 07 20:11:58 sigma systemd[1]: Started /bin/sh.
sh-4.4# 

As expected, the IP: status line shows we generated some traffic. Now, let's see how the access list stuff works. Let's start from scratch:

# systemd-run -p IPAddressDeny=127.0.0.2 -t /bin/sh
Running as unit: run-u1720.service
Press ^] three times within 1s to disconnect TTY.
sh-4.4# 

This shell now can access all IP servers, except for 127.0.0.2 (which is on the local loopback). Let's try:

sh-4.4# ping -c 1 -W 5 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.091 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.091/0.091/0.091/0.000 ms
sh-4.4# ping -c 1 -W 5 127.0.0.2
PING 127.0.0.2 (127.0.0.2) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
^C
--- 127.0.0.2 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
sh-4.4# ping -c 1 -W 5 127.0.0.3
PING 127.0.0.3 (127.0.0.3) 56(84) bytes of data.
64 bytes from 127.0.0.3: icmp_seq=1 ttl=64 time=0.073 ms

--- 127.0.0.3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.073/0.073/0.073/0.000 ms
sh-4.4# 

As expected, traffic to 127.0.0.1 and 127.0.0.3 still works, but traffic to 127.0.0.2 doesn't! Yay!

@poettering poettering added the pid1 label Sep 7, 2017
@poettering poettering force-pushed the bpf-firewall branch 5 times, most recently from a655e5f to f508179 Compare September 8, 2017 12:58
Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... I'm boarding a plane to Berlin now, so I won't be able to finish the review right now. But it's clear that significant reworking will be needed anyway. I'll look at the rest of the code in the afternoon.


assert(p);

if (!GREEDY_REALLOC(p->instructions, p->allocated, (p->n_instructions + count) * sizeof(struct bpf_insn)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GREEDY_REALLOC already multiplies by the size of each item.


DEFINE_TRIVIAL_CLEANUP_FUNC(BPFProgram*, bpf_program_unref);

/* The following is stolen from the kernel tree, samples/bpf/libbpf.h, which is under GPL2.0 */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but everything below in this file will need to be dropped, and independently coded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, it's kinda an API, we define it in a .h header, and this is normally thought not to infect things license-wise... it doesn't encode algorithms, is too trivial for that... I mean, specifying the origin and its license should be fine

* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, this file cannot be present in the systemd sources. If we use a public header provided by the system, that's different. But including it in our sources is not possible w/o bumping our own license to full GPL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naah... we don't have to bump anything. As mentioned above I don#t think this is a real problem at all. But if indeed the header's copyright would matter, then we could still license our stuff as LGPL, though effectively it would be downgraded to GPL when you actually run it. But GPL vs LGPL only matters during runtime really, not in what we include in our tarball.

Or in other words, as long as we tell people that if they turn on the BPF stuff PID 1 will be effectively GPL it doesn't really matter at all...

I mean, quite frankly I wouldn't even have a problem at all with systemd being GPL rather than LGPL. The only reason we care so far, is that libsystemd should be linkable by non-free code, and we want the ability to copy/paste around our own code between libsystemd and PID 1 and other components freely without constantly thinking about the license...

IANAL, of course, but i don't think this is really a problem, at all. This is not code after all, just APIs...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and btw, some components of systemd are GPL rather than LPGL anyway, src/basic/ioprio.h, src/basic/securebits.h, and some stuff of udev iirc. That's why we ship LICENSE.GPL2 in the repo after all)

Copy link
Contributor

@topimiettinen topimiettinen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but perhaps this could be committed separately. Is this needed by other patches?
edit: this applies to 8e0e6ec, sorry.

Copy link
Contributor

@topimiettinen topimiettinen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this could be useful regardless of the rest of patches.
edit: this applies to df1888c.

return r;
if (r == 0)
log_warning("Transient unit %s configures an IP firewall, but the local system does not support BPF/cgroup firewalling. "
"Proceeding WITHOUT firewalling in effect!", u->id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps this should cause the unit to fail instead? For example, a service which should only accept requests from localnet would be exposed to more hostile networks. Though it would mean that the unit file would need editing when used on hosts without BPF/cgroup firewalling.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the reason I explicitly chose not to make this fatal is that I'd like to ship all our services which do not require internet access (which are effectively all except for the journal remoting stuff), with either IPAddressDeny=any alone, or IPAddressDeny=any + IPAddressAllow=localhost, so that no IP communication is possible at all, or only with 127.0.0.1. Now, if we do set this, then we don't want to break the services completely on kernels that don't support eBPF/cgroup stuff, and thus I just added the loud warning.

Note that IPAddressDeny=any is in many ways a better option that PrivateNetwork= even though it has very similar effects. That's because PrivateNetwork= affects AF_NETLINK and AF_UNIX (at least the abstract namespace), and that's pretty awful, as it breaks libudev and such. IPAddressDeny= is hence a great option for everything that needs libudev but still wants to prohibit networking.

if (r == 0)
log_warning("File %s:%u configures an IP firewall (%s=%s), but the local system does not support BPF/cgroup based firewalling. "
"Proceeding WITHOUT firewalling in effect!", filename, line, lvalue, rvalue);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

struct bpf_insn insn[] = {
/*
* If R0 == 0, the packet will be denied; skip the accounting instructions in this case.
* The jump label will be fixed up later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to keep a separate counter of the denied packets too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but let's maybe leave that for a later PR.

[ -f "$BUILDDIR"/build.ninja ] || meson "$BUILDDIR"
ninja -C "$BUILDDIR" all
[ "$WITH_TESTS" = 0 ] || ninja -C "$BUILDDIR" test
[ "$WITH_TESTS" = 0 ] || ninja -C "$BUILDDIR" test || ( RET="$?" ; cat "$BUILDDIR"/meson-logs/testlog.txt ; exit "$RET" )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe commit separately?

@topimiettinen
Copy link
Contributor

Overall very interesting feature. The filtering is at pretty basic level as only addresses can be used, not ports, packet flags, SELinux labels, connection states, interfaces etc. but perhaps this can be extended in the future (in moderation).

For example, systemd-timesyncd should be able to send UDP packets to NTP servers' ports but not connect with TCP to IRC botnets. The IP addresses of the servers may not be known in advance (and the current filter syntax does not allow name resolution to be used), so it's not possible to specify an address based filter either.


<para>In order to implement a whitelisting IP firewall, it is recommended to use a system-wide
<varname>IPAddressDeny=</varname><constant>any</constant> setting, plus individual per-unit
<varname>IPAddressAllow=</varname> lines permitting network access to relevant services, and only them.</para>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places the user is warned about kernel feature availability:
Note that the implementation of this setting might be impossible (for example if user namespaces are not available), and the unit should be written in a way that does not solely rely on this setting for security.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add such a comment

@poettering
Copy link
Member Author

Looks good to me, but perhaps this could be committed separately. Is this needed by other patches?
edit: this applies to 8e0e6ec, sorry.

yes, test-bpf.c needs that to work

@poettering
Copy link
Member Author

Also this could be useful regardless of the rest of patches.
edit: this applies to df1888c.

Well, sure, but it also is necessary to make test-bpf.c

That said, even if it wasn't: we don't really insist on having strictly one separate PR for each topic, we permit some room to include not super-strictly related stuff in other PRs...)

@poettering
Copy link
Member Author

Overall very interesting feature. The filtering is at pretty basic level as only addresses can be used, not ports, packet flags, SELinux labels, connection states, interfaces etc. but perhaps this can be extended in the future (in moderation).

For example, systemd-timesyncd should be able to send UDP packets to NTP servers' ports but not connect with TCP to IRC botnets. The IP addresses of the servers may not be known in advance (and the current filter syntax does not allow name resolution to be used), so it's not possible to specify an address based filter either.

I really don't want to create a second iptables here, hence we should be really careful with adding too many finegrained controls. One of the major benefits of doing this kind of service-level firewalling is that we know more about the context of the traffic than iptables ever knows: we know that nginx.service is about HTTP and ntpd.service about NTP, hence we don't have to derive application context from the port number necessarily. This is particular great for protocols such as bittorrent where the port number is effectively random.

But yeah, it certainly is in the cards to extend this a bit later, but we should keep a focus on keeping this simple, and not make it a fully generic iptables-reeimplementation.

I am very much against doing name lookups for IP ACL configuration. DNS is relatively dynamic, and requires networking to work (except in exceptional cases), and doing networking in order to restrict networking is kinda contradictory...

Copy link
Member

@keszybz keszybz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patches look great. I found a few nitpicks here and there, mostly in docs, but nothing important. I looked over the BPF code, but it'd be good if somebody who knows BPF well looked at it.

With the headers — let me think a bit / read up on that.

const char *controller,
const char *path,
const char *event,
char **val){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space before brace.


while (!IN_SET(SERVICE(u)->state, SERVICE_DEAD, SERVICE_FAILED)) {
assert_se(sd_event_run(m->event, UINT64_MAX) >= 0);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary braces.

<listitem><para>Configure system-wide access lists of IPv4 or IPv6 addresses (each with optional prefix lengths) that
is applied to all units on the system. These settings operate similar to the per-unit options of the same name,
see
<citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>. Note
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please drop the comma before and. (Alternatively, the comma can stay, but then another one must be added before "are". But I think it's better to drop it.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think that duplicating the semantics of Deny/Allow here is unnecessary. It'd be better to just say:

Configure system-wide access lists of IPv4 or IPv6 addresses (each with optional prefix lengths) that is applied to all units in the system. These settings are similar to the per-unit options of the same name, see <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>. The lists configured with the global options and the lists configured per-unit are combined when applied, so the latter extend the former. By default global lists are empty, allowing access from all hosts. For details on the expected parameter syntax and how the allow/deny lists are applied, see the per-unit settings documentation linked above.

The other description is better, and repeated stuff tends to get out of sync...

<listitem>
<para>Takes a boolean argument. If true, turns on IPv4 and IPv6 network traffic accounting for packets sent
or received by the unit. When this option is turned on, all IPv4 and IPv6 sockets created by any process of
the unit are accounted for. When this option is used in socket units, it applies to all IPv4 and IPv6
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IPv4 and IPv6 traffic

<para>Takes a boolean argument. If true, turns on IPv4 and IPv6 network traffic accounting for packets sent
or received by the unit. When this option is turned on, all IPv4 and IPv6 sockets created by any process of
the unit are accounted for. When this option is used in socket units, it applies to all IPv4 and IPv6
associated with it (including both listening and connection sockets where this applies). Note that for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collected

if (r < 0)
return r;

if (an != FAMILY_ADDRESS_SIZE(family))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to include the size/family in the error message... Just a suggestion.

size_t size = 0;

if (n == 0) {
while (*list)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so an empty list resets.

}

return first;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's somewhat tricky code. It'd be nice to add a test for this also. (And if not a test, then maybe just a TODO entry ;)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a test actually for this already, in test-bpf, though it's not obvious to see... we create a redundant ACL there which should be reduced to two entries... I'll add a comment clarifying that

if (r < 0)
return r;
if (r == 0)
log_warning("Transient unit %s configures an IP firewall, but the local system does not support BPF/cgroup firewalling. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put a newline between the sentences. We handle mulitline messages just fine and it's easier to read.

return r;
if (r == 0)
log_warning("File %s:%u configures an IP firewall (%s=%s), but the local system does not support BPF/cgroup based firewalling. "
"Proceeding WITHOUT firewalling in effect!", filename, line, lvalue, rvalue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@topimiettinen
Copy link
Contributor

@poettering I don't think replicating iptables makes sense either. But IMHO the current level of functionality (address filters) is bit too simple to make much difference, I'd add protocol/port support to make it more useful.

Extending the functionality could take a different approach compared to iptables. For example, instead of adding more and more detailed filtering options, how about higher level switches like boolean IPDropUnusualPackets for filtering packets with exotic options?

@keszybz
Copy link
Member

keszybz commented Sep 11, 2017

On licensing:

  1. It is OK to use the (GPLv2-only) headers from the kernel in our code (LGPL). Ref.: http://lkml.iu.edu/hypermail/linux/kernel/0301.1/0362.html,

using structure definitions, typedefs, enumeration constants, macros with simple bodies, etc., is NOT enough to make a derivative work. It would take a substantial amount of code (coming from inline functions or macros with substantial bodies) to do that. [RMS]

  1. Copying the full header file into our repo is also fine, as long as we keep the copyright information and the original license. In effect, that file is under it's original license, and because of 1., the code that uses it can be separately licensed. This patch series does that with linux/bpf.h. Of course we have to be careful not to move any part of that file elsewhere in the repository or create derived work through other means.

  2. Copying parts of the header file and removing the original copyright information and license is OK only if the parts being copied are completely trivial. Using the definitions from samples/bpf/libbpf.h is OK, because of 1., since those are simple structure definitions without any code. But at the same time, the definition of those macros is not completely trivial, and (while it could be argued both ways) I think it's much better to stay on the safe side, and not strip it of the original copyright and license.

Please update the patch series to copy samples/bpf/libbpf.h in its entirety, similarly to bpf.h.

@poettering
Copy link
Member Author

@poettering I don't think replicating iptables makes sense either. But IMHO the current level of functionality (address filters) is bit too simple to make much difference, I'd add protocol/port support to make it more useful.

Well, this PR actually adds useful functionality to programs that were previously not available: IP ACLs for everything. Some programs support something like this, but certainly not all, and with this we add to everything in the service, unconditionally. This is is very different from filtering per IP port, or per IP protocol, or by "unusual packets". All that kind of filtering is essentially unnecessary and doesn't bring much new to the table: if your kernel chokes on "unusual packets", then something is wrong anyway, and if your app does TCP on port 4711 but is not supposed to do that, then your service is borked. That's not reason not to add additional restrictions on what port numbers and which protocols services can bind to, but it's in a very different territory: the IPAddressAllow=/IPAddressDeny= stuff adds new concepts where previously there was nothing there at all, while filtering by IP port/protocol just adds a second line of protection on something that should regularly not happen anyway...

Hence, I think yes, the IP ACLs this PR adds does make a major difference. It adds something very new, and very useful.

@boucman
Copy link
Contributor

boucman commented Sep 11, 2017

I have to say that, for embedded-system, being able to restrict easily all services to a unique IP address is nice...

settings. <varname>DefaultTasksAccounting=</varname> defaults
to on, the other three settings to off.</para></listitem>
for details on the per-unit settings. <varname>DefaultTasksAccounting=</varname> defaults to on, the other
four settings to off.</para></listitem>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As opposed to the other accounting options, IPAccounting= is not really hierarchical. Should a specific warning be added, for people who use systemd-nspawn or similar container tools?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, according to Tejun there's work in progress to make things fully recursive and I am tempted to actually implement things that way already now, see further discussion below

<term><varname>DefaultBlockIOAccounting=</varname></term>
<term><varname>DefaultMemoryAccounting=</varname></term>
<term><varname>DefaultTasksAccounting=</varname></term>
<term><varname>DefaultIPAccouting</varname></term>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: IPAccounting
(same typo elsewhere)

and the socket unit are kept separate, and displayed separately, no propagation of the setting and the
collected data is done, in neither direction. Moreover, any traffic sent or received on any of the socket
unit's sockets is accounted to the socket unit — and never to the service unit it might have activated, even
if the socket is used by it. The system default for this setting may be controlled with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/DefaultIOAccounting/DefaultIPAccounting/

collected data is done, in neither direction. Moreover, any traffic sent or received on any of the socket
unit's sockets is accounted to the socket unit — and never to the service unit it might have activated, even
if the socket is used by it. The system default for this setting may be controlled with
<varname>DefaultIOAccounting=</varname> in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: an explanation about how Delegate= might interfere with it.

@alban
Copy link
Member

alban commented Sep 11, 2017

We are working on a bpf program that firewall at the cgroup level and it would be nice if it could work well with systemd units. At the moment, we have the following race: our bpf program can only be installed after systemd has created the cgroup directory for the unit and if it is not fast enough, the application might already have sent/received network packets. The bpf filter needs to be installed between the fork and the exec.

So for my use-case, IPAccounting, IPAddressAllow and IPAddressDeny is not enough. I would like instead (or in addition) to have an option to specify an existing program:

IPIngressFilterBPF=/sys/fs/bpf/foo/myapp/my-filter
IPEgressFilterBPF=-/sys/fs/bpf/foo/myapp/my-filter2

So an external program could prepare a bpf program pinned on the bpf filesystem before the unit is started.

I would like to avoid Delegate= for the app because except the IP filter at the cgroup level, the app will not do other cgroup mangling.

Also, I would like to avoid using a global helper with ExecStartPre= because it would not work well with RootDirectory=, since the helper would be outside. And also, IPIngressFilterBPF would be more decoupled: if I replace my_firewall_cgroup application by my_other_firewall_cgroup I would not have to edit the ExecStartPre of every single unit file on the system as long as I keep the BPF programs pinned at the same location on the bpf filesystem.

With my solution, I don't need to have much BPF code in systemd: merely a couple of syscalls to retrieve the bpf program from the bpf filesystem and to install it on the cgroup. I would not need systemd to have any BPF assembly or to build a BPF program itself (but I understand it could be useful for other people's use cases).

/cc @iaguis @dongsupark @nhlfr

@poettering
Copy link
Member Author

IPIngressFilterBPF=/sys/fs/bpf/foo/myapp/my-filter IPEgressFilterBPF=-/sys/fs/bpf/foo/myapp/my-filter2

makes sense to me. That said I think the common "/sys/fs/bpf" prefix should not be mentioned: we should use a relative path here, in order not to create the illusion any path outside of /sys/fs/bpf could be used here... And should we decide one day, that we want to reuse this setting for ELF filters, than that still be compatible conceptually, because for that we could then use absolute paths still.

@poettering
Copy link
Member Author

Hmm, thinking more about that, maybe we should actually not permit configuration of any path at all actually, but just make these new options booleans and simply derive the path automatically from the unit name. i.e. if you want to install your own filter, then just name it properly in bpffs and set the bool and systemd would pick it up?

@poettering
Copy link
Member Author

I would like to avoid Delegate= for the app because except the IP filter at the cgroup level, the app will not do other cgroup mangling.

We really should add a property that permits per-unit selection of what to delegate btw: "DelegateConcept=io bpf cpu" or something like that that permits listeing cgroup controllers but also bpf...

@poettering
Copy link
Member Author

(anyway, the ability to enforce arbitrary foreign ebpf networking filters should probably be discussed in a separate issue and/or PR, let's leave this one for the IP accounting/filtering the suggested way)

@topimiettinen
Copy link
Contributor

All that kind of filtering is essentially unnecessary and doesn't bring much new to the table: if your kernel chokes on "unusual packets", then something is wrong anyway, and if your app does TCP on port 4711 but is not supposed to do that, then your service is borked.

I was thinking of the special kind of borked, where the service (e.g. systemd-timesyncd) suddenly starts trying to contact its new masters in botnet control network. I don't think that this could be blocked with system wide rules (or maybe SELinux could do it), if connecting to HTTPS ports is legitimate at system level. An application level firewall can be made much tighter than system wide rules can.

poettering and others added 20 commits September 22, 2017 15:24
…group first

Let's make sure that a socket unit's IPAddressAllow=/IPAddressDeny=
settings are in effect on all socket fds associated with it. In order to
make this happen we need to make sure the cgroup the fds are associated
with are the socket unit's cgroup. The only way to do that is invoking
socket()+accept() in them. Since we really don't want to migrate PID 1
around we do this by forking off a helper process, which invokes
socket()/accept() and sends the newly created fd to PID 1. Ugly, but
works, and there's apparently no better way right now.

This generalizes forking off per-unit helper processes in a new function
unit_fork_helper_process(), which is then also used by the NSS chown()
code of socket units.
Make sure the current IP accounting counters aren't lost during
reload/reexec.

Note that we destroy all BPF file objects during a reload: the BPF
programs, the access and the accounting maps. The former two need to be
regenerated anyway with the newly loaded configuration data, but the
latter one needs to survive reloads/reexec. In this implementation I
opted to only save/restore the accounting map content instead of the map
itself. While this opens a (theoretic) window where IP traffic is still
accounted to the old map after we read it out, and we thus miss a few
bytes this has the benefit that we can alter the map layout between
versions should the need arise.
With this change we'll invalidate all cgroup settings after coming back
from a daemon reload/reexec, so that the new settings are instantly
applied.

This is useful for the BPF case, because we don't serialize/deserialize
the BPF program fd, and hence have to install a new, updated BPF program
when coming back from the reload/reexec. However, this is also useful
for the rest of the cgroup settings, as it ensures that user
configuration really takes effect wherever we can.
We used to be a bit sloppy on this, and handed out accounting data even
for units where accounting wasn't explicitly enabled. Let's be stricter
here, so that we know the accounting data is actually fully valid. This
is necessary, as the accounting data is no longer stored exclusively in
cgroupfs, but is partly maintained external of that, and flushed during
unit starts. We should hence only expose accounting data we really know
is fully current.
…OB_RESULT=

So, currently, some of the structured log messages we generated based on
jobs carry the result in RESULT=, and others in JOB_RESULT=. Let's
streamline this, as stick to JOB_RESULT= in one place.

This is kind of an API break, but given that currently most software has
to check both fields anyway, I think we can get away with it.

Why unify on JOB_RESULT= rather than RESULT=? Well, we manage different
types of result codes in systemd. Most importanlty besides job results
there are also service results, and we should be explicit in what we
mean here.
This adds IOVEC_INIT() and IOVEC_MAKE() for initializing iovec structures
from a pointer and a size. On top of these IOVEC_INIT_STRING() and
IOVEC_MAKE_STRING() are added which take a string and automatically
determine the size of the string using strlen().

This patch removes the old IOVEC_SET_STRING() macro, given that
IOVEC_MAKE_STRING() is now useful for similar purposes. Note that the
old IOVEC_SET_STRING() invocations were two characters shorter than the
new ones using IOVEC_MAKE_STRING(), but I think the new syntax is more
readable and more generic as it simply resolves to a C99 literal
structure initialization. Moreover, we can use very similar syntax now
for initializing strings and pointer+size iovec entries. We canalso use
the new macros to initialize function parameters on-the-fly or array
definitions. And given that we shouldn't have so many ways to do the
same stuff, let's just settle on the new macros.

(This also converts some code to use _cleanup_ where dynamically
allocated strings were using IOVEC_SET_STRING() before, to modernize
things a bit)
keyring material should not leak into the container. So far we relied on
seccomp to deny access to the keyring, but given that we now made the
seccomp configurable, and access to keyctl() and friends may optionally
be permitted to containers now let's make sure we disconnect the callers
keyring from the keyring of PID 1 in the container.
…ournal

This adds a new recognizable log message for each unit invocation that
contains structured information about consumed resources of the unit as
a whole after it terminated. This is particular useful for apps that
want to figure out what the resource consumption of a unit given a
specific invocation ID was.

The log message is only generated for units that have at least one
XyzAccounting= property turned on, and currently only covers IP traffic and CPU
time metrics.
In times of seccomp it might very well fail, and given that we return
failures from this function anyway, let's also propagate getrlimit()
failures, just to be safe.
On current kernels BPF_MAP_TYPE_LPM_TRIE bpf maps are charged against
RLIMIT_MEMLOCK even for privileged users that have CAP_IPC_LOCK. Given
that mlock() generally ignores RLIMIT_MEMLOCK if CAP_IPC_LOCK is set
this appears to be an oversight in the kernel. Either way, until that's
fixed, let's just bump RLIMIT_MEMLOCK for the root user considerably, as
the default is quite limiting, and doesn't permit us to create more than
a few TRIE maps.
The <!-- --> comment lines resulted in double newlines in the man page
header, which looks quite ugly. Let's rearrange a bit so that these
comments don't result in changes in the output.
…gate=yes is set

Let's permit installing BPF programs in cgroup subtrees if
Delegeate=yes. Let's not document this precise behaviour for now though,
as most likely the logic here should become recursive, but that's only
going to happen if the kernel starts supporting that. Until then,
support this in a non-recursive fashion.
if (!a)
return log_oom();

if (streq(word, "any")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like some of this code should be refactored and integrated with bus_append_unit_property_assignment, but let's leave that for later.

if (r < 0)
return r;

if (!IN_SET(family, AF_INET, AF_INET6))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: IPAddressDeny= also.

if (r < 0)
return r;

if (prefixlen > FAMILY_ADDRESS_SIZE(family)*8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: prefixlen should be included in the message.

assert(u);
assert(ret);

cc = unit_get_cgroup_context(u);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: add unit_has_tasks_accounting() or so.

}

if (sendmsg(journal_fd, &mh, MSG_NOSIGNAL) >= 0)
return -errno;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, there seems to be some confusion about the return code here. Should be -error I think.

@keszybz keszybz merged commit 22c8321 into systemd:master Sep 26, 2017
poettering added a commit to poettering/systemd that referenced this pull request Sep 26, 2017
poettering added a commit to poettering/systemd that referenced this pull request Sep 26, 2017
Werkov pushed a commit to Werkov/systemd that referenced this pull request Jun 3, 2020
As suggested by @keszybz in the review of systemd#6764

(cherry picked from commit 4fe66c8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

8 participants