-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
add simple eBPF-based per-unit IP access lists and accounting #6764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a655e5f to
f508179
Compare
keszybz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... I'm boarding a plane to Berlin now, so I won't be able to finish the review right now. But it's clear that significant reworking will be needed anyway. I'll look at the rest of the code in the afternoon.
src/basic/bpf-program.c
Outdated
|
|
||
| assert(p); | ||
|
|
||
| if (!GREEDY_REALLOC(p->instructions, p->allocated, (p->n_instructions + count) * sizeof(struct bpf_insn))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GREEDY_REALLOC already multiplies by the size of each item.
src/basic/bpf-program.h
Outdated
|
|
||
| DEFINE_TRIVIAL_CLEANUP_FUNC(BPFProgram*, bpf_program_unref); | ||
|
|
||
| /* The following is stolen from the kernel tree, samples/bpf/libbpf.h, which is under GPL2.0 */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sorry, but everything below in this file will need to be dropped, and independently coded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, it's kinda an API, we define it in a .h header, and this is normally thought not to infect things license-wise... it doesn't encode algorithms, is too trivial for that... I mean, specifying the origin and its license should be fine
| * This program is free software; you can redistribute it and/or | ||
| * modify it under the terms of version 2 of the GNU General Public | ||
| * License as published by the Free Software Foundation. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, this file cannot be present in the systemd sources. If we use a public header provided by the system, that's different. But including it in our sources is not possible w/o bumping our own license to full GPL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naah... we don't have to bump anything. As mentioned above I don#t think this is a real problem at all. But if indeed the header's copyright would matter, then we could still license our stuff as LGPL, though effectively it would be downgraded to GPL when you actually run it. But GPL vs LGPL only matters during runtime really, not in what we include in our tarball.
Or in other words, as long as we tell people that if they turn on the BPF stuff PID 1 will be effectively GPL it doesn't really matter at all...
I mean, quite frankly I wouldn't even have a problem at all with systemd being GPL rather than LGPL. The only reason we care so far, is that libsystemd should be linkable by non-free code, and we want the ability to copy/paste around our own code between libsystemd and PID 1 and other components freely without constantly thinking about the license...
IANAL, of course, but i don't think this is really a problem, at all. This is not code after all, just APIs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and btw, some components of systemd are GPL rather than LPGL anyway, src/basic/ioprio.h, src/basic/securebits.h, and some stuff of udev iirc. That's why we ship LICENSE.GPL2 in the repo after all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but perhaps this could be committed separately. Is this needed by other patches?
edit: this applies to 8e0e6ec, sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this could be useful regardless of the rest of patches.
edit: this applies to df1888c.
src/core/dbus-cgroup.c
Outdated
| return r; | ||
| if (r == 0) | ||
| log_warning("Transient unit %s configures an IP firewall, but the local system does not support BPF/cgroup firewalling. " | ||
| "Proceeding WITHOUT firewalling in effect!", u->id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this should cause the unit to fail instead? For example, a service which should only accept requests from localnet would be exposed to more hostile networks. Though it would mean that the unit file would need editing when used on hosts without BPF/cgroup firewalling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the reason I explicitly chose not to make this fatal is that I'd like to ship all our services which do not require internet access (which are effectively all except for the journal remoting stuff), with either IPAddressDeny=any alone, or IPAddressDeny=any + IPAddressAllow=localhost, so that no IP communication is possible at all, or only with 127.0.0.1. Now, if we do set this, then we don't want to break the services completely on kernels that don't support eBPF/cgroup stuff, and thus I just added the loud warning.
Note that IPAddressDeny=any is in many ways a better option that PrivateNetwork= even though it has very similar effects. That's because PrivateNetwork= affects AF_NETLINK and AF_UNIX (at least the abstract namespace), and that's pretty awful, as it breaks libudev and such. IPAddressDeny= is hence a great option for everything that needs libudev but still wants to prohibit networking.
| if (r == 0) | ||
| log_warning("File %s:%u configures an IP firewall (%s=%s), but the local system does not support BPF/cgroup based firewalling. " | ||
| "Proceeding WITHOUT firewalling in effect!", filename, line, lvalue, rvalue); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
| struct bpf_insn insn[] = { | ||
| /* | ||
| * If R0 == 0, the packet will be denied; skip the accounting instructions in this case. | ||
| * The jump label will be fixed up later. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to keep a separate counter of the denied packets too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, but let's maybe leave that for a later PR.
| [ -f "$BUILDDIR"/build.ninja ] || meson "$BUILDDIR" | ||
| ninja -C "$BUILDDIR" all | ||
| [ "$WITH_TESTS" = 0 ] || ninja -C "$BUILDDIR" test | ||
| [ "$WITH_TESTS" = 0 ] || ninja -C "$BUILDDIR" test || ( RET="$?" ; cat "$BUILDDIR"/meson-logs/testlog.txt ; exit "$RET" ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe commit separately?
|
Overall very interesting feature. The filtering is at pretty basic level as only addresses can be used, not ports, packet flags, SELinux labels, connection states, interfaces etc. but perhaps this can be extended in the future (in moderation). For example, systemd-timesyncd should be able to send UDP packets to NTP servers' ports but not connect with TCP to IRC botnets. The IP addresses of the servers may not be known in advance (and the current filter syntax does not allow name resolution to be used), so it's not possible to specify an address based filter either. |
man/systemd.resource-control.xml
Outdated
|
|
||
| <para>In order to implement a whitelisting IP firewall, it is recommended to use a system-wide | ||
| <varname>IPAddressDeny=</varname><constant>any</constant> setting, plus individual per-unit | ||
| <varname>IPAddressAllow=</varname> lines permitting network access to relevant services, and only them.</para> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other places the user is warned about kernel feature availability:
Note that the implementation of this setting might be impossible (for example if user namespaces are not available), and the unit should be written in a way that does not solely rely on this setting for security.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will add such a comment
yes, test-bpf.c needs that to work |
Well, sure, but it also is necessary to make test-bpf.c That said, even if it wasn't: we don't really insist on having strictly one separate PR for each topic, we permit some room to include not super-strictly related stuff in other PRs...) |
I really don't want to create a second iptables here, hence we should be really careful with adding too many finegrained controls. One of the major benefits of doing this kind of service-level firewalling is that we know more about the context of the traffic than iptables ever knows: we know that nginx.service is about HTTP and ntpd.service about NTP, hence we don't have to derive application context from the port number necessarily. This is particular great for protocols such as bittorrent where the port number is effectively random. But yeah, it certainly is in the cards to extend this a bit later, but we should keep a focus on keeping this simple, and not make it a fully generic iptables-reeimplementation. I am very much against doing name lookups for IP ACL configuration. DNS is relatively dynamic, and requires networking to work (except in exceptional cases), and doing networking in order to restrict networking is kinda contradictory... |
keszybz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The patches look great. I found a few nitpicks here and there, mostly in docs, but nothing important. I looked over the BPF code, but it'd be good if somebody who knows BPF well looked at it.
With the headers — let me think a bit / read up on that.
src/basic/cgroup-util.c
Outdated
| const char *controller, | ||
| const char *path, | ||
| const char *event, | ||
| char **val){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space before brace.
src/test/test-bpf.c
Outdated
|
|
||
| while (!IN_SET(SERVICE(u)->state, SERVICE_DEAD, SERVICE_FAILED)) { | ||
| assert_se(sd_event_run(m->event, UINT64_MAX) >= 0); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary braces.
man/systemd-system.conf.xml
Outdated
| <listitem><para>Configure system-wide access lists of IPv4 or IPv6 addresses (each with optional prefix lengths) that | ||
| is applied to all units on the system. These settings operate similar to the per-unit options of the same name, | ||
| see | ||
| <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>. Note |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please drop the comma before and. (Alternatively, the comma can stay, but then another one must be added before "are". But I think it's better to drop it.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also think that duplicating the semantics of Deny/Allow here is unnecessary. It'd be better to just say:
Configure system-wide access lists of IPv4 or IPv6 addresses (each with optional prefix lengths) that is applied to all units in the system. These settings are similar to the per-unit options of the same name, see
<citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry>. The lists configured with the global options and the lists configured per-unit are combined when applied, so the latter extend the former. By default global lists are empty, allowing access from all hosts. For details on the expected parameter syntax and how the allow/deny lists are applied, see the per-unit settings documentation linked above.
The other description is better, and repeated stuff tends to get out of sync...
man/systemd.resource-control.xml
Outdated
| <listitem> | ||
| <para>Takes a boolean argument. If true, turns on IPv4 and IPv6 network traffic accounting for packets sent | ||
| or received by the unit. When this option is turned on, all IPv4 and IPv6 sockets created by any process of | ||
| the unit are accounted for. When this option is used in socket units, it applies to all IPv4 and IPv6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IPv4 and IPv6 traffic
| <para>Takes a boolean argument. If true, turns on IPv4 and IPv6 network traffic accounting for packets sent | ||
| or received by the unit. When this option is turned on, all IPv4 and IPv6 sockets created by any process of | ||
| the unit are accounted for. When this option is used in socket units, it applies to all IPv4 and IPv6 | ||
| associated with it (including both listening and connection sockets where this applies). Note that for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collected
| if (r < 0) | ||
| return r; | ||
|
|
||
| if (an != FAMILY_ADDRESS_SIZE(family)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to include the size/family in the error message... Just a suggestion.
src/core/dbus-cgroup.c
Outdated
| size_t size = 0; | ||
|
|
||
| if (n == 0) { | ||
| while (*list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so an empty list resets.
| } | ||
|
|
||
| return first; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, that's somewhat tricky code. It'd be nice to add a test for this also. (And if not a test, then maybe just a TODO entry ;)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's a test actually for this already, in test-bpf, though it's not obvious to see... we create a redundant ACL there which should be reduced to two entries... I'll add a comment clarifying that
src/core/dbus-cgroup.c
Outdated
| if (r < 0) | ||
| return r; | ||
| if (r == 0) | ||
| log_warning("Transient unit %s configures an IP firewall, but the local system does not support BPF/cgroup firewalling. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put a newline between the sentences. We handle mulitline messages just fine and it's easier to read.
src/core/ip-address-access.c
Outdated
| return r; | ||
| if (r == 0) | ||
| log_warning("File %s:%u configures an IP firewall (%s=%s), but the local system does not support BPF/cgroup based firewalling. " | ||
| "Proceeding WITHOUT firewalling in effect!", filename, line, lvalue, rvalue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
|
@poettering I don't think replicating iptables makes sense either. But IMHO the current level of functionality (address filters) is bit too simple to make much difference, I'd add protocol/port support to make it more useful. Extending the functionality could take a different approach compared to iptables. For example, instead of adding more and more detailed filtering options, how about higher level switches like boolean |
|
On licensing:
Please update the patch series to copy |
Well, this PR actually adds useful functionality to programs that were previously not available: IP ACLs for everything. Some programs support something like this, but certainly not all, and with this we add to everything in the service, unconditionally. This is is very different from filtering per IP port, or per IP protocol, or by "unusual packets". All that kind of filtering is essentially unnecessary and doesn't bring much new to the table: if your kernel chokes on "unusual packets", then something is wrong anyway, and if your app does TCP on port 4711 but is not supposed to do that, then your service is borked. That's not reason not to add additional restrictions on what port numbers and which protocols services can bind to, but it's in a very different territory: the IPAddressAllow=/IPAddressDeny= stuff adds new concepts where previously there was nothing there at all, while filtering by IP port/protocol just adds a second line of protection on something that should regularly not happen anyway... Hence, I think yes, the IP ACLs this PR adds does make a major difference. It adds something very new, and very useful. |
|
I have to say that, for embedded-system, being able to restrict easily all services to a unique IP address is nice... |
| settings. <varname>DefaultTasksAccounting=</varname> defaults | ||
| to on, the other three settings to off.</para></listitem> | ||
| for details on the per-unit settings. <varname>DefaultTasksAccounting=</varname> defaults to on, the other | ||
| four settings to off.</para></listitem> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As opposed to the other accounting options, IPAccounting= is not really hierarchical. Should a specific warning be added, for people who use systemd-nspawn or similar container tools?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, according to Tejun there's work in progress to make things fully recursive and I am tempted to actually implement things that way already now, see further discussion below
man/systemd-system.conf.xml
Outdated
| <term><varname>DefaultBlockIOAccounting=</varname></term> | ||
| <term><varname>DefaultMemoryAccounting=</varname></term> | ||
| <term><varname>DefaultTasksAccounting=</varname></term> | ||
| <term><varname>DefaultIPAccouting</varname></term> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: IPAccounting
(same typo elsewhere)
man/systemd.resource-control.xml
Outdated
| and the socket unit are kept separate, and displayed separately, no propagation of the setting and the | ||
| collected data is done, in neither direction. Moreover, any traffic sent or received on any of the socket | ||
| unit's sockets is accounted to the socket unit — and never to the service unit it might have activated, even | ||
| if the socket is used by it. The system default for this setting may be controlled with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/DefaultIOAccounting/DefaultIPAccounting/
man/systemd.resource-control.xml
Outdated
| collected data is done, in neither direction. Moreover, any traffic sent or received on any of the socket | ||
| unit's sockets is accounted to the socket unit — and never to the service unit it might have activated, even | ||
| if the socket is used by it. The system default for this setting may be controlled with | ||
| <varname>DefaultIOAccounting=</varname> in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: an explanation about how Delegate= might interfere with it.
|
We are working on a bpf program that firewall at the cgroup level and it would be nice if it could work well with systemd units. At the moment, we have the following race: our bpf program can only be installed after systemd has created the cgroup directory for the unit and if it is not fast enough, the application might already have sent/received network packets. The bpf filter needs to be installed between the So for my use-case, So an external program could prepare a bpf program pinned on the bpf filesystem before the unit is started. I would like to avoid Also, I would like to avoid using a global helper with With my solution, I don't need to have much BPF code in systemd: merely a couple of syscalls to retrieve the bpf program from the bpf filesystem and to install it on the cgroup. I would not need systemd to have any BPF assembly or to build a BPF program itself (but I understand it could be useful for other people's use cases). /cc @iaguis @dongsupark @nhlfr |
makes sense to me. That said I think the common "/sys/fs/bpf" prefix should not be mentioned: we should use a relative path here, in order not to create the illusion any path outside of /sys/fs/bpf could be used here... And should we decide one day, that we want to reuse this setting for ELF filters, than that still be compatible conceptually, because for that we could then use absolute paths still. |
|
Hmm, thinking more about that, maybe we should actually not permit configuration of any path at all actually, but just make these new options booleans and simply derive the path automatically from the unit name. i.e. if you want to install your own filter, then just name it properly in bpffs and set the bool and systemd would pick it up? |
We really should add a property that permits per-unit selection of what to delegate btw: "DelegateConcept=io bpf cpu" or something like that that permits listeing cgroup controllers but also bpf... |
|
(anyway, the ability to enforce arbitrary foreign ebpf networking filters should probably be discussed in a separate issue and/or PR, let's leave this one for the IP accounting/filtering the suggested way) |
I was thinking of the special kind of borked, where the service (e.g. systemd-timesyncd) suddenly starts trying to contact its new masters in botnet control network. I don't think that this could be blocked with system wide rules (or maybe SELinux could do it), if connecting to HTTPS ports is legitimate at system level. An application level firewall can be made much tighter than system wide rules can. |
…s_family() only once
…group first Let's make sure that a socket unit's IPAddressAllow=/IPAddressDeny= settings are in effect on all socket fds associated with it. In order to make this happen we need to make sure the cgroup the fds are associated with are the socket unit's cgroup. The only way to do that is invoking socket()+accept() in them. Since we really don't want to migrate PID 1 around we do this by forking off a helper process, which invokes socket()/accept() and sends the newly created fd to PID 1. Ugly, but works, and there's apparently no better way right now. This generalizes forking off per-unit helper processes in a new function unit_fork_helper_process(), which is then also used by the NSS chown() code of socket units.
Make sure the current IP accounting counters aren't lost during reload/reexec. Note that we destroy all BPF file objects during a reload: the BPF programs, the access and the accounting maps. The former two need to be regenerated anyway with the newly loaded configuration data, but the latter one needs to survive reloads/reexec. In this implementation I opted to only save/restore the accounting map content instead of the map itself. While this opens a (theoretic) window where IP traffic is still accounted to the old map after we read it out, and we thus miss a few bytes this has the benefit that we can alter the map layout between versions should the need arise.
With this change we'll invalidate all cgroup settings after coming back from a daemon reload/reexec, so that the new settings are instantly applied. This is useful for the BPF case, because we don't serialize/deserialize the BPF program fd, and hence have to install a new, updated BPF program when coming back from the reload/reexec. However, this is also useful for the rest of the cgroup settings, as it ensures that user configuration really takes effect wherever we can.
We used to be a bit sloppy on this, and handed out accounting data even for units where accounting wasn't explicitly enabled. Let's be stricter here, so that we know the accounting data is actually fully valid. This is necessary, as the accounting data is no longer stored exclusively in cgroupfs, but is partly maintained external of that, and flushed during unit starts. We should hence only expose accounting data we really know is fully current.
…in ENVIRONMENT.md
…OB_RESULT= So, currently, some of the structured log messages we generated based on jobs carry the result in RESULT=, and others in JOB_RESULT=. Let's streamline this, as stick to JOB_RESULT= in one place. This is kind of an API break, but given that currently most software has to check both fields anyway, I think we can get away with it. Why unify on JOB_RESULT= rather than RESULT=? Well, we manage different types of result codes in systemd. Most importanlty besides job results there are also service results, and we should be explicit in what we mean here.
This adds IOVEC_INIT() and IOVEC_MAKE() for initializing iovec structures from a pointer and a size. On top of these IOVEC_INIT_STRING() and IOVEC_MAKE_STRING() are added which take a string and automatically determine the size of the string using strlen(). This patch removes the old IOVEC_SET_STRING() macro, given that IOVEC_MAKE_STRING() is now useful for similar purposes. Note that the old IOVEC_SET_STRING() invocations were two characters shorter than the new ones using IOVEC_MAKE_STRING(), but I think the new syntax is more readable and more generic as it simply resolves to a C99 literal structure initialization. Moreover, we can use very similar syntax now for initializing strings and pointer+size iovec entries. We canalso use the new macros to initialize function parameters on-the-fly or array definitions. And given that we shouldn't have so many ways to do the same stuff, let's just settle on the new macros. (This also converts some code to use _cleanup_ where dynamically allocated strings were using IOVEC_SET_STRING() before, to modernize things a bit)
keyring material should not leak into the container. So far we relied on seccomp to deny access to the keyring, but given that we now made the seccomp configurable, and access to keyctl() and friends may optionally be permitted to containers now let's make sure we disconnect the callers keyring from the keyring of PID 1 in the container.
…ournal This adds a new recognizable log message for each unit invocation that contains structured information about consumed resources of the unit as a whole after it terminated. This is particular useful for apps that want to figure out what the resource consumption of a unit given a specific invocation ID was. The log message is only generated for units that have at least one XyzAccounting= property turned on, and currently only covers IP traffic and CPU time metrics.
In times of seccomp it might very well fail, and given that we return failures from this function anyway, let's also propagate getrlimit() failures, just to be safe.
On current kernels BPF_MAP_TYPE_LPM_TRIE bpf maps are charged against RLIMIT_MEMLOCK even for privileged users that have CAP_IPC_LOCK. Given that mlock() generally ignores RLIMIT_MEMLOCK if CAP_IPC_LOCK is set this appears to be an oversight in the kernel. Either way, until that's fixed, let's just bump RLIMIT_MEMLOCK for the root user considerably, as the default is quite limiting, and doesn't permit us to create more than a few TRIE maps.
The <!-- --> comment lines resulted in double newlines in the man page header, which looks quite ugly. Let's rearrange a bit so that these comments don't result in changes in the output.
…gate=yes is set Let's permit installing BPF programs in cgroup subtrees if Delegeate=yes. Let's not document this precise behaviour for now though, as most likely the logic here should become recursive, but that's only going to happen if the kernel starts supporting that. Until then, support this in a non-recursive fashion.
65864f5 to
22c8321
Compare
| if (!a) | ||
| return log_oom(); | ||
|
|
||
| if (streq(word, "any")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like some of this code should be refactored and integrated with bus_append_unit_property_assignment, but let's leave that for later.
| if (r < 0) | ||
| return r; | ||
|
|
||
| if (!IN_SET(family, AF_INET, AF_INET6)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: IPAddressDeny= also.
| if (r < 0) | ||
| return r; | ||
|
|
||
| if (prefixlen > FAMILY_ADDRESS_SIZE(family)*8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: prefixlen should be included in the message.
| assert(u); | ||
| assert(ret); | ||
|
|
||
| cc = unit_get_cgroup_context(u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: add unit_has_tasks_accounting() or so.
| } | ||
|
|
||
| if (sendmsg(journal_fd, &mh, MSG_NOSIGNAL) >= 0) | ||
| return -errno; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, there seems to be some confusion about the return code here. Should be -error I think.
As suggested by @keszybz in the review of systemd#6764
As suggested by @keszybz in the review of systemd#6764
As suggested by @keszybz in the review of systemd#6764 (cherry picked from commit 4fe66c8)
This adds simple eBPF-based based per-unit IP access lists and accounting. It makes use of the ebpf/cgroup feature of the unified cgroup hierarchy in new kernels.
This is mostly a heavily extended version of @zonque's original userspace work, updated to the kernel APIs ultimately merged into the Linux kernel.
Here's the gist how you use it. Let's start an in interactive shell as transient unit, with IP accounting enabled:
Now, let's have a look at the unit we just started, from inside of it:
Note the new status lines
IP:, showing that so far no traffic was generated. Let's change that, by generating some traffic from inside the unit:As expected, the
IP:status line shows we generated some traffic. Now, let's see how the access list stuff works. Let's start from scratch:This shell now can access all IP servers, except for 127.0.0.2 (which is on the local loopback). Let's try:
As expected, traffic to 127.0.0.1 and 127.0.0.3 still works, but traffic to 127.0.0.2 doesn't! Yay!