Add RestrictFileSystems= property using LSM BPF #18145

iaguis · 2021-01-06T12:52:20Z

This PR adds the RestrictFileSystems= property. When used, processes
belonging to a service are only able to access the filesystems listed in the
property.

This is implemented by attaching a BPF program to the file_open BPF LSM hook.
The program is attached at boot time and stays there forever. Then, when a
service specifying the RestrictFileSystems= property is started, an entry is
added to a global hash of maps BPF map. The map stores a set of filesystem
magic numbers per cgroupID. When a process tries to open a file, the BPF
program is executed and checks the cgroup the process is running in. If an
entry for that cgroup is present in the global map, it checks whether the
filesystem the process is trying to access is in the set; if not, it denies
access.

RestrictFileSystems= is only supported on systems with the LSM BPF hook
enabled and using cgroup2 (unified or hybrid).

This PR makes use of the libbpf framework introduced in #17655. Like that PR,
it requires clang, llvm and libbpf at compile time, and the
libbpf shared library is optional at runtime (it is loaded with dlopen).

Thanks to the usage of libbpf, the program can use the CO-RE (Compile-Once
Run-Everywhere) technology so it doesn't require kernel headers at runtime to
access internal kernel structures.

boucman

I'm not a systemd dev, so take all my remarks as suggestions.

boucman · 2021-01-06T14:29:13Z

man/systemd.exec.xml

+
+        <listitem><para>Restricts the set of filesystems processes of this unit can open files on. Takes a space-separated
+        list of filesystem names. Any filesystem listed is made accessible to the unit's processes, access to filesystem
+        types not listed is prohibited (allow-listing). If the empty string is assigned, access to filesystems is not


would it be possible to add the negation syntax?
The first use case I can think of is denying access to fat/vfat to some processes and I might not know the kind of filesystem the rootfs itself is on.

Alternatively, some sort of AllowFilesystemsForDirectory= where we give a path and it allows whatever filesystem that path is on could be useful.

would it be possible to add the negation syntax?

The current BPF implementation doesn't support this but it shouldn't be too difficult to support. I agree this would be nice.

Alternatively, some sort of AllowFilesystemsForDirectory= where we give a path and it allows whatever filesystem that path is on could be useful.

Not sure this is too useful since it will only support one filesystem. The negation syntax is probably more flexible and should cover the same use case, although the user would have to specify the filesystem type. We could also make it a list of directories but it feels weird to me.

boucman · 2021-01-06T14:29:53Z

man/org.freedesktop.systemd1.xml

    <!--property RestrictNamespaces is not documented!-->

+    <!--property RestrictFileSystems is not documented!-->
+


Please add documentation for the dbus API too

I know it's new and everything, but let's try to take the good habits from the start

Didn't realize this was new, I thought not documenting it was a conscious decision. Will do.

boucman · 2021-01-06T14:30:56Z

man/systemd.exec.xml

+
+        <para>Note that this setting might not be supported on some systems (for example if the LSM eBPF hook is
+        not enabled in the underlying kernel or if not using the unified control group hierarchy). This setting
+        will fail in that case.</para>


Fail as in "deny access" or as in "not implementing the security"?
The general systemd philosophy is that if a feature is not available, things should just work, so we probably want the latter.

In this case I meant fail as in "the service won't start and will print an error". I chose this because this is a security feature and I didn't want to give a false sense of security. However you're right that in general things just work when they're not supported, I'm on the fence here.

IIRC all the seccomp directives and SELinux directives are also best effort - if you build without seccomp/selinux support, they are simply ignored. I think the same pattern should be used - doesn't make much sense to diverge

yes... the other reason to do that is so that the same .service can be used on any distribution/setup. i.e. distributions can use upstream provided services without having to worry what is or isn't supported in their particular kernel

boucman · 2021-01-06T14:32:29Z

src/basic/stat-util.c

+        else if (streq(name, "zonefs"))
+                *ret = ZONEFS_MAGIC;
+        else if (streq(name, "zsmalloc"))
+                *ret = ZSMALLOC_MAGIC;


How would a user find that list of magic names? Is that the same one as the one the "mount" command uses?
How to make that future-proof wrt future filesystems or out-of-tree filesystems? Is there a better way to do that than an if/else if list?

This is a good point, I've used the names in the file_system_type.name struct in the kernel, I assume they're the same ones the "mount" command uses but I'm not sure.

I don't have good ideas on how to make it future proof, I don't think there's any API to get the magic number for a name so it'd have to be something manual.

we have systemd-analyze capability and systemd-analyze syscall-filter to dump caps and syscalls systemd knows and compare them with what the kernel knows. it might make sense to add a similar verb that lists this table, plus adds warning lines for file systems that are listed in /proc/filesystems but we don't know of.

poettering

too late today for a full review, but some early comments

poettering · 2021-01-06T19:25:27Z

src/basic/cgroup-util.c

+        assert(path);
+        assert(ret);
+
+        h = malloc0(offsetof(struct file_handle, f_handle) + sizeof(uint64_t));


this is fixed size and small afaics, allocate on stack plz
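
A minimal sketch of what the stack-allocated variant could look like, assuming the handle payload for a cgroup2 path is exactly one 64-bit value (the cgroup ID), as the malloc0() size in the diff suggests. The union type and the function name `cgroup_path_to_id()` are hypothetical and not taken from the PR.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for name_to_handle_at() and struct file_handle */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: a struct file_handle padded with enough room for
 * one 64-bit payload, so that no heap allocation is needed. */
union cgroupid_file_handle {
        struct file_handle handle;
        uint8_t space[offsetof(struct file_handle, f_handle) + sizeof(uint64_t)];
};

/* Hypothetical name. Resolves a cgroupfs path to its 64-bit cgroup ID.
 * Returns 0 on success or a negative errno-style value on failure. */
static int cgroup_path_to_id(const char *path, uint64_t *ret) {
        union cgroupid_file_handle fh = { .handle.handle_bytes = sizeof(uint64_t) };
        int mnt_id = -1;

        assert(path);
        assert(ret);

        if (name_to_handle_at(AT_FDCWD, path, &fh.handle, &mnt_id, 0) < 0)
                return -errno;

        /* Assumption: on cgroup2 the handle payload is the cgroup ID. */
        memcpy(ret, fh.handle.f_handle, sizeof(uint64_t));
        return 0;
}
```

Because the size is known up front, there is no need to retry with a larger buffer on EOVERFLOW the way generic name_to_handle_at() callers usually do.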

poettering · 2021-01-06T19:33:52Z

src/basic/stat-util.c

+        assert(name);
+        assert(ret);
+
+        if (streq(name, "apparmorfs"))


can't we generate this automatically from linux/magic.h (plus a few manual additions)? or at least have some tool that validates on build that everything in linux/magic.h is also listed here?

also, please use gperf for this, see errno-from-gperf.h and such

I'll have a look.
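
To illustrate the table-driven direction suggested here, the following sketch replaces the if/else chain with a sorted table and bsearch(). This is a simpler stand-in for a gperf-generated perfect hash, not what gperf would emit. The function name `fs_type_from_string_sketch()` and the four-entry table are illustrative assumptions; the magic values are the ones defined in linux/magic.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct fs_entry {
        const char *name;
        uint32_t magic;
};

/* Must stay sorted by name, since bsearch() relies on the ordering. */
static const struct fs_entry fs_table[] = {
        { "btrfs",   0x9123683E },
        { "debugfs", 0x64626720 },
        { "ext4",    0xEF53     },
        { "tmpfs",   0x01021994 },
};

static int fs_entry_compare(const void *key, const void *elem) {
        return strcmp(key, ((const struct fs_entry *) elem)->name);
}

/* Returns 0 and stores the magic number in *ret, or -EINVAL if the name is unknown. */
static int fs_type_from_string_sketch(const char *name, uint32_t *ret) {
        const struct fs_entry *e;

        assert(name);
        assert(ret);

        e = bsearch(name, fs_table, sizeof(fs_table) / sizeof(fs_table[0]),
                    sizeof(fs_table[0]), fs_entry_compare);
        if (!e)
                return -EINVAL;

        *ret = e->magic;
        return 0;
}
```

Keeping the data in one table also makes it straightforward for a build-time script to check that every magic number in linux/magic.h has an entry.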

poettering · 2021-01-11T20:31:24Z

Note the BPF LSM hook is very new and is not enabled in current distro kernels AFAIK.

Hmm, pity. Is this tracked anywhere? Did anyone ever file a bug against let's say Fedora asking for this to be added yet?

poettering · 2021-01-11T20:33:29Z

i finally reviewed @wat-ze-hex's PR #17655 the other day, the commits stemming from there i'll not re-review here.

poettering

great work!

would love @wat-ze-hex input/review on this one!

poettering · 2021-01-11T20:34:33Z

src/shared/bpf-object.c

+        _prog = bpf_object__find_program_by_title(object, title);
+        if (!_prog) {
+                return -EINVAL;
+        }


our usual coding style is not to place {} around single-line if blocks

poettering · 2021-01-11T20:35:24Z

src/shared/bpf-object.c

+        _cleanup_free_ struct bpf_program *_prog = NULL;
+
+        assert(ret_prog);
+        assert(*ret_prog == NULL);


our usual coding style assumes output parameters can be in undefined state, i.e. we don't validate output params on input, i.e. please drop this line

poettering · 2021-01-11T20:38:25Z

src/basic/stat-util.c

+        else if (streq(name, "zonefs"))
+                *ret = ZONEFS_MAGIC;
+        else if (streq(name, "zsmalloc"))
+                *ret = ZSMALLOC_MAGIC;


we have systemd-analyze capability and systemd-analyze syscall-filter to dump caps and syscalls systemd knows and compare them with what the kernel knows. it might make sense to add a similar verb that lists this table, plus adds warning lines for file systems that are listed in /proc/filesystems but we don't know of.

poettering · 2021-01-11T20:40:01Z

src/basic/stat-util.c

        return 0;
 }
+
+int fs_type_from_string(const char *name, uint32_t *ret) {


uint32_t → statfs_f_type_t I figure?

the type exposed by statfs() is signed on some archs, and unsigned on other archs. but some defs use the MSB. it's a total mess.
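
The signedness problem can be shown with a small sketch. Some magic numbers, such as the btrfs one, have the most significant bit set, so a plain comparison fails once the value has passed through a signed 32-bit field. The helper `fs_magic_equal()` is a hypothetical illustration, not systemd's own macro or type, and it assumes that only the low 32 bits of a magic number are significant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compare two filesystem magic numbers by their low 32 bits only, so that
 * the result does not depend on whether either side was sign-extended. */
static bool fs_magic_equal(long long a, long long b) {
        return (uint32_t) a == (uint32_t) b;
}
```

With this, a magic number read from a signed f_type field and the unsigned constant from linux/magic.h compare as equal even though their integer values differ.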

poettering · 2021-01-11T20:41:15Z

src/core/bpf/restrict_fs/restrict-fs.c

@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */


as in the other PR, maybe we should name these .c files *.restricted.c or so? or *.bpf.c or so? i.e. indicate in the name that this is weird, special C code...

poettering · 2021-01-11T21:18:48Z

src/test/test-bpf-lsm.c

+                        return log_unit_error_errno(u, r, "Failed to parse RestrictFileSystems: %m");
+        }
+
+        exec_start = strjoin("cat ", file_path);


assert_se() around this please, to catch oom conditions

poettering · 2021-01-11T21:19:43Z

src/test/test-bpf-lsm.c

+
+        cld_code = SERVICE(u)->exec_command[SERVICE_EXEC_START]->exec_status.code;
+        if (cld_code != CLD_EXITED) {
+                r = -SYNTHETIC_ERRNO(EBUSY);


SYNTHETIC_ERRNO() exists purely to pass some extra info to log_error_errno(), it should only be used around the first arg of log_error_errno() (and related calls), but never really assigned to anything else. please move this into the first arg of the function call hence.

poettering · 2021-01-11T21:19:49Z

src/test/test-bpf-lsm.c

+        }
+
+        if (SERVICE(u)->state != SERVICE_DEAD) {
+                r = -SYNTHETIC_ERRNO(EBUSY);


poettering · 2021-01-11T21:19:59Z

src/test/test-bpf-lsm.c

+
+        assert_se(getrlimit(RLIMIT_MEMLOCK, &rl) >= 0);
+        rl.rlim_cur = rl.rlim_max = MAX(rl.rlim_max, CAN_MEMLOCK_SIZE);
+        (void) setrlimit(RLIMIT_MEMLOCK, &rl);


setrlimit_closest()
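
For readers unfamiliar with the helper named here, the idea is to fall back to the highest permitted value when the requested limit cannot be set. The sketch below is a reading of that idea under stated assumptions, not a copy of systemd's implementation; the name `setrlimit_closest_sketch()` and the small `rlim_min()` helper are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

static rlim_t rlim_min(rlim_t a, rlim_t b) {
        return a < b ? a : b;
}

/* Try to set the requested limit. If that fails with EPERM, clamp the request
 * to the current hard limit and try again. Returns 0 or a negative errno. */
static int setrlimit_closest_sketch(int resource, const struct rlimit *rlim) {
        struct rlimit highest, fixed;

        assert(rlim);

        if (setrlimit(resource, rlim) >= 0)
                return 0;
        if (errno != EPERM)
                return -errno;

        /* Not allowed to raise the limit that far: find out how far we may go. */
        if (getrlimit(resource, &highest) < 0)
                return -errno;

        fixed.rlim_cur = rlim_min(rlim->rlim_cur, highest.rlim_max);
        fixed.rlim_max = rlim_min(rlim->rlim_max, highest.rlim_max);

        if (setrlimit(resource, &fixed) < 0)
                return -errno;

        return 0;
}
```

In the test above, such a helper would take the place of the getrlimit()/MAX()/setrlimit() sequence, so an unprivileged run would still end up with the largest RLIMIT_MEMLOCK it is allowed to have.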

poettering · 2021-01-11T21:20:24Z

src/test/test-bpf-lsm.c

+        assert_se(test_restrict_filesystems(m, "restrict_filesystems_test.service", "/sys/kernel/debug/sleep_time", STRV_MAKE("btrfs", "ext4")) < 0);
+        assert_se(test_restrict_filesystems(m, "restrict_filesystems_test.service", "/sys/kernel/debug/sleep_time", STRV_MAKE("debugfs", "btrfs", "ext4")) >= 0);
+
+        return 0;


i figure most of the comments I wrote to @wat-ze-hex' test PR also apply here

poettering · 2021-01-11T21:21:46Z

also needs rebase

bluca · 2021-01-12T00:25:27Z

Note the BPF LSM hook is very new and is not enabled in current distro kernels AFAIK.

Hmm, pity. Is this tracked anywhere? Did anyone ever file a bug against let's say Fedora asking for this to be added yet?

I'll start with Debian: https://salsa.debian.org/kernel-team/linux/-/merge_requests/306
There's already a bug for Ubuntu: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1905975

mauriciovasquezbernal · 2021-01-19T00:58:32Z

src/core/load-fragment.c

+                void *userdata) {
+        _cleanup_strv_free_ char **n = NULL;
+        size_t nlen = 0, nbufsize = 0;
+        char*** restrict_fs = data;


Suggested change

char*** restrict_fs = data;

char ***restrict_fs = data;

?

mauriciovasquezbernal · 2021-01-21T19:44:19Z

man/systemd.exec.xml

+        restricted. This option may appear more than once, in which case the namespace
+        types are merged by <constant>OR</constant>.</para>


Namespace types?

mauriciovasquezbernal · 2021-02-01T13:50:09Z

src/core/bpf/restrict_fs/restrict-fs.c

+int BPF_PROG(restrict_filesystems, struct file *file, int ret)
+{
+        unsigned long magic_number;
+        unsigned long tmpfs_magic;


Unused variable, right?

lgtm-com · 2021-02-11T16:42:04Z

This pull request introduces 1 alert when merging 4e4a3a5 into 1c3c43a - view on LGTM.com

new alerts:

1 for Unused import

iaguis · 2021-02-11T17:37:39Z

I've addressed (I think) all review comments. Some high level changes:

- RestrictFileSystems= accepts either an allow list or a deny list similar to SystemCallFilter.
- I use gperf to generate the filesystem database.
- I've added some filesystem groups as suggested and used those in existing functions.
- I've added a systemd-analyze filesystems command to discover those groups.
- I don't generate the filesystem list automatically by parsing magic.h because filesystem names are kinda arbitrary, but I've added a script that checks all filesystems defined in magic.h are present in the database.

Also, I was wondering if we should split @wat-ze-hex's commits to a separate PR now that there are 3 PRs using it (#17655, #18385, and this one). It might make things easier to merge.

iaguis · 2021-10-05T15:34:35Z

Fixed those two nits :)

poettering · 2021-10-06T08:41:52Z

was about to merge, but needs a rebase i fear.

It hooks into the file_open LSM hook and allows the open only when the filesystem where the open will take place is present in a BPF map for a particular cgroup.

The BPF map used is a hash of maps with the following structure:

    cgroupID -> (s_magic -> uint32)

The inner map is effectively a set. The entry at key 0 in the inner map encodes whether the program behaves as an allow list or a deny list: if its value is 0 it is a deny list, otherwise it is an allow list.

When the cgroupID is present in the map, the program checks the inner map for the magic number of the filesystem associated with the file that's being opened. When the program behaves as an allow list, it allows the open to succeed if that magic number is present; when it behaves as a deny list, it allows access only if that magic number is NOT present. When access is denied the program returns -EPERM.

The BPF program uses CO-RE (Compile-Once Run-Everywhere) to access internal kernel structures without needing kernel headers present at runtime.
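
The decision rule in this commit message can be sketched in plain userspace C. This is a model of the logic only, not the BPF program from the PR: `struct fs_set` stands in for the inner map, a NULL pointer stands in for a cgroupID with no entry in the outer map, and all names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the inner BPF map: a set of magic numbers plus
 * the allow/deny flag that the real map stores at key 0. */
struct fs_set {
        bool allow_list;        /* non-zero value at key 0 means allow list */
        const uint32_t *magics; /* filesystem magic numbers in the set */
        size_t n_magics;
};

static bool fs_set_contains(const struct fs_set *s, uint32_t magic) {
        assert(s);

        for (size_t i = 0; i < s->n_magics; i++)
                if (s->magics[i] == magic)
                        return true;

        return false;
}

/* Returns 0 to allow the open and -EPERM to deny it. A NULL set models a
 * cgroup that has no entry in the outer map, i.e. an unrestricted service. */
static int restrict_fs_decide(const struct fs_set *s, uint32_t magic) {
        bool present;

        if (!s)
                return 0;

        present = fs_set_contains(s, magic);

        if (s->allow_list)
                return present ? 0 : -EPERM;

        return present ? -EPERM : 0;
}
```

For example, with a set containing only the ext4 magic number, an open on vfat is denied in allow-list mode and permitted in deny-list mode.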

This adds 6 functions to implement RestrictFileSystems=

* lsm_bpf_supported() checks if LSM BPF is supported. It checks that cgroupv2 is used, that BPF LSM is enabled, and tries to load the BPF LSM program which makes sure BTF and hash of maps are supported, and BPF LSM programs can be loaded.
* lsm_bpf_setup() loads and attaches the LSM BPF program.
* lsm_bpf_unit_restrict_filesystems() populates the hash of maps BPF map with the cgroupID and the set of allowed or denied filesystems.
* lsm_bpf_cleanup() removes a cgroupID entry from the hash of maps.
* lsm_bpf_map_restrict_fs_fd() is a helper function to get the file descriptor of the BPF map.
* lsm_bpf_destroy() is a wrapper around the destroy function of the BPF skeleton file.

It attaches the LSM BPF program when the system manager starts up. It populates the hash of maps BPF map when services that have RestrictFileSystems= set start. It cleans up the hash of maps when the unit cgroup is pruned. To pass the file descriptor of the BPF map we add it to the keep_fds array.

poettering · 2021-10-06T10:07:44Z

lgtm, thanks!

poettering · 2021-10-06T14:23:21Z

CI failures appear unrelated. Merging! Thanks!

github-actions bot added the mkosi label Jan 6, 2021

iaguis mentioned this pull request Jan 6, 2021

Introduce SocketBind{Allow|Deny}= properties powered by source compiled BPF #17655

Merged

boucman reviewed Jan 6, 2021

iaguis force-pushed the iaguis/lsm-bpf branch 2 times, most recently from 6a2525d to abba9c5 Compare January 6, 2021 16:29

poettering reviewed Jan 6, 2021

poettering added bpf pid1 labels Jan 9, 2021

poettering requested changes Jan 11, 2021

poettering added needs-rebase reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Jan 11, 2021

mauriciovasquezbernal reviewed Jan 19, 2021

Base automatically changed from master to main January 21, 2021 11:55

mauriciovasquezbernal reviewed Jan 21, 2021

mauriciovasquezbernal reviewed Feb 1, 2021

iaguis force-pushed the iaguis/lsm-bpf branch from abba9c5 to a907bcd Compare February 11, 2021 16:07

github-actions bot added the systemctl label Feb 11, 2021

iaguis force-pushed the iaguis/lsm-bpf branch 3 times, most recently from 3f08846 to 4e4a3a5 Compare February 11, 2021 16:32

iaguis force-pushed the iaguis/lsm-bpf branch 3 times, most recently from 8d7f0b7 to 4839871 Compare February 11, 2021 17:37

iaguis requested a review from poettering February 11, 2021 17:39

poettering added needs-rebase and removed good-to-merge/with-minor-suggestions labels Oct 6, 2021

iaguis added 19 commits October 6, 2021 10:48

basic: move CIFS magic number to missing_magic.h

2ac5f90

It fits better there.

missing_magic: add several filesystems

3ef4e91

They were failing on CI.

basic: add filesystem database

1315ce3

Stores filesystem_name -> magic_number(s).

basic: use filesystem database

659d192

cgroup-util: add cg_path_get_cgroupid()

535e3dd

It returns the cgroupID from a cgroup path.

exit-status: add EXIT_BPF

d13b60d

It will be used later.

shared/bpf-dlopen: expose more libbpf functions

510cdbe

They're needed for the LSM BPF feature.

core: add RestrictFileSystems= fragment parser

e59ccd0

It takes an allow or deny list of filesystems services should have access to.

core: add dbus RestrictFileSystems= properties

cc86a27

mkosi: add libbpf dependency

af11239

For distros that ship libbpf >=0.2.0.

man: add RestrictFileSystems= documentation

a6826f6

man: document EXIT_BPF status

d6d6f55

test: add test-bpf-lsm

8216741

README: document LSM BPF requirements

ec31dd5

analyze: add filesystems command

b41711c

man: document systemd-analyze filesystems

2008062

iaguis force-pushed the iaguis/lsm-bpf branch from 7003397 to 2008062 Compare October 6, 2021 08:52

bluca added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed needs-rebase labels Oct 6, 2021

poettering merged commit 9a1ddc8 into systemd:main Oct 6, 2021

iaguis deleted the iaguis/lsm-bpf branch October 6, 2021 14:52
