Skip to content

Conversation

@iaguis
Copy link
Member

@iaguis iaguis commented Jan 6, 2021

This PR adds the RestrictFileSystems= property. When used, processes
belonging to a service are only able to access the filesystems listed in the
property.

This is implemented by attaching a BPF program to the file_open BPF LSM hook.
The program is attached at boot time and stays there forever. Then, when a
service specifying the RestrictFileSystems= property is started, an entry is
added to a global hash of maps BPF map. The map stores a set of filesystem
magic numbers per cgroupID. When a process tries to open a file, the BPF
program is executed and checks the cgroup the process is running in: if an
entry is present in the global map it checks if the filesystem the process is
trying to access is present in the set, if not, it denies access to it.

RestrictFileSystems= is only supported on systems with the LSM BPF hook
enabled and using cgroup2 (unified or hybrid).

This PR makes use of the libbpf framework introduced in #17655. Same as that PR,
it requires clang, llvm and libbpf at compile time, and the
libbpf shared library is optional at runtime (uses dlopen).

Thanks to the usage of libbpf, the program can use the CO-RE (Compile-Once
Run-Everywhere) technology so it doesn't require kernel headers at runtime to
access internal kernel structures.

Copy link
Contributor

@boucman boucman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a systemd devs, so take all my remarks as suggestions..


<listitem><para>Restricts the set of filesystems processes of this unit can open files on. Takes a space-separated
list of filesystem names. Any filesystem listed is made accessible to the unit's processes, access to filesystem
types not listed is prohibited (allow-listing). If the empty string is assigned, access to filesystems is not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to add the negation syntax ?
The first use case I can think of is denying access to fat/vfat to some processes and I might not know the kind of filesystem the rootfs itself is on..

Alternatlively some some sort of AllowFilesystemsForDirectory= where we give a path and it allows whatever filesystem that path is on could be usefull

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to add the negation syntax ?

The current BPF implementation doesn't support this but it shouldn't be too difficult to support. I agree this would be nice.

Alternatlively some some sort of AllowFilesystemsForDirectory= where we give a path and it allows whatever filesystem that path is on could be usefull

Not sure this is too useful since it will only support one filesystem, the negation syntax is probably more flexible and should cover the same use case although the user would have to specify the filesystem type. We could also make it a list of directories but it feels weird to me.

<!--property RestrictNamespaces is not documented!-->

<!--property RestrictFileSystems is not documented!-->

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add documentation for the dbus API too

I know it's new and everything, but let's try to take the good habits from the start

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize this was new, I thought not documenting it was a conscious decision. Will do.


<para>Note that this setting might not be supported on some systems (for example if the LSM eBPF hook is
not enabled in the underlying kernel or if not using the unified control group hierarchy). This setting
will fail in that case.</para>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fail as in "deny access" or as in "not implementing the security"
The general systemd philosophy is that if a feature is not available, things should just work. so we probably want the latter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case I meant fail as in "the service won't start and will print an error". I chose this because this is a security feature and I didn't want to give a false sense of security. However you're right that in general things just work when they're not supported, I'm on the fence here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC all the seccomp directives and seccomp directives are also best effort - if you build without seccomp/selinux support, they are simply ignored. I think the same pattern should be used - doesn't make much sense to diverge

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes... the other reason to do that is so that the same .service can be used on any distribution/setup. i.e distributions can use upstream provided services without haing to worry what is or isn't supported in their particular kernel

else if (streq(name, "zonefs"))
*ret = ZONEFS_MAGIC;
else if (streq(name, "zsmalloc"))
*ret = ZSMALLOC_MAGIC;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would a user find that list of magic names ? is that the same one as the one the "mount" comand use ?
how to make that future-proof wrt futre filesystems or out-of-tree filesystems ? is there a better way to do that if/elsif list ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, I've used the names in the file_system_type.name struct in the kernel, I assume they're the same ones the "mount" command uses but I'm not sure.

I don't have good ideas on how to make it future proof, I don't think there's any API to get the magic number for a name so it'd have to be something manual.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have systemd-analyze capability and systemd-analyze syscall-filter to dump caps and syscalls systemd knows and compare them with what the kernel knows. it might make sense to add a similar verb that lists this table, plus adds warning lines for file systems that are listed in /proc/filesystems but we don't know of.

@iaguis iaguis force-pushed the iaguis/lsm-bpf branch 2 times, most recently from 6a2525d to abba9c5 Compare January 6, 2021 16:29
Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too late today or a full review, but some early comments

assert(path);
assert(ret);

h = malloc0(offsetof(struct file_handle, f_handle) + sizeof(uint64_t));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fixed size and small afaics, allocate on stack plz

assert(name);
assert(ret);

if (streq(name, "apparmorfs"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we generate this automatically from linux/magic.h (plus a few manual additions)? or at least have some tool that validates on build that everything in linux/magic.h is also listed here?

also, please use gperf for this, see errno-from-gperf.h and such

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look.

@poettering
Copy link
Member

Note the BPF LSM hook is very new and is not enabled in current distro kernels AFAIK.

Hmm, pity. Is this tracked anywhere? Did anyone ever file a bug against let's say Fedora asking for this to be added yet?

@poettering
Copy link
Member

i finally reviewed @wat-ze-hex's PR #16655 the other day, the commits stemming from there i'll not re-review here.

Copy link
Member

@poettering poettering left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great work!

would love @wat-ze-hex input/review on this one!

_prog = bpf_object__find_program_by_title(object, title);
if (!_prog) {
return -EINVAL;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our usual coding style is not to place {} around single-line if blocks

_cleanup_free_ struct bpf_program *_prog = NULL;

assert(ret_prog);
assert(*ret_prog == NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our usual coding style assumes output parameters can be in undefined state, i.e. we don't vlidate output params on input, i.e. please drop this line

else if (streq(name, "zonefs"))
*ret = ZONEFS_MAGIC;
else if (streq(name, "zsmalloc"))
*ret = ZSMALLOC_MAGIC;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have systemd-analyze capability and systemd-analyze syscall-filter to dump caps and syscalls systemd knows and compare them with what the kernel knows. it might make sense to add a similar verb that lists this table, plus adds warning lines for file systems that are listed in /proc/filesystems but we don't know of.

return 0;
}

int fs_type_from_string(const char *name, uint32_t *ret) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint32_tstatfs_f_type_t I figure?

the type exposed by statfs() is signed on some archs, and unsigned on other archs. but some defs use the MSB. it's a total mess.

@@ -0,0 +1,59 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as in the the other PR, maybe we should name these .c files *.restricted.c or so? or *.bpf.c or so? i.e. indicate in the name that this is weird, special C code...

return log_unit_error_errno(u, r, "Failed to parse RestrictFileSystems: %m");
}

exec_start = strjoin("cat ", file_path);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_se() around this please, to catch oom conditions


cld_code = SERVICE(u)->exec_command[SERVICE_EXEC_START]->exec_status.code;
if (cld_code != CLD_EXITED) {
r = -SYNTHETIC_ERRNO(EBUSY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYNTHETIC_ERRNO() exists purely to pass some extra info to log_error_errno(), it should only be used around the first arg of log_error_errno() (and related calls), but never really assigned to anything else. please move this into the first arg of the function call hence.

}

if (SERVICE(u)->state != SERVICE_DEAD) {
r = -SYNTHETIC_ERRNO(EBUSY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto


assert_se(getrlimit(RLIMIT_MEMLOCK, &rl) >= 0);
rl.rlim_cur = rl.rlim_max = MAX(rl.rlim_max, CAN_MEMLOCK_SIZE);
(void) setrlimit(RLIMIT_MEMLOCK, &rl);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setrlimit_closest()

assert_se(test_restrict_filesystems(m, "restrict_filesystems_test.service", "/sys/kernel/debug/sleep_time", STRV_MAKE("btrfs", "ext4")) < 0);
assert_se(test_restrict_filesystems(m, "restrict_filesystems_test.service", "/sys/kernel/debug/sleep_time", STRV_MAKE("debugfs", "btrfs", "ext4")) >= 0);

return 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i figure most of the comments I wrote to @wat-ze-hex' test PR also apply here

@poettering
Copy link
Member

also needs rebase

@poettering poettering added needs-rebase reviewed/needs-rework 🔨 PR has been reviewed and needs another round of reworks labels Jan 11, 2021
@bluca
Copy link
Member

bluca commented Jan 12, 2021

Note the BPF LSM hook is very new and is not enabled in current distro kernels AFAIK.

Hmm, pity. Is this tracked anywhere? Did anyone ever file a bug against let's say Fedora asking for this to be added yet?

I'll start with Debian: https://salsa.debian.org/kernel-team/linux/-/merge_requests/306
There's already a bug for Ubuntu: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1905975

void *userdata) {
_cleanup_strv_free_ char **n = NULL;
size_t nlen = 0, nbufsize = 0;
char*** restrict_fs = data;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
char*** restrict_fs = data;
char ***restrict_fs = data;

?

Base automatically changed from master to main January 21, 2021 11:55
Comment on lines 1708 to 1709
restricted. This option may appear more than once, in which case the namespace
types are merged by <constant>OR</constant>.</para>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace types?

int BPF_PROG(restrict_filesystems, struct file *file, int ret)
{
unsigned long magic_number;
unsigned long tmpfs_magic;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable, right?

@iaguis iaguis force-pushed the iaguis/lsm-bpf branch 3 times, most recently from 3f08846 to 4e4a3a5 Compare February 11, 2021 16:32
@lgtm-com
Copy link

lgtm-com bot commented Feb 11, 2021

This pull request introduces 1 alert when merging 4e4a3a5 into 1c3c43a - view on LGTM.com

new alerts:

  • 1 for Unused import

@iaguis iaguis force-pushed the iaguis/lsm-bpf branch 3 times, most recently from 8d7f0b7 to 4839871 Compare February 11, 2021 17:37
@iaguis
Copy link
Member Author

iaguis commented Feb 11, 2021

I've addressed (I think) all review comments. Some high level changes:

  • RestrictFileSystems= accepts either an allow list or a deny list similar to SystemCallFilter.
  • I use gperf to generate the filesystem database.
  • I've added some filesystem groups as suggested and used those in existing functions.
    • I've added a systemd-analyze filesystems command to discover those groups.
  • I don't generate the filesystem list automatically by parsing magic.h because filesystem names are kinda arbitrary, but I've added a script that checks all filesystems defined in magic.h are present in the database.

Also, I was wondering if we should split @wat-ze-hex's commits to a separate PR now that there are 3 PRs using it (#17655, #18385, and this one). It might make things easier to merge.

@iaguis iaguis requested a review from poettering February 11, 2021 17:39
@iaguis
Copy link
Member Author

iaguis commented Oct 5, 2021

Fixed those two nits :)

@poettering
Copy link
Member

was about to merge, but needs a rebase i fear.

iaguis added 19 commits October 6, 2021 10:48
They were failing on CI.
Stores filesystem_name -> magic_number(s).
It hooks into the file_open LSM hook and allows only when the filesystem
where the open will take place is present in a BPF map for a particular
cgroup.

The BPF map used is a hash of maps with the following structure:

    cgroupID -> (s_magic -> uint32)

The inner map is effectively a set.

The entry at key 0 in the inner map encodes whether the program behaves
as an allow list or a deny list: if its value is 0 it is a deny list,
otherwise it is an allow list.

When the cgroupID is present in the map, the program checks the inner
map for the magic number of the filesystem associated with the file
that's being opened. When the program behaves as an allow list, if that
magic number is present it allows the open to succeed, when the program
behaves as a deny list, it only allows access if the that magic number
is NOT present. When access is denied the program returns -EPERM.

The BPF program uses CO-RE (Compile-Once Run-Everywhere) to access
internal kernel structures without needing kernel headers present at
runtime.
It returns the cgroupID from a cgroup path.
It will be used later.
They're needed for the LSM BPF feature.
This adds 6 functions to implement RestrictFileSystems=

* lsm_bpf_supported() checks if LSM BPF is supported. It checks that
  cgroupv2 is used, that BPF LSM is enabled, and tries to load the BPF
  LSM program which makes sure BTF and hash of maps are supported, and
  BPF LSM programs can be loaded.
* lsm_bpf_setup() loads and attaches the LSM BPF program.
* lsm_bpf_unit_restrict_filesystems() populates the hash of maps BPF map with the
  cgroupID and the set of allowed or denied filesystems.
* lsm_bpf_cleanup() removes a cgroupID entry from the hash of maps.
* lsm_bpf_map_restrict_fs_fd() is a helper function to get the file
  descriptor of the BPF map.
* lsm_bpf_destroy() is a wrapper around the destroy function of the BPF
  skeleton file.
It attaches the LSM BPF program when the system manager starts up.

It populates the hash of maps BPF map when services that have
RestrictFileSystems= set start.

It cleans up the hash of maps when the unit cgroup is pruned.

To pass the file descriptor of the BPF map we add it to the keep_fds
array.
It takes an allow or deny list of filesystems services should have
access to.
For distros that ship libbpf >=0.2.0.
@bluca bluca added good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed and removed needs-rebase labels Oct 6, 2021
@poettering
Copy link
Member

lgtm, thanks!

@poettering
Copy link
Member

CI failures appear unrelated. Merging! Thanks!

@poettering poettering merged commit 9a1ddc8 into systemd:main Oct 6, 2021
@iaguis iaguis deleted the iaguis/lsm-bpf branch October 6, 2021 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

analyze bpf good-to-merge/waiting-for-ci 👍 PR is good to merge, but CI hasn't passed at time of review. Please merge if you see CI has passed mkosi new-feature pid1 systemctl

Development

Successfully merging this pull request may close these issues.

8 participants