-
-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Add RestrictFileSystems= property using LSM BPF #18145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
boucman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a systemd devs, so take all my remarks as suggestions..
man/systemd.exec.xml
Outdated
|
|
||
| <listitem><para>Restricts the set of filesystems processes of this unit can open files on. Takes a space-separated | ||
| list of filesystem names. Any filesystem listed is made accessible to the unit's processes, access to filesystem | ||
| types not listed is prohibited (allow-listing). If the empty string is assigned, access to filesystems is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to add the negation syntax ?
The first use case I can think of is denying access to fat/vfat to some processes and I might not know the kind of filesystem the rootfs itself is on..
Alternatlively some some sort of AllowFilesystemsForDirectory= where we give a path and it allows whatever filesystem that path is on could be usefull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to add the negation syntax ?
The current BPF implementation doesn't support this but it shouldn't be too difficult to support. I agree this would be nice.
Alternatlively some some sort of AllowFilesystemsForDirectory= where we give a path and it allows whatever filesystem that path is on could be usefull
Not sure this is too useful since it will only support one filesystem, the negation syntax is probably more flexible and should cover the same use case although the user would have to specify the filesystem type. We could also make it a list of directories but it feels weird to me.
| <!--property RestrictNamespaces is not documented!--> | ||
|
|
||
| <!--property RestrictFileSystems is not documented!--> | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add documentation for the dbus API too
I know it's new and everything, but let's try to take the good habits from the start
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't realize this was new, I thought not documenting it was a conscious decision. Will do.
man/systemd.exec.xml
Outdated
|
|
||
| <para>Note that this setting might not be supported on some systems (for example if the LSM eBPF hook is | ||
| not enabled in the underlying kernel or if not using the unified control group hierarchy). This setting | ||
| will fail in that case.</para> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fail as in "deny access" or as in "not implementing the security"
The general systemd philosophy is that if a feature is not available, things should just work. so we probably want the latter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case I meant fail as in "the service won't start and will print an error". I chose this because this is a security feature and I didn't want to give a false sense of security. However you're right that in general things just work when they're not supported, I'm on the fence here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC all the seccomp directives and seccomp directives are also best effort - if you build without seccomp/selinux support, they are simply ignored. I think the same pattern should be used - doesn't make much sense to diverge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes... the other reason to do that is so that the same .service can be used on any distribution/setup. i.e distributions can use upstream provided services without haing to worry what is or isn't supported in their particular kernel
src/basic/stat-util.c
Outdated
| else if (streq(name, "zonefs")) | ||
| *ret = ZONEFS_MAGIC; | ||
| else if (streq(name, "zsmalloc")) | ||
| *ret = ZSMALLOC_MAGIC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would a user find that list of magic names ? is that the same one as the one the "mount" comand use ?
how to make that future-proof wrt futre filesystems or out-of-tree filesystems ? is there a better way to do that if/elsif list ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, I've used the names in the file_system_type.name struct in the kernel, I assume they're the same ones the "mount" command uses but I'm not sure.
I don't have good ideas on how to make it future proof, I don't think there's any API to get the magic number for a name so it'd have to be something manual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have systemd-analyze capability and systemd-analyze syscall-filter to dump caps and syscalls systemd knows and compare them with what the kernel knows. it might make sense to add a similar verb that lists this table, plus adds warning lines for file systems that are listed in /proc/filesystems but we don't know of.
6a2525d to
abba9c5
Compare
poettering
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too late today or a full review, but some early comments
src/basic/cgroup-util.c
Outdated
| assert(path); | ||
| assert(ret); | ||
|
|
||
| h = malloc0(offsetof(struct file_handle, f_handle) + sizeof(uint64_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is fixed size and small afaics, allocate on stack plz
src/basic/stat-util.c
Outdated
| assert(name); | ||
| assert(ret); | ||
|
|
||
| if (streq(name, "apparmorfs")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we generate this automatically from linux/magic.h (plus a few manual additions)? or at least have some tool that validates on build that everything in linux/magic.h is also listed here?
also, please use gperf for this, see errno-from-gperf.h and such
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a look.
Hmm, pity. Is this tracked anywhere? Did anyone ever file a bug against let's say Fedora asking for this to be added yet? |
|
i finally reviewed @wat-ze-hex's PR #16655 the other day, the commits stemming from there i'll not re-review here. |
poettering
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work!
would love @wat-ze-hex input/review on this one!
src/shared/bpf-object.c
Outdated
| _prog = bpf_object__find_program_by_title(object, title); | ||
| if (!_prog) { | ||
| return -EINVAL; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our usual coding style is not to place {} around single-line if blocks
src/shared/bpf-object.c
Outdated
| _cleanup_free_ struct bpf_program *_prog = NULL; | ||
|
|
||
| assert(ret_prog); | ||
| assert(*ret_prog == NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our usual coding style assumes output parameters can be in undefined state, i.e. we don't vlidate output params on input, i.e. please drop this line
src/basic/stat-util.c
Outdated
| else if (streq(name, "zonefs")) | ||
| *ret = ZONEFS_MAGIC; | ||
| else if (streq(name, "zsmalloc")) | ||
| *ret = ZSMALLOC_MAGIC; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have systemd-analyze capability and systemd-analyze syscall-filter to dump caps and syscalls systemd knows and compare them with what the kernel knows. it might make sense to add a similar verb that lists this table, plus adds warning lines for file systems that are listed in /proc/filesystems but we don't know of.
src/basic/stat-util.c
Outdated
| return 0; | ||
| } | ||
|
|
||
| int fs_type_from_string(const char *name, uint32_t *ret) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint32_t → statfs_f_type_t I figure?
the type exposed by statfs() is signed on some archs, and unsigned on other archs. but some defs use the MSB. it's a total mess.
| @@ -0,0 +1,59 @@ | |||
| /* SPDX-License-Identifier: GPL-2.0-or-later */ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as in the the other PR, maybe we should name these .c files *.restricted.c or so? or *.bpf.c or so? i.e. indicate in the name that this is weird, special C code...
src/test/test-bpf-lsm.c
Outdated
| return log_unit_error_errno(u, r, "Failed to parse RestrictFileSystems: %m"); | ||
| } | ||
|
|
||
| exec_start = strjoin("cat ", file_path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert_se() around this please, to catch oom conditions
src/test/test-bpf-lsm.c
Outdated
|
|
||
| cld_code = SERVICE(u)->exec_command[SERVICE_EXEC_START]->exec_status.code; | ||
| if (cld_code != CLD_EXITED) { | ||
| r = -SYNTHETIC_ERRNO(EBUSY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SYNTHETIC_ERRNO() exists purely to pass some extra info to log_error_errno(), it should only be used around the first arg of log_error_errno() (and related calls), but never really assigned to anything else. please move this into the first arg of the function call hence.
src/test/test-bpf-lsm.c
Outdated
| } | ||
|
|
||
| if (SERVICE(u)->state != SERVICE_DEAD) { | ||
| r = -SYNTHETIC_ERRNO(EBUSY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
src/test/test-bpf-lsm.c
Outdated
|
|
||
| assert_se(getrlimit(RLIMIT_MEMLOCK, &rl) >= 0); | ||
| rl.rlim_cur = rl.rlim_max = MAX(rl.rlim_max, CAN_MEMLOCK_SIZE); | ||
| (void) setrlimit(RLIMIT_MEMLOCK, &rl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setrlimit_closest()
| assert_se(test_restrict_filesystems(m, "restrict_filesystems_test.service", "/sys/kernel/debug/sleep_time", STRV_MAKE("btrfs", "ext4")) < 0); | ||
| assert_se(test_restrict_filesystems(m, "restrict_filesystems_test.service", "/sys/kernel/debug/sleep_time", STRV_MAKE("debugfs", "btrfs", "ext4")) >= 0); | ||
|
|
||
| return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i figure most of the comments I wrote to @wat-ze-hex' test PR also apply here
|
also needs rebase |
I'll start with Debian: https://salsa.debian.org/kernel-team/linux/-/merge_requests/306 |
src/core/load-fragment.c
Outdated
| void *userdata) { | ||
| _cleanup_strv_free_ char **n = NULL; | ||
| size_t nlen = 0, nbufsize = 0; | ||
| char*** restrict_fs = data; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| char*** restrict_fs = data; | |
| char ***restrict_fs = data; |
?
man/systemd.exec.xml
Outdated
| restricted. This option may appear more than once, in which case the namespace | ||
| types are merged by <constant>OR</constant>.</para> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Namespace types?
| int BPF_PROG(restrict_filesystems, struct file *file, int ret) | ||
| { | ||
| unsigned long magic_number; | ||
| unsigned long tmpfs_magic; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused variable, right?
abba9c5 to
a907bcd
Compare
3f08846 to
4e4a3a5
Compare
|
This pull request introduces 1 alert when merging 4e4a3a5 into 1c3c43a - view on LGTM.com new alerts:
|
8d7f0b7 to
4839871
Compare
|
I've addressed (I think) all review comments. Some high level changes:
Also, I was wondering if we should split @wat-ze-hex's commits to a separate PR now that there are 3 PRs using it (#17655, #18385, and this one). It might make things easier to merge. |
|
Fixed those two nits :) |
|
was about to merge, but needs a rebase i fear. |
It fits better there.
They were failing on CI.
Stores filesystem_name -> magic_number(s).
It hooks into the file_open LSM hook and allows only when the filesystem
where the open will take place is present in a BPF map for a particular
cgroup.
The BPF map used is a hash of maps with the following structure:
cgroupID -> (s_magic -> uint32)
The inner map is effectively a set.
The entry at key 0 in the inner map encodes whether the program behaves
as an allow list or a deny list: if its value is 0 it is a deny list,
otherwise it is an allow list.
When the cgroupID is present in the map, the program checks the inner
map for the magic number of the filesystem associated with the file
that's being opened. When the program behaves as an allow list, if that
magic number is present it allows the open to succeed, when the program
behaves as a deny list, it only allows access if the that magic number
is NOT present. When access is denied the program returns -EPERM.
The BPF program uses CO-RE (Compile-Once Run-Everywhere) to access
internal kernel structures without needing kernel headers present at
runtime.
It returns the cgroupID from a cgroup path.
It will be used later.
They're needed for the LSM BPF feature.
This adds 6 functions to implement RestrictFileSystems= * lsm_bpf_supported() checks if LSM BPF is supported. It checks that cgroupv2 is used, that BPF LSM is enabled, and tries to load the BPF LSM program which makes sure BTF and hash of maps are supported, and BPF LSM programs can be loaded. * lsm_bpf_setup() loads and attaches the LSM BPF program. * lsm_bpf_unit_restrict_filesystems() populates the hash of maps BPF map with the cgroupID and the set of allowed or denied filesystems. * lsm_bpf_cleanup() removes a cgroupID entry from the hash of maps. * lsm_bpf_map_restrict_fs_fd() is a helper function to get the file descriptor of the BPF map. * lsm_bpf_destroy() is a wrapper around the destroy function of the BPF skeleton file.
It attaches the LSM BPF program when the system manager starts up. It populates the hash of maps BPF map when services that have RestrictFileSystems= set start. It cleans up the hash of maps when the unit cgroup is pruned. To pass the file descriptor of the BPF map we add it to the keep_fds array.
It takes an allow or deny list of filesystems services should have access to.
For distros that ship libbpf >=0.2.0.
|
lgtm, thanks! |
|
CI failures appear unrelated. Merging! Thanks! |
This PR adds the
RestrictFileSystems=property. When used, processesbelonging to a service are only able to access the filesystems listed in the
property.
This is implemented by attaching a BPF program to the
file_openBPF LSM hook.The program is attached at boot time and stays there forever. Then, when a
service specifying the
RestrictFileSystems=property is started, an entry isadded to a global hash of maps BPF map. The map stores a set of filesystem
magic numbers per cgroupID. When a process tries to open a file, the BPF
program is executed and checks the cgroup the process is running in: if an
entry is present in the global map it checks if the filesystem the process is
trying to access is present in the set, if not, it denies access to it.
RestrictFileSystems=is only supported on systems with the LSM BPF hookenabled and using cgroup2 (unified or hybrid).
This PR makes use of the libbpf framework introduced in #17655. Same as that PR,
it requires clang, llvm and libbpf at compile time, and the
libbpf shared library is optional at runtime (uses dlopen).
Thanks to the usage of libbpf, the program can use the CO-RE (Compile-Once
Run-Everywhere) technology so it doesn't require kernel headers at runtime to
access internal kernel structures.