rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT #3967

cyphar · 2023-08-06T03:56:44Z

The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4 ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de ("Fix failure with rw bind mount of a ro fuse") and
da780e4 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly requested by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see 1) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

- We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.
- Revert 97f5ee4 ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).
- Improve handling of atime in the fallback case by:
  - Correctly handling the returned flags in statfs(2).
  - Implementing the MNT_LOCK_ATIME checks in our code to ensure we
    produce errors rather than silently producing incorrect atime
    mounts.
- Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.
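
The first item in the list above can be illustrated with a minimal sketch. This is not the code from the PR: the option table, `parseOptions`, and `checkFallback` are hypothetical names made up for this example, and only the numeric flag values are taken from the Linux headers. It shows one way to record which flags the user explicitly cleared and to refuse a fallback that would set one of them.

```go
package main

import "fmt"

// Flag values as defined by Linux (MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC).
const (
	msRdonly uint32 = 1
	msNosuid uint32 = 2
	msNodev  uint32 = 4
	msNoexec uint32 = 8
)

// optionTable is a hypothetical lookup table for this sketch: it maps an
// option name to the flag it touches and whether the option clears the flag.
var optionTable = map[string]struct {
	flag  uint32
	clear bool
}{
	"ro": {msRdonly, false}, "rw": {msRdonly, true},
	"nosuid": {msNosuid, false}, "suid": {msNosuid, true},
	"nodev": {msNodev, false}, "dev": {msNodev, true},
	"noexec": {msNoexec, false}, "exec": {msNoexec, true},
}

// parseOptions returns the flags to set and, separately, the flags the user
// explicitly asked to clear. Later options win over earlier ones.
func parseOptions(opts []string) (set, cleared uint32) {
	for _, o := range opts {
		e, ok := optionTable[o]
		if !ok {
			continue // options unknown to this sketch (e.g. "bind") are skipped
		}
		if e.clear {
			cleared |= e.flag
			set &^= e.flag
		} else {
			set |= e.flag
			cleared &^= e.flag
		}
	}
	return set, cleared
}

// checkFallback returns an error if copying the locked flags of the mount
// source would set a flag that the user explicitly asked to clear.
func checkFallback(srcLocked, cleared uint32) error {
	if conflict := srcLocked & cleared; conflict != 0 {
		return fmt.Errorf("cannot clear locked mount flags %#x", conflict)
	}
	return nil
}
```

With options `rw,nosuid` on a source whose `MS_RDONLY` is locked, `checkFallback` reports a conflict. With only `nosuid` requested, the same source passes and the fallback may keep the mount read-only.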

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.

Reverts: 97f5ee4 ("Only remount if requested flags differ from current")
Fixes: 50105de ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

rata

Overall LGTM, left some minor comments :-)

libcontainer/configs/mount_linux.go

libcontainer/specconv/spec_linux.go

AkihiroSuda · 2023-08-07T11:13:12Z

@rpluem-vf
Does this still solve the original issue?

tests/integration/mounts_sshfs.bats

rata · 2023-08-07T11:45:20Z

@AkihiroSuda I think it should, because the issue was caused by containerd requesting only {"rbind", "ro"} and the source had "nosuid", "nodev", "noexec" (we fixed it in this PR: https://github.com/containerd/containerd/pull/8309/files). I haven't checked that, though.

But that test is removed, so added a comment to keep it :)

rpluem-vf · 2023-08-07T11:52:50Z

> @rpluem-vf Does this still solve the original issue?

I haven't tested it and hence I am not sure. In general I agree with the imposed logic to fail if some mount flags are explicitly requested that are not possible with a rootless bind mount because the underlying mount flags of the source filesystem are the opposite, but it changes the original test case (before #3805). I am not sure if the direct users of runc might not at least explicitly request rw in certain cases when the underlying file system is ro. IMHO this worked before #3805 and with #3805, but I think it stops working now.

cyphar · 2023-08-07T12:57:07Z

There is a test issue with this I'm still debugging.

> I am not sure if the direct users of runc might not at least explicitly request rw in certain cases when the underlying file system is ro. IMHO this worked before #3805 and with #3805, but I think it stops working now.

Imho this behaviour is a bug. We could special-case it if it breaks something in 1.2-rc1 but silently overwriting an option the user explicitly passed seems wrong. If the user doesn't care if the flag is set, they should omit it. I suspect it's actually a spec violation to not enforce explicitly specified flags (then again, if this breaks users we need to figure out a workaround).

rata · 2023-08-07T13:22:56Z

@cyphar +1. Plus I don't know of any real world example of that use case.

rata · 2023-08-08T11:57:01Z

@cyphar it seems cirrus is failing due to a timeout. Wanna re-push to re-trigger the CI?

cyphar · 2023-08-08T12:54:36Z

The timeout is the test issue I'm debugging 😉

kolyshkin · 2023-08-08T23:40:38Z

The timeout for cirrus jobs is 30 minutes, and normally they finish in 5 to 20 minutes, depending on the job (see e.g. https://cirrus-ci.com/build/5948000845955072).

Reproduced the issue locally:

[kir@kir-rhat runc]$ sudo make localintegration
....
ok 204 userns with inaccessible mount + exec
ok 205 userns with bind mount before a cgroupfs mount # skip test requires cgroups_v1
ok 206 runc version
# HANGS HERE

Or, for a narrower scope, we can use

sudo bats tests/integration/mounts_sshfs.bats

Let's see what's going on once it hangs:

$ ps axf
  ...
  11841 pts/0    S+     0:00  |       |   \_ sudo make localintegration RUNC_USE_SYSTEMD=yes
  11849 pts/0    S+     0:00  |       |       \_ make localintegration RUNC_USE_SYSTEMD=yes
  14138 pts/0    S+     0:00  |       |           \_ /usr/bin/bash /usr/libexec/bats-core/bats -t tests/integration
  14146 pts/0    S+     0:00  |       |               \_ /usr/bin/bash /usr/libexec/bats-core/bats -t tests/integration
  14147 pts/0    S+     0:00  |       |               \_ /usr/bin/bash /usr/libexec/bats-core/bats-format-cat --dummy-flag
  14148 pts/0    S+     0:00  |       |                   \_ cat
  ...

This is bats waiting for something to finish/close. I've seen it before when there is some leftover process that inherits an extra fd from bats.

This time the process is recvtty; in fact, two:

  28474 pts/0    Sl+    0:00 /home/kir/go/src/github.com/opencontainers/runc/tests/integration/../../contrib/cmd/recvtty/recvtty --pid-file /tmp/bats-run-FpRMWe/runc.H7EP7t/tty/pid --mode null /tmp/bats-run-FpRMWe/runc.H7EP7
  28530 pts/0    Sl+    0:00 /home/kir/go/src/github.com/opencontainers/runc/tests/integration/../../contrib/cmd/recvtty/recvtty --pid-file /tmp/bats-run-FpRMWe/runc.cZWnfG/tty/pid --mode null /tmp/bats-run-FpRMWe/runc.cZWnf

Executing killall recvtty unsticks bats.

Usually, recvtty is killed from teardown_bundle; we can assume it is still there because teardown_bundle was not called.

It happens because with the newly added cases, due to requires rootless, we do not call setup_sshfs and then teardown silently fails on accessing the non-existing variable $DIR. The "silent" part is still a mystery to me (bats bug?), and the "fails" part is due to set -u in tests/integration/helpers.bash (which I still want to keep as it is usually helpful in finding issues with the shell code).

Here's the fix:

diff --git a/tests/integration/mounts_sshfs.bats b/tests/integration/mounts_sshfs.bats
index 09fcced2..921bc063 100644
--- a/tests/integration/mounts_sshfs.bats
+++ b/tests/integration/mounts_sshfs.bats
@@ -8,9 +8,12 @@ function setup() {
 }
 
 function teardown() {
-       # Some distros do not have fusermount installed
-       # as a dependency of fuse-sshfs, and good ol' umount works.
-       fusermount -u "$DIR" || umount "$DIR"
+       if [ -v DIR ]; then
+               # Some distros do not have fusermount installed
+               # as a dependency of fuse-sshfs, and good ol' umount works.
+               fusermount -u "$DIR" || umount "$DIR"
+               unset DIR
+       fi
 
        teardown_bundle
 }

PS this dark corner of bash/bats is yet another example of why rewriting all shell tests in Go or Python would be a good idea. As much as I love bash, when things are getting too complicated, it becomes unbearable.

kolyshkin · 2023-08-09T00:53:17Z

Created #3975 to test. @cyphar you can cherry-pick 3d777f9 to here.

cyphar · 2023-08-09T01:22:35Z

Thanks @kolyshkin! It looked like a teardown issue, but I was busy with other things so I didn't look into it too deeply. Looks pretty nasty to track down.

Also, looking at the tests again, I think requires rootless is wrong. We should just be running the process in a userns... Let me rewrite them.

> PS this dark corner of bash/bats is yet another example of why rewriting all shell tests in Go or Python would be a good idea. As much as I love bash, when things are getting too complicated, it becomes unbearable.

The problem is that writing integration tests becomes more difficult, and there might be some slight differences between the testing environment (being spawned from a go test process) than from an actual shell. Docker gave up on integration-cli tests a long time ago for this reason.

That being said, as someone who rewrote large parts of the test running code in bats, that codebase is pretty gnarly and I am surprised it works as well as it does.

cyphar · 2023-08-09T05:27:48Z

After doing some more digging, we actually have had pretty serious bugs in this code for a long time -- it seems that nobody has actually noticed that requesting "rw" specifically would be ignored and it was a question of random chance whether bind-mounts would clear flags not specified in the options of the mount, and a bunch of other issues. We basically need to revert 97f5ee4 (which, funnily enough, was trying to fix the same issue as #3805) as well.

It seems clear to me the correct behaviour should be "give the user the exact set of options they specified, and if not possible then allow the original options to be set if they didn't explicitly request the opposite". At the moment, neither condition is consistently met by runc.

cyphar · 2023-08-10T05:19:06Z

@opencontainers/runc-maintainers this should be ready to review. The atime behaviour is quite subtle so I've added some (possibly even too many) comments to explain the reasoning. This will need a prominent changelog entry too...

cyphar · 2023-08-10T05:23:54Z

libcontainer/rootfs_linux.go

+		// other flags, we default to MS_STRICTATIME since that's the only
+		// setting that makes sense here. Critically, we *always* call mount(2)
+		// with some MS_*ATIME flag set to ensure we have consistent results.
+		if m.Flags&mntAtimeFlags == 0 {


I guess we should also handle "nostrictatime" here too. Or maybe "norelatime,nostrictatime" should be rejected as a configuration because it makes no sense (what atime is the user actually requesting?)? Or maybe we should just copy the host flags?

While copying the host flags seems "reasonable", this is at odds with every other flag (which we clear if not explicitly requested). On the other hand, flags like "atime" and "diratime" make very little sense without being based on the original mount's settings. On the other other hand, the same thing could be said about the "dev", "exec", etc options -- they only have an effect in the fallback case.

I suspect that we might want to add an "inherit" flag to allow users to explicitly request inheriting the original mount's settings. Only inheriting some flags will just lead to more confusion.

rata · 2023-08-10T09:43:04Z

@cyphar can you label this as impact/changelog and add a changelog entry? (We can update it afterwards if some changes are done during review)

libcontainer/rootfs_linux.go

kolyshkin

LGTM. The atime flags logic is somewhat hard to follow, but (1) it seems to do what the comments describe and (2) I couldn't see a way to make it more readable.

I think what's missing from this PR is a bunch of test cases. Off the top of my head, I don't remember any tests covering atime flags.

cyphar · 2023-08-10T22:09:56Z

@kolyshkin The last integration test added has a bunch of atime flag tests, what other kind of tests would you prefer?

tests/integration/mounts_sshfs.bats

kolyshkin · 2023-08-11T00:07:21Z

> @kolyshkin The last integration test added has a bunch of atime flag tests, what other kind of tests would you prefer?

My bad, something mixed up in my mind. Left a couple of nits for tests, otherwise LGTM

cyphar · 2023-08-11T07:12:14Z

Marked this as a draft because there are ambiguities in runtime-spec as to the right behaviour here. A straight-forward reading is that we need to copy the mount(8) behaviour, but for bind-mounts, which mount behaviour is being referenced? There is no "remount" option in the spec, but we need to pass MS_REMOUNT|MS_BIND for symlinks. The behaviour for mount -o $flags, mount -o remount,$flags, mount --bind -o $flags, and mount --bind -o remount,$flags are all different.

- mount -o $flags just passes the requested flags.
- mount --bind -o $flags to create the mount is somewhat like our current (imho completely broken) behaviour, where you will inherit the original mount flags unless you ask for something weird, in which case we clear all of the other mount flags as an accidental consequence of setting a flag. I cannot see this as anything other than a bug because they literally have the same behaviour where requesting a flag be cleared doesn't actually clear the flag. It even has the bug where passing rw will be ignored. (mount --bind -o rw on a read-only mount source results in an ro mount!)
- mount --bind -o remount,$flags will take the current flags and then apply the requested settings. This is at least somewhat better than mount --bind -o because it doesn't ignore clearing flags in such a horrific way, but it does have the issue that atime handling is completely busted -- basically the issue is that they treat the atime flags as if they were regular mount flags when in fact noatime,relatime,strictatime are more like an enum and nodiratime is a separate flag (that has funky behaviour with relatime due to the way the kernel calculates the default settings) -- all of which has very finicky semantics. These behaviours are obviously bugs:
  - mount --bind -o remount,nodiratime,norelatime will give you a nodiratime,relatime mount.
  - mount --bind -o remount,relatime on a noatime mount will result in a noatime mount.
- mount -o remount,$flags is like mount --bind -o remount,$flags but with MS_BIND missing, meaning this will change the superblock flags rather than the mount flags.

From where I'm standing, mount --bind -o remount,$flags is the least-bad behaviour but it should handle atime "properly". Our current implementation is somewhat close to mount --bind -o $flags. This PR is like mount --bind -o $flags except we always set flags to ensure consistent results -- this has the result of making it so that clearing flags are not ignored (we clear everything unless you requested it, so clearing flags are a no-op but at least you'll never silently get a different mount to the one you requested).
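
To make the "atime modes are an enum" point concrete, here is a small sketch of what "properly" could mean when applying requested atime settings on top of existing ones. It is an assumption-laden illustration, not the logic in this PR: `applyAtime` is a hypothetical helper, it does not model clearing options such as "diratime" or "norelatime", and it ignores the kernel's defaulting rules. Only the flag values are the real Linux constants.

```go
package main

import "fmt"

// Flag values as defined by Linux for the atime-related mount flags.
const (
	msNoatime     uint32 = 1 << 10
	msNodiratime  uint32 = 1 << 11
	msRelatime    uint32 = 1 << 21
	msStrictatime uint32 = 1 << 24

	// The three mutually exclusive atime modes.
	atimeModeMask = msNoatime | msRelatime | msStrictatime
)

// applyAtime applies the requested atime flags on top of the current ones,
// treating noatime/relatime/strictatime as an enum: a requested mode replaces
// the current mode instead of being OR-ed into it. nodiratime is handled as an
// independent bit. Requesting more than one mode at once is an error.
func applyAtime(current, requested uint32) (uint32, error) {
	mode := requested & atimeModeMask
	if mode&(mode-1) != 0 { // more than one mode bit is set
		return 0, fmt.Errorf("conflicting atime modes requested: %#x", mode)
	}
	result := current
	if mode != 0 {
		result = (result &^ atimeModeMask) | mode
	}
	result |= requested & msNodiratime
	return result, nil
}
```

Under this model, requesting relatime on a noatime mount yields a relatime mount, which is the outcome the second buggy mount(8) case above fails to produce.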

Also, the above behaviour from mount(8) looks so broken to me that I'm probably going to send a patch to fix it.

I think we need a spec issue for this...

rata · 2023-10-24T14:52:44Z

Opened this issue to track this: #4093

The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has, which is why you probably haven't noticed the issue). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clear locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors.)

The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it, which most users would find surprising).

By the way, this "always set rw" behaviour is a departure from Docker that I don't think is necessary.

Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>


cyphar added this to the 1.2.0 milestone Aug 6, 2023

cyphar mentioned this pull request Aug 6, 2023

Allow bind mounts of nodev,nosuid,noexec filesystems #3805

Merged

rata approved these changes Aug 7, 2023

libcontainer/configs/mount_linux.go

libcontainer/specconv/spec_linux.go

rata reviewed Aug 7, 2023

tests/integration/mounts_sshfs.bats

kolyshkin mentioned this pull request Aug 9, 2023

[DNM carry/fix #3967] Remove mount fallback flag #3975

Closed

cyphar changed the title ~~rootfs: replace --no-mount-fallback option with better heuristic~~ rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT Aug 10, 2023

kolyshkin reviewed Aug 10, 2023

libcontainer/rootfs_linux.go

kolyshkin approved these changes Aug 10, 2023

kolyshkin reviewed Aug 10, 2023

tests/integration/mounts_sshfs.bats

kolyshkin reviewed Aug 11, 2023

tests/integration/mounts_sshfs.bats

kolyshkin self-requested a review August 11, 2023 00:07

cyphar marked this pull request as draft August 11, 2023 02:52

rata mentioned this pull request Oct 24, 2023

Tests broken in debian due to PR: rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT #4093

Closed

kolyshkin added the impact/changelog label Oct 24, 2023

cyphar mentioned this pull request Mar 14, 2024

release 1.2.0-rc.1 #4221

Merged

This was referenced Jan 2, 2025

runc: 1.1.15 -> 1.3.0 NixOS/nixpkgs#353610

Merged

rootless bind-mount failure for read-only volume with 1.2.[0-4] #4575

Closed

cyphar mentioned this pull request Apr 22, 2025

Remove using rw as a default mount option containers/podman#25942

Merged

giuseppe mentioned this pull request Apr 22, 2025

report errors when attempting to reset a locked mount flag containers/crun#1724

Open

rcmadhankumar mentioned this pull request May 26, 2025

bsc#1239776: remove appending rw as the default mount option SUSE/podman#16

Merged

rcmadhankumar mentioned this pull request May 26, 2025

Remove appending rw as the default mount option(#bsc1239776) SUSE/podman#17

Merged

cyphar mentioned this pull request Jul 31, 2025

All read-only mounts set through the CSI plugin have failed. #4824

Open
