-
Notifications
You must be signed in to change notification settings - Fork 2.2k
rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT #3967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT #3967
Conversation
rata
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, left some minor comments :-)
|
@rpluem-vf |
|
@AkihiroSuda I think it should, because the issue was caused by containerd requesting only But that test is removed, so added a comment to keep it :) |
I haven't tested it and hence I am not sure. In general I agree with the imposed logic to fail if some mount flags are explicitly requested that are not possible with a rootless bind mount due to the underlying mount flags of the source filesystem are the opposite, but it changes the original test case (before #3805). I am not sure if the direct users of runc might not at least explicitly request |
|
There is a test issue with this I'm still debugging.
Imho this behaviour is a bug. We could special-case it if it breaks something in 1.2-rc1 but silently overwriting an option the user explicitly passed seems wrong. If the user doesn't care if the flag is set, they should omit it. I suspect it's actually a spec violation to not enforce explicitly specified flags (then again, if this breaks users we need to figure out a workaround). |
|
@cyphar +1. Plus I don't know of any real world example of that use case. |
|
@cyphar it seems cirrus is failing due to a timeout. Wanna re-push to re-trigger the CI? |
|
The timeout is the test issue I'm debugging 😉 |
|
The timeout for cirrus jobs is 30 minutes, and normally they are finished in 5 to 20 minutes, depending on a job (see e.g. https://cirrus-ci.com/build/5948000845955072). Reproduced the issue locally: [kir@kir-rhat runc]$ sudo make localintegration
....
ok 204 userns with inaccessible mount + exec
ok 205 userns with bind mount before a cgroupfs mount # skip test requires cgroups_v1
ok 206 runc version
# HANGS HEREOr, for a narrower scope, we can use sudo bats tests/integration/mounts_sshfs.batsLet's see what's going on once it hangs: $ ps axf
...
11841 pts/0 S+ 0:00 | | \_ sudo make localintegration RUNC_USE_SYSTEMD=yes
11849 pts/0 S+ 0:00 | | \_ make localintegration RUNC_USE_SYSTEMD=yes
14138 pts/0 S+ 0:00 | | \_ /usr/bin/bash /usr/libexec/bats-core/bats -t tests/integration
14146 pts/0 S+ 0:00 | | \_ /usr/bin/bash /usr/libexec/bats-core/bats -t tests/integration
14147 pts/0 S+ 0:00 | | \_ /usr/bin/bash /usr/libexec/bats-core/bats-format-cat --dummy-flag
14148 pts/0 S+ 0:00 | | \_ cat
...This is bats waiting for something to finish/close. I've seen it before when there is some leftover process that inherits an extra fd from bats. This time the process is recvtty; in fact, two: Executing Usually, recvtty is killed from It happens because with the newly added cases, due to Here's the fix: diff --git a/tests/integration/mounts_sshfs.bats b/tests/integration/mounts_sshfs.bats
index 09fcced2..921bc063 100644
--- a/tests/integration/mounts_sshfs.bats
+++ b/tests/integration/mounts_sshfs.bats
@@ -8,9 +8,12 @@ function setup() {
}
function teardown() {
- # Some distros do not have fusermount installed
- # as a dependency of fuse-sshfs, and good ol' umount works.
- fusermount -u "$DIR" || umount "$DIR"
+ if [ -v DIR ]; then
+ # Some distros do not have fusermount installed
+ # as a dependency of fuse-sshfs, and good ol' umount works.
+ fusermount -u "$DIR" || umount "$DIR"
+ unset DIR
+ fi
teardown_bundle
}PS this dark corner of bash/bats is yet another example of why rewriting everything all shell tests in Go or Python would be a good idea. As much as I love bash, when things are getting too complicated, it becomes unbearable. |
|
Thanks @kolyshkin! It looked like a teardown issue, but I was busy with other things so I didn't look into it too deeply. Looks pretty nasty to track down. Also, looking at the tests again, I think
The problem is that writing integration tests becomes more difficult, and there might be some slight differences between the testing environment (being spawned from a That being said, as someone who rewrote large parts of the test running code in bats, that codebase is pretty gnarly and I am surprised it works as well as it does. |
|
After doing some more digging, we actually have had pretty serious bugs in this code for a long time -- it seems that nobody has actually noticed that requesting "rw" specifically would be ignored and it was a question of random chance whether bind-mounts would clear flags not specified in the options of the mount, and a bunch of other issues. We basically need to revert 97f5ee4 (which, funnily enough, was trying to fix the same issue as #3805) as well. It seems clear to me the correct behaviour should be "give the user the exact set of options they specified, and if not possible then allow the original options to be set if they didn't explicitly request the opposite". At the moment, neither condition is consistently met by runc. |
|
@opencontainers/runc-maintainers this should be ready to review. The atime behaviour is quite subtle so I've added some (possibly even too many) comments to explain the reasoning. This will need a prominent changelog entry too... |
libcontainer/rootfs_linux.go
Outdated
| // other flags, we default to MS_STRICTATIME since that's the only | ||
| // setting that makes sense here. Critically, we *always* call mount(2) | ||
| // with some MS_*ATIME flag set to ensure we have consistent results. | ||
| if m.Flags&mntAtimeFlags == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should also handle "nostrictatime" here too. Or maybe "norelatime,nostrictatime" should be rejected as a configuration because it makes no sense (what atime is the user actually requesting?)? Or maybe we should just copy the host flags?
While copying the host flags seems "reasonable", this is at odds with every other flag (which we clear if not explicitly requested). On the other hand, flags like "atime" and "diratime" make very little sense without being based on the original mount's settings. On the other other hand, the same thing could be said about the "dev", "exec", etc options -- they only have an effect in the fallback case.
I suspect that we might want to add an "inherit" flag to allow users to explicitly request inheriting the original mount's settings. Only inheriting some flags will just lead to more confusion.
|
@cyphar can you label this as impact/changelog and add a changelog entry? (We can update it afterwards if some changes are done during review) |
This comment was marked as resolved.
This comment was marked as resolved.
kolyshkin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The atime flags logic is somewhat hard to follow, but (1) it seems to do what the comments describe and (2) I couldn't see a way to make it more readable.
I think what's missing from this PR is a bunch of test cases. From the top of my head, I don't remember any tests covering atime flags.
|
@kolyshkin The last integration test added has a bunch of arime flag tests, what other kind of tests would you prefer? |
My bad, something mixed up in my mind. Left a couple of nits for tests, otherwise LGTM |
|
Marked this as a draft because there are ambiguities in runtime-spec as to the right behaviour here. A straight-forward reading is that we need to copy the
From where I'm standing, Also, the above behaviour from I think we need a spec issue for this... |
|
Opened this issue to tackle this down: #4093 |
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has, which is why you probably haven't noticed the issue). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors.) The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker that I don't think is necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors.) The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker that I don't think is necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors.) The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors.) The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors.) The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors.) The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors) The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1242132 Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1242132 Signed-off-by: rcmadhankumar <madhankumar.chellamuthu@suse.com>
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1242132
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1239776
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1239776
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1239776
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1239776
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1239776
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1239776
The backstory for this is that runc 1.2 (opencontainers/runc#3967) fixed a long-standing bug in our mount flag handling (a bug that crun still has). Before runc 1.2, when dealing with locked mount flags that user namespaced containers cannot clear, trying to explicitly clearing locked flags (like rw clearing MS_RDONLY) would silently ignore the rw flag in most cases and would result in a read-only mount. This is obviously not what the user expects. What runc 1.2 did is that it made it so that passing clearing flags like rw would always result in an attempt to clear the flag (which was not the case before), and would (in all cases) explicitly return an error if we try to clear locking flags. (This also let us finally fix a bunch of other long-standing issues with locked mount flags causing seemingly spurious errors). The problem is that podman sets rw on all mounts by default (even if the user doesn't specify anything). This is actually a no-op in runc 1.1 and crun because of a bug in how clearing flags were handled (rw is the absence of MS_RDONLY but until runc 1.2 we didn't correctly track clearing flags like that, meaning that rw would literally be handled as if it were not set at all by users) but in runc 1.2 leads to unfortunate breakages and a subtle change in behaviour (before, a ro mount being bind-mounted into a container would also be ro -- though due to the above bug even setting rw explicitly would result in ro in most cases -- but with runc 1.2 the mount will always be rw even if the user didn't explicitly request it which most users would find surprising). By the way, this "always set rw" behaviour is a departure from Docker and it is not necesssary. Bugs: bsc#1239776
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.
A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.
During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4 ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de ("Fix failure with rw bind mount of a ro fuse") and
da780e4 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.
Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see 1) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.
NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.
So, to resolve this:
We store which flags were explicitly requested to be cleared by the
user, so that we can detect whether the userns fallback path would end
up setting a flag the user explicitly wished to clear. If so, we
return an error because we couldn't fulfil the configuration settings.
Revert 97f5ee4 ("Only remount if requested flags differ from
current"), as missing flags do not mean we can skip MS_REMOUNT (in
fact, missing flags are how you indicate a flag needs to be cleared
with mount(2)). The original purpose of the patch was to fix the
userns issue, but as mentioned above the correct mechanism is to do a
fallback mount that copies the lockable flags from statfs(2).
Improve handling of atime in the fallback case by:
produce errors rather than silently producing incorrect atime
mounts.
Improve the tests so we correctly detect all of these contingencies,
including a general "bind-mount atime handling" test to ensure that
the behaviour described here is accurate.
This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
Reverts: 97f5ee4 ("Only remount if requested flags differ from current")
Fixes: 50105de ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai cyphar@cyphar.com