Skip to content

System tests: honor $OCI_RUNTIME (for CI)#10199

Merged
openshift-merge-robot merged 1 commit intocontainers:masterfrom
edsantiago:system_tests_with_runc_override
May 4, 2021
Merged

System tests: honor $OCI_RUNTIME (for CI)#10199
openshift-merge-robot merged 1 commit intocontainers:masterfrom
edsantiago:system_tests_with_runc_override

Conversation

@edsantiago
Copy link
Copy Markdown
Member

Some CI systems set $OCI_RUNTIME as a way to override the
default crun. Integration (e2e) tests honor this, but system
tests were not aware of the convention; this means we haven't
been testing system tests with runc, which means RHEL gating
tests are now failing.

The proper solution would be to edit containers.conf on CI
systems. Sorry, that would involve too much CI-VM work.
Instead, this PR detects $OCI_RUNTIME and creates a dummy
containers.conf file using that runtime.

Add: various skips for tests that don't work with runc.

Refactor: add a helper function so we don't need to do
the complicated 'podman info blah blah .OCIRuntime.blah'
thing in many places.

BUG: we leave a tmp file behind on exit.

Signed-off-by: Ed Santiago santiago@redhat.com

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 3, 2021
@rhatdan
Copy link
Copy Markdown
Member

rhatdan commented May 3, 2021

/approve
LGTM

@openshift-ci-robot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, rhatdan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@edsantiago edsantiago force-pushed the system_tests_with_runc_override branch from fb57ac4 to f2aa600 Compare May 3, 2021 20:29
@edsantiago
Copy link
Copy Markdown
Member Author

[re-pushed with a refinement: the pid=host failure only happens when rootless, not rootful. I updated the skip correspondingly, so it still runs on root tests.]

Comment on lines 57 to 59
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new skip; I realize that the clutter (removing very-old cruft) makes it hard to review. @rhatdan, do you have any idea why this would fail under runc? More importantly: is this an expected failure?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no memory of this. I guess the best we can do is try it out and see if it fails.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does fail -- see the newly-added comment (lines 54-55, green) (and again, I apologize for the difficulty of review. The removed cruft (red) dates to October 2020, when some CI systems had an old version of container-selinux.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

podman --runtime=runc run --pid=host alpine cat /proc/self/attr/current
Error: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: readonly path /proc/asound: operation not permitted: OCI permission denied

Weird
@giuseppe any idea what is going on here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more data point: on RHEL, IIRC, the failure is in /proc/bus instead of /proc/asound. This makes me suspect that it's just hitting the alphabetically-first entry under /proc. In case that helps.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like runc is attempting to write to /proc when --pid=host and SELinux is enabled, and failing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I build runc from main branch, I get.

podman --runtime=$PWD/runc run --pid=host alpine cat /proc/self/attr/current
unconfined_u:system_r:spc_t:s0

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rpm -qf /bin/runc
containerd.io-1.4.3-3.2.fc33.x86_64

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my f33 system: runc-1.0.0-377.rc93.fc33.x86_64

@edsantiago
Copy link
Copy Markdown
Member Author

edsantiago commented May 4, 2021

This PR has already been much more complicated than it should've been; and it's getting worse. Ubunto 2010 (which is now old-ubuntu, apparently) is now failing, and it's not a flake:

[+0589s] not ok 135 podman volume: exec/noexec
         # (from function `die' in file test/system/helpers.bash, line 412,
         #  from function `run_podman' in file test/system/helpers.bash, line 220,
         #  in test file test/system/160-volumes.bats, line 131)
         #   `run_podman ${expect_rc} run --rm --volume $myvolume:/vol:noexec,z $IMAGE /vol/myscript' failed
         # # podman rm --all --force
         # # podman ps --all --external --format {{.ID}} {{.Names}}
         # # podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # quay.io/libpod/testimage:20210223 4abe85964f5c
         # # podman volume rm -a
         # # podman volume create myvol15CCE5Uwiv
         # myvol15CCE5Uwiv
         # # podman volume inspect --format {{.Mountpoint}} myvol15CCE5Uwiv
         # /var/lib/containers/storage/volumes/myvol15CCE5Uwiv/_data
         # # podman run --rm --volume myvol15CCE5Uwiv:/vol:noexec,z quay.io/libpod/testimage:20210223 /vol/myscript
         # standard_init_linux.go:219: exec user process caused: permission denied
         # [ rc=1 (** EXPECTED 126 **) ]
         # #/vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
         # #| FAIL: exit code is 1; expected 126
         # #\^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'll look into it tomorrow. I will also have to deal with OCI_RUNTIME=/path (I had assumed it was just the name). [UPDATE: solved, by making podman_runtime() return just basename of the runtime)

Some CI systems set $OCI_RUNTIME as a way to override the
default crun. Integration (e2e) tests honor this, but system
tests were not aware of the convention; this means we haven't
been testing system tests with runc, which means RHEL gating
tests are now failing.

The proper solution would be to edit containers.conf on CI
systems. Sorry, that would involve too much CI-VM work.
Instead, this PR detects $OCI_RUNTIME and creates a dummy
containers.conf file using that runtime.

Add: various skips for tests that don't work with runc.

Refactor: add a helper function so we don't need to do
the complicated 'podman info blah blah .OCIRuntime.blah'
thing in many places.

BUG: we leave a tmp file behind on exit.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@edsantiago edsantiago force-pushed the system_tests_with_runc_override branch from f2aa600 to 9fd7ab5 Compare May 4, 2021 02:15
@edsantiago
Copy link
Copy Markdown
Member Author

Tests green, but I need to go over each and every system-test result to confirm that nothing was skipped inadvertently; I'll do that later this morning.

@edsantiago
Copy link
Copy Markdown
Member Author

Here we are. All looks good. For some reason I had thought f33 (prior fedora) would use runc; but no, it's only prior_ubuntu that uses runc. So we don't actually get any rootless runc testing. But anyway, PR is working as desired.

runtime root rootless remote(root)
fedora-33 crun ok ok ok
fedora-34 crun ok ok ok
ubuntu-2010 runc ok N/A ok
ubuntu-2104 crun ok N/A ok

There's only one useful result, but the others, at least I didn't break anything. I also confirmed the skips, that is, that I'm not inadvertently skipping code that should run.

@mheon
Copy link
Copy Markdown
Member

mheon commented May 4, 2021

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label May 4, 2021
@openshift-merge-robot openshift-merge-robot merged commit 8eefca5 into containers:master May 4, 2021
@edsantiago edsantiago deleted the system_tests_with_runc_override branch May 4, 2021 18:50
edsantiago added a commit to edsantiago/libpod that referenced this pull request Aug 18, 2022
This exposed a nasty bug in our system-test setup: Ubuntu (runc)
was writing a scratch containers.conf file, and setting CONTAINERS_CONF
to point to it. This was well-intentionedly introduced in containers#10199 as
part of our long sad history of not testing runc. What I did not
understand at that time is that CONTAINERS_CONF is **dangerous**:
it does not mean "I will read standard containers.conf and then
override", it means "I will **IGNORE** standard containers.conf
and use only the settings in this file"! So on Ubuntu we were
losing all the default settings: capabilities, sysctls, all.

Yes, this is documented in containers.conf(5) but it is such
a huge violation of POLA that I need to repeat it.

In containers#14972, as yet another attempt to fix our runc crisis, I
introduced a new runc-override mechanism: create a custom
/etc/containers/containers.conf when OCI_RUNTIME=runc.
Unlike the CONTAINERS_CONF envariable, the /etc file
actually means what you think it means: "read the default
file first, then override with the /etc file contents".
I.e., we get the desired defaults. But I didn't remember
this helpers.bash workaround, so our runc testing has
actually been flawed: we have not been testing with
the system containers.conf. This commit removes the
no-longer-needed and never-actually-wanted workaround,
and by virtue of testing the cap-drops in kube generate,
we add a regression test to make sure this never happens
again.

It's a little scary that we haven't been testing capabilities.

Also scary: this PR requires python, for converting yaml to json.
I think that should be safe: python3 'import yaml' and 'json'
works fine on a RHEL8.7 VM from 1minutetip.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants