
e2e tests: use /var/tmp, not $TMPDIR, as workdirs#22207

Merged
openshift-merge-bot[bot] merged 1 commit into containers:main from edsantiago:e2e-var-tmp
Apr 26, 2024

Conversation

@edsantiago
Member

@edsantiago edsantiago commented Mar 28, 2024

TMPDIR is typically /tmp which is typically(*) a tmpfs.

This PR ignores $TMPDIR when $CI is defined, forcing all
e2e tests to set up one central working directory in /var/tmp
instead.

Also, lots of cleanup.

(*) For many years, up to and still including the time of
this PR, /tmp on Fedora CI VMs is actually NOT tmpfs,
it is just / (root). This is nonstandard and undesirable.

VMs: containers/automation_images#340

Signed-off-by: Ed Santiago santiago@redhat.com

None

@openshift-ci openshift-ci bot added the release-note-none and approved (Indicates a PR has been approved by an approver from all required OWNERS files) labels Mar 28, 2024
@edsantiago edsantiago marked this pull request as draft March 28, 2024 19:37
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) Mar 28, 2024
@edsantiago edsantiago force-pushed the e2e-var-tmp branch 3 times, most recently from 626030f to 1e30d68 Compare April 1, 2024 19:58
@edsantiago
Member Author

Well, it's working. I'm leaving this as draft because it's a pretty big change, but I consider this ready for review.

@Luap99
Member

Luap99 commented Apr 2, 2024

hmm, how important is this? Is this a requirement for the tmpfs change?

I think using a tmpfs makes sense locally, well at least on systems with enough memory. I think it should be faster and not cause so much wear on the ssd/hdd. Although I haven't yet tried benchmarking both options to see how much of a difference it really makes.

@edsantiago
Member Author

> hmm, how important is this? Is this a requirement for the tmpfs change?

How important: I can't really answer that. I would like for /tmp to be tmpfs on our CI machines (containers/automation_images#340), and maybe @mheon would too, but it's probably not the most pressing issue atm.

And yes, this PR is a hard requirement for the above automation_images PR. e2e tests consume huge amounts of disk space. They won't even come close to running in tmpfs on a CI VM.

Timing: great question and one that is troubling me. This seems to be slower, but I haven't figured out why.

Timing results from this PR:

| type | distro | user | DB | local | remote | container |
|------|--------|------|------|-------|--------|-----------|
| int | rawhide | root | | 38:14 | 38:12 | |
| int | rawhide | rootless | | 37:20 | | |
| int | fedora-39 | root | | 38:31 | 35:44 | 37:01 |
| int | fedora-39 | rootless | | 39:09 | | |
| int | fedora-38 | root | boltdb | 40:55 | 40:13 | 39:16 |
| int | fedora-38 | rootless | boltdb | 40:21 | | |
| int | debian-13 | root | | 37:40 | 38:23 | |
| int | debian-13 | rootless | | 34:08 | | |

Timing results from #22229, the most recently merged PR on main:

| type | distro | user | DB | local | remote | container |
|------|--------|------|------|-------|--------|-----------|
| int | rawhide | root | | 33:46 | 31:56 | |
| int | rawhide | rootless | | 34:12 | | |
| int | fedora-39 | root | | 35:46 | 33:22 | 32:44 |
| int | fedora-39 | rootless | | 32:35 | | |
| int | fedora-38 | root | boltdb | 37:57 | 35:00 | 32:40 |
| int | fedora-38 | rootless | boltdb | 34:54 | | |
| int | debian-13 | root | | 32:49 | 31:04 | |
| int | debian-13 | rootless | | 33:03 | | |

I'm not seeing the same slowdown on #17831 (which I'm using for extra testing). Now that I've posted (and preserved) timing results, I'm going to just re-push this and see how it goes. Yesterday's CI run had a lot of quay flakes that required reruns.

@Luap99
Member

Luap99 commented Apr 2, 2024

CI time is definitely a big factor but given we didn't use tmpfs before I do not think this changes anything for CI in particular other than some paths.

I am more worried about local runs. It has been a while, but the last time I worked on the major e2e changes I was able to run the full suite in under 5 minutes on my laptop (12 threads). Given the much faster CPU compared to CI, I think I/O could become a big bottleneck locally, and given that I have 32 GB of RAM I like to use it.

Comment on lines 141 to 143
Member Author


Fair point re: local runs. My initial design had an envariable check here, PODMAN_E2E_TMPDIR, but it felt too clunky. I could instead do:

    if the $CI envariable is defined
        use /var/tmp
    else
        use $TMPDIR
Would that help?

Member


It would help me, but I am not sure this condition actually depends on CI. It depends on whether a system has enough available RAM to run the suite fully in RAM. If I were to run this on my small Pi it most likely would not work. So maybe we should measure how much RAM we actually need and decide based on that.

Also, for CI it might be worth investigating running it fully from memory, given that we want to make CI faster.
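
Gating on available RAM rather than on $CI could look roughly like this in Go (a sketch only; `parseMemAvailableKB` and the 16 GiB threshold are invented for illustration, and the real cutoff would come from measuring the suite's actual footprint):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemAvailableKB extracts MemAvailable (in KiB) from
// /proc/meminfo-style text; returns false if the field is absent.
func parseMemAvailableKB(meminfo string) (int64, bool) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			return kb, err == nil
		}
	}
	return 0, false
}

func main() {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		fmt.Println("no /proc/meminfo; keep workdirs on /var/tmp")
		return
	}
	kb, ok := parseMemAvailableKB(string(data))
	// Placeholder threshold: 16 GiB. The thread suggests measuring
	// how much RAM the suite really needs before picking a number.
	const thresholdKB = 16 * 1024 * 1024
	if ok && kb >= thresholdKB {
		fmt.Println("enough RAM: a tmpfs workdir is plausible")
	} else {
		fmt.Println("low RAM: keep workdirs on /var/tmp")
	}
}
```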

@edsantiago
Member Author

Better results this morning:

| type | distro | user | DB | local | remote | container |
|------|--------|------|------|-------|--------|-----------|
| int | rawhide | root | | 35:13 | 32:52 | |
| int | rawhide | rootless | | 33:53 | | |
| int | fedora-39 | root | | 34:52 | 37:18 | 34:06 |
| int | fedora-39 | rootless | | 34:13 | | |
| int | fedora-38 | root | boltdb | 35:35 | 34:01 | 32:11 |
| int | fedora-38 | rootless | boltdb | 39:52 | | |
| int | debian-13 | root | | 34:19 | 33:59 | |
| int | debian-13 | rootless | | 31:36 | | |

@edsantiago edsantiago force-pushed the e2e-var-tmp branch 6 times, most recently from b70abba to 4eabb75 Compare April 9, 2024 11:48
@edsantiago edsantiago force-pushed the e2e-var-tmp branch 2 times, most recently from cddfb9e to f4bc63e Compare April 11, 2024 14:23
@edsantiago edsantiago marked this pull request as ready for review April 11, 2024 18:57
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label (Indicates that a PR should not merge because it is a work in progress) Apr 11, 2024
@edsantiago
Member Author

I'm calling this ready, but would like a lot of eyeballs on it please.

@packit-as-a-service

Ephemeral COPR build failed. @containers/packit-build please check.

@rhatdan
Member

rhatdan commented Apr 22, 2024

@Luap99 I think this is in your bucket now. Very few people run the full test suite locally, and everyone suffers waiting for the test suite to complete, not to mention the cost and the planet, so I think we should move forward on this.

@edsantiago edsantiago force-pushed the e2e-var-tmp branch 2 times, most recently from 4fb8065 to d58c503 Compare April 22, 2024 16:33
Member

@cevich cevich left a comment


FWIW, LGTM on the CI side of things, I double-checked hack/get_ci_vm.sh and contrib/cirrus/lib.sh and don't see any reason there would be problems.

Member

@Luap99 Luap99 left a comment


Some comments; I will also run this locally to check whether there are any new failures.

Comment on lines 144 to 149
Member


I really dislike this CI magic, different behaviours between local and CI are a nightmare to debug.

Member


I am fine with keeping this for now

Member Author


CI VMs don't have enough RAM to run these tests in tmpfs. If we choose to experiment with super-duper RAM VMs, that's groovy, and it can easily be layered on top of this PR. In fact it can't even be started without this PR, because this PR is what makes /tmp be tmpfs.

All other feedback addressed, thank you.

Member


Drive-by suggestion: I don't think we need the entire /var/tmp on a ramdisk. It might be worthwhile to use some disk-space tracking tool to see what /var/tmp/subtest-* consumes over time. Maybe the super-duper RAM VMs don't need to be as big as we think? Also, improvements to test-cleanup code could help out a lot here.

Member


> CI VMs don't have enough RAM to run these tests in tmpfs. If we choose to experiment with super-duper RAM VMs, that's groovy, and it can easily be layered on top of this PR. In fact it can't even be started without this PR, because this PR is what makes /tmp be tmpfs.

Which is why I said I am fine with this for now.

I don't suggest making /var/tmp tmpfs; we simply need to switch back to /tmp as root. This PR has the necessary groundwork to easily switch the basedir, so we can experiment with it once it is merged.

@Luap99
Member

Luap99 commented Apr 23, 2024

I did some local runs: I can run the rootless e2e tests in 4m when based on tmpfs (the test images used were cached on real disk). Using /var/tmp (my local SSD) it took over 6m20s, so more than 50% slower. So there is big potential in time savings.

This PR doesn't make it worse, so I think I am fine going ahead with it once you address my comments; I can spend more time tweaking options later.

Member

@Luap99 Luap99 left a comment


LGTM

@openshift-ci
Contributor

openshift-ci bot commented Apr 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: edsantiago, Luap99

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Luap99
Member

Luap99 commented Apr 26, 2024

@baude @mheon PTAL

@mheon
Member

mheon commented Apr 26, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) Apr 26, 2024
@mheon
Member

mheon commented Apr 26, 2024

Needs a rebase

@openshift-merge-robot openshift-merge-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) Apr 26, 2024
TMPDIR is typically /tmp which is typically(*) a tmpfs.

This PR ignores $TMPDIR when $CI is defined, forcing all
e2e tests to set up one central working directory in /var/tmp
instead.

Also, lots of cleanup.

 (*) For many years, up to and still including the time of
     this PR, /tmp on Fedora CI VMs is actually NOT tmpfs,
     it is just / (root). This is nonstandard and undesirable.
     Efforts are underway to remove this special case.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label (Indicates that a PR is ready to be merged) Apr 26, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD) Apr 26, 2024
@mheon
Member

mheon commented Apr 26, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged) Apr 26, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 9ac6d9d into containers:main Apr 26, 2024
@edsantiago edsantiago deleted the e2e-var-tmp branch April 26, 2024 22:29
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR label (Assist humans wanting to comment on an old issue or PR with locked comments) Jul 26, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jul 26, 2024