test/system: Optimize the resource limits tests#1552

Merged
debarshiray merged 4 commits into containers:main from
debarshiray:wip/rishi/zuul-test-system-split-into-two-throwaway
Sep 29, 2024

Conversation

@debarshiray
Member

@debarshiray debarshiray commented Sep 27, 2024

The system tests can be very I/O intensive, because many of them copy
OCI images from the test suite's image cache directory to its local
container/storage store, create containers, and then delete everything
to run the next test with a clean slate. This makes them slow.

The runtime environment tests, which include the resource limit tests,
are particularly slow because they don't skip the I/O even when testing
error handling. This makes them a good target for optimizations.

The resource limit tests query the values for different resources from
the same default container without changing its state. Therefore, a lot
of disk I/O can be avoided by creating the default container only once
for all the tests.

This can save as much as 30 minutes.
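
As a rough illustration of the "create once per file" idea, here is a minimal sketch using Bats' `setup_file()` and `teardown_file()` hooks. It is not the code from this pull request: `create_default_container` is a hypothetical stand-in that only appends to a log file, so the example can run without Toolbx or Podman. The script writes a small demo `.bats` file, because `@test` is Bats syntax and not plain Bash.

```shell
#!/usr/bin/env bash
# Sketch only: writes a demo Bats file in which the expensive fixture is
# created once in setup_file() instead of once per test in setup().
set -eu

demo_dir="$(mktemp -d)"
demo_file="$demo_dir/resource-limits-demo.bats"

cat > "$demo_file" <<'EOF'
# Hypothetical stand-in for the expensive step.  A real suite would copy an
# OCI image and create the default container here.
create_default_container() {
  echo "created" >> "$BATS_FILE_TMPDIR/creations.log"
}

# Bats runs this once per file, before the first test.
setup_file() {
  create_default_container
}

# Bats runs this once per file, after the last test.
teardown_file() {
  rm -f "$BATS_FILE_TMPDIR/creations.log"
}

@test "demo: first read-only query sees a single creation" {
  [ "$(wc -l < "$BATS_FILE_TMPDIR/creations.log")" -eq 1 ]
}

@test "demo: second read-only query sees the same single creation" {
  [ "$(wc -l < "$BATS_FILE_TMPDIR/creations.log")" -eq 1 ]
}
EOF

echo "wrote $demo_file"
```

The generated file can then be run with `bats "$demo_file"`. This only works because the tests read from the shared fixture without modifying it. A test that changes the container's state would still need its own `setup()`.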

They haven't been of any use lately, and they do add some extra noise to
each line in the CI logs.

containers#1551
Commit 87eaeea already added a dependency on Bats >= 1.10.0,
which is present on Fedora >= 39.  Therefore, it should be exploited
wherever possible to simplify things.

containers#1551
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/1db6d936ace1413ab05a39f00d3504a7

✔️ unit-test SUCCESS in 6m 04s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 42s
✔️ unit-test-restricted SUCCESS in 5m 46s
✔️ system-test-fedora-rawhide-commands-and-options SUCCESS in 1h 46m 42s
system-test-fedora-rawhide-runtime-environment TIMED_OUT in 2h 00m 24s
✔️ system-test-fedora-41-commands-and-options SUCCESS in 54m 05s
✔️ system-test-fedora-41-runtime-environment SUCCESS in 1h 27m 14s
✔️ system-test-fedora-40 SUCCESS in 2h 06m 57s
✔️ system-test-fedora-39 SUCCESS in 2h 08m 32s

@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two-throwaway branch from 50ee96a to c221f56 Compare September 27, 2024 16:44
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/b880cd9770f540f9a6c64bf86061a1ec

✔️ unit-test SUCCESS in 5m 37s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 28s
✔️ unit-test-restricted SUCCESS in 5m 26s
✔️ system-test-fedora-rawhide-commands-and-options SUCCESS in 1h 28m 15s
system-test-fedora-rawhide-runtime-environment TIMED_OUT in 2h 00m 27s
✔️ system-test-fedora-41-commands-and-options SUCCESS in 46m 40s
✔️ system-test-fedora-41-runtime-environment SUCCESS in 1h 17m 09s
✔️ system-test-fedora-40-commands-and-options SUCCESS in 46m 56s
✔️ system-test-fedora-40-runtime-environment SUCCESS in 1h 12m 34s
✔️ system-test-fedora-39 SUCCESS in 1h 54m 03s

@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two-throwaway branch from c221f56 to c87d967 Compare September 27, 2024 19:51
@softwarefactory-project-zuul

@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two-throwaway branch from c87d967 to 4eba8db Compare September 27, 2024 22:00
@softwarefactory-project-zuul

The test suite has expanded to 415 system tests.  These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local container/storage store,
create containers, and then delete everything to run the next test with
a clean slate.  This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2].  So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes.  This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.

First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers.  The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.

The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled.  Therefore, this group of tests was further split for Rawhide
nodes by the Toolbx images they use.  Apart from reducing the number of
tests in each group, this should also reduce the amount of time spent
downloading the images.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3].  Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0.  So, there's nothing to
worry about.
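
For readers unfamiliar with Bats tags, the following is a minimal sketch of how a file and its tests can be tagged and then selected. The tag names (`runtime-environment`, `fedora`, `ubuntu`) and the test names are illustrative assumptions, not necessarily the ones used in this repository. The script writes a demo `.bats` file and prints the commands that would select each group.

```shell
#!/usr/bin/env bash
# Sketch only: writes a demo Bats file that uses file-level and test-level
# tags, then prints example invocations that filter on them.
set -eu

demo_dir="$(mktemp -d)"
demo_file="$demo_dir/tags-demo.bats"

cat > "$demo_file" <<'EOF'
# Applies to every test defined after this line in the file.
# bats file_tags=runtime-environment

# Applies only to the test that immediately follows.
# bats test_tags=fedora
@test "demo: tagged runtime-environment and fedora" {
  true
}

# bats test_tags=ubuntu
@test "demo: tagged runtime-environment and ubuntu" {
  true
}
EOF

# Tags listed in one --filter-tags option must all match (logical AND).
# Repeating the option selects the union of the groups (logical OR).
echo "bats --filter-tags runtime-environment,fedora $demo_file"
echo "bats --filter-tags runtime-environment,ubuntu $demo_file"
```

Each CI node could then run one of the printed commands, so that every node executes only its own group of tests.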

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4].  Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().
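
A minimal sketch of how an environment variable can stand in for the missing tag information is shown below. Only `TOOLBX_TEST_SYSTEM_TAGS` and `setup_suite()` come from the text above. The helper names `tag_selected` and `cache_image`, the tag names, and the rule that an unset variable means "prepare everything" are assumptions made for illustration.

```shell
#!/usr/bin/env bash

# Succeeds when tag "$1" appears in the comma-separated
# TOOLBX_TEST_SYSTEM_TAGS, or when the variable is unset or empty (assumed
# here to mean that the whole suite is being run).
tag_selected() {
  local wanted="$1"
  local tags="${TOOLBX_TEST_SYSTEM_TAGS:-}"

  if [ -z "$tags" ]; then
    return 0
  fi

  case ",$tags," in
    *",$wanted,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical stand-in for the expensive image download.
cache_image() {
  echo "caching $1"
}

# Bats calls setup_suite() once, before any test file is run.
setup_suite() {
  local distro
  for distro in fedora ubuntu; do
    if tag_selected "$distro"; then
      cache_image "$distro"
    fi
  done
}
```

With `TOOLBX_TEST_SYSTEM_TAGS=runtime-environment,ubuntu`, this sketch prepares only the `ubuntu` image. A simplification to be aware of: a tag list that names no distribution at all would prepare nothing here, so a real implementation would need a sensible fallback for that case.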

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    containers@83f28c52e47c2d44
    containers#1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

[4] bats-core/bats-core#1006

containers#1551
containers#1552
@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two-throwaway branch from 4eba8db to fb9e2e7 Compare September 27, 2024 23:32
@debarshiray debarshiray changed the title from "[WIP] Throwaway" to "test/system: Optimize the resource limits tests" Sep 27, 2024
@softwarefactory-project-zuul

@debarshiray
Member Author

recheck

@softwarefactory-project-zuul

@debarshiray debarshiray merged commit fb9e2e7 into containers:main Sep 29, 2024
@debarshiray debarshiray deleted the wip/rishi/zuul-test-system-split-into-two-throwaway branch September 29, 2024 11:21
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Jan 27, 2026
containers#1552
containers#1742
(backported from commit fb9e2e7)