test/system: Optimize the resource limits tests #1552
They haven't been of any use lately, and they do add some extra noise to each line in the CI logs. containers#1551
Commit 87eaeea already added a dependency on Bats >= 1.10.0, which is present on Fedora >= 39. Therefore, it should be exploited wherever possible to simplify things. containers#1551
Build failed. ✔️ unit-test SUCCESS in 6m 04s
50ee96a to c221f56
Build failed. ✔️ unit-test SUCCESS in 5m 37s
c221f56 to c87d967
Build succeeded. ✔️ unit-test SUCCESS in 5m 36s
c87d967 to 4eba8db
Build succeeded. ✔️ unit-test SUCCESS in 5m 24s
The test suite has expanded to 415 system tests. These tests can be very I/O intensive, because many of them copy OCI images from the test suite's image cache directory to its local container/storage store, create containers, and then delete everything to run the next test with a clean slate. This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit of 3 hours or 10800 seconds for jobs [1], and this is what Software Factory uses [2]. So, there comes a point beyond which the CI can't be prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in parallel across multiple nodes. This has been implemented by splitting the system tests into different groups, which are run separately by different nodes.

First, the tests were grouped into those that test commands and options accepted by the toolbox(1) binary, and those that test the runtime environment within the Toolbx containers. The first group has more tests, but runs faster, because many of them test error handling and don't do much I/O.

The runtime environment tests take especially long on Fedora Rawhide nodes, which are often slower than the stable Fedora nodes, possibly because Rawhide uses Linux kernels that are built with debugging enabled. Therefore, this group of tests was further split for Rawhide nodes by the Toolbx images they use. Apart from reducing the number of tests in each group, this should also reduce the amount of time spent in downloading the images.

The split has been implemented with Bats' tagging system, which is available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea already added a dependency on Bats >= 1.10.0, so there's nothing to worry about.

At the moment, Bats doesn't expose the tags being used to run the test suite to setup_suite() and teardown_suite() [4]. Therefore, the TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the contents of setup_suite().

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html
[2] Commit 83f28c5 containers@83f28c52e47c2d44 containers#1548
[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html
[4] bats-core/bats-core#1006
containers#1551
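The tag-aware setup_suite() described above could look roughly like the sketch below. It is only an illustration, not the code from the commit. The tag names (image-fedora, image-ubuntu), the image names, and the helpers tag_is_selected() and cache_image() are all hypothetical. It also assumes that TOOLBX_TEST_SYSTEM_TAGS holds a plain comma-separated list of tags, without the negation syntax that Bats' --filter-tags option accepts.

```shell
#!/usr/bin/env bash
# Sketch only: every function, tag and image name here is hypothetical.

# Succeeds if the given tag is selected: either no tag filter was
# exported, or the comma-separated filter contains the tag.
tag_is_selected() {
  local wanted="$1"
  local tags="${TOOLBX_TEST_SYSTEM_TAGS:-}"

  # An empty filter means the whole suite runs, so everything is needed.
  if [ -z "$tags" ]; then
    return 0
  fi

  # Wrap both sides in commas so that only whole tags match.
  case ",$tags," in
    *",$wanted,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Stand-in for the expensive step of downloading and caching an image.
cache_image() {
  echo "caching $1"
}

# Bats calls setup_suite() once per run, but doesn't tell it which tags
# were requested, so the caller exports them separately.
setup_suite() {
  if tag_is_selected "image-fedora"; then
    cache_image "fedora-toolbox"
  fi

  if tag_is_selected "image-ubuntu"; then
    cache_image "ubuntu-toolbox"
  fi
}

# Demo: pretend the suite was started for only the Ubuntu image group.
TOOLBX_TEST_SYSTEM_TAGS="image-ubuntu"
setup_suite
```

In a real run, the tests themselves would carry the tags through Bats' `# bats test_tags=...` or `# bats file_tags=...` comments, and the CI job would pass the same tags both to `bats --filter-tags` and through the environment variable.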
The system tests can be very I/O intensive, because many of them copy OCI images from the test suite's image cache directory to its local container/storage store, create containers, and then delete everything to run the next test with a clean slate. This makes them slow.

The runtime environment tests, which include the resource limit tests, are particularly slow because they don't skip the I/O even when testing error handling. This makes them a good target for optimizations.

The resource limit tests query the values for different resources from the same default container without changing its state. Therefore, a lot of disk I/O can be avoided by creating the default container only once for all the tests.

This can save as much as 30 minutes.
containers#1552
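The idea of creating the default container once per file, instead of once per test, can be modelled in plain Bash as below. This is a sketch under assumptions, not the actual change. In a real .bats file the tests are written with `@test` and Bats itself calls setup_file() and teardown_file() once per file. Here a small hypothetical driver, run_file(), mimics that ordering so the snippet runs on its own. create_default_container() and the test functions are stand-ins, and the stand-in tests query the limits of the current shell, not those of a container.

```shell
#!/usr/bin/env bash
# Sketch only: a plain-Bash model of Bats' per-file hooks.

creations=0

# Stand-in for the expensive, I/O heavy step (hypothetical name).
create_default_container() {
  creations=$((creations + 1))
}

# Bats runs setup_file() once before the first test in a file, unlike
# setup(), which it runs before every test.
setup_file() {
  create_default_container
}

# Bats runs teardown_file() once after the last test in a file.
teardown_file() {
  : # a real suite would remove the shared container here
}

# Stand-ins for the tests. They only read values and never change the
# shared state, which is what makes sharing the container safe.
test_open_files_limit() {
  ulimit -n >/dev/null
}

test_stack_size_limit() {
  ulimit -s >/dev/null
}

# Hypothetical driver that mimics the order in which Bats calls the hooks.
run_file() {
  local t

  setup_file
  for t in "$@"; do
    "$t"
  done
  teardown_file
}

run_file test_open_files_limit test_stack_size_limit
echo "containers created: $creations"
```

With the creation moved from a per-test hook to the per-file hook, the count stays at one no matter how many read-only tests are listed. This only works because none of the tests modify the container.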
4eba8db to fb9e2e7
Build failed. ✔️ unit-test SUCCESS in 5m 44s
recheck
Build succeeded. ✔️ unit-test SUCCESS in 5m 43s
containers#1552 containers#1742 (backported from commit fb9e2e7)