.zuul, playbooks, test/system: Optimize the CI on Fedora nodes #1551
They haven't been of any use lately, and they do add some extra noise to each line in the CI logs. containers#1551
Force-pushed from 2702126 to cb98871.
Build failed. ✔️ unit-test SUCCESS in 5m 45s
recheck
Build failed. ✔️ unit-test SUCCESS in 5m 45s
Commit 87eaeea already added a dependency on Bats >= 1.10.0, which is present on Fedora >= 39. Therefore, it should be exploited wherever possible to simplify things. containers#1551
Build failed. ✔️ unit-test SUCCESS in 5m 50s
Force-pushed from edd16ae to e435704.
The test suite has expanded to 415 system tests. These tests can be very I/O intensive, because many of them copy OCI images from the test suite's image cache directory to its local container/storage store, create containers, and then delete everything to run the next test with a clean slate. This makes the system tests slow. Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit of 3 hours or 10800 seconds for jobs [1], and this is what Software Factory uses [2]. So, there comes a point beyond which the CI can't be prevented from timing out by increasing the timeout. One way of scaling past this maximum time limit is to run the tests in parallel across multiple nodes. This has been implemented by splitting the system tests into two groups, which are run separately by different nodes. The split has been implemented with Bats' tagging system that is available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea already added a dependency on Bats >= 1.10.0. So, there's nothing to worry about. [1] https://zuul-ci.org/docs/zuul/latest/tenants.html [2] Commit 83f28c5 containers@83f28c52e47c2d44 containers#1548 [3] https://bats-core.readthedocs.io/en/stable/writing-tests.html containers#1551
Build failed. ✔️ unit-test SUCCESS in 4m 54s
The test suite has expanded to 415 system tests. These tests can be very I/O intensive, because many of them copy OCI images from the test suite's image cache directory to its local container/storage store, create containers, and then delete everything to run the next test with a clean slate. This makes the system tests slow. Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit of 3 hours or 10800 seconds for jobs [1], and this is what Software Factory uses [2]. So, there comes a point beyond which the CI can't be prevented from timing out by increasing the timeout. One way of scaling past this maximum time limit is to run the tests in parallel across multiple nodes. This has been implemented by splitting the system tests into two groups, which are run separately by different nodes. The split has been implemented with Bats' tagging system that is available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea already added a dependency on Bats >= 1.10.0. So, there's nothing to worry about. At the moment, Bats doesn't expose the tags being used to run the test suite to setup_suite(), teardown_suite() and the tests themselves [4]. Therefore, the TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the contents of setup_suite(). [1] https://zuul-ci.org/docs/zuul/latest/tenants.html [2] Commit 83f28c5 containers@83f28c52e47c2d44 containers#1548 [3] https://bats-core.readthedocs.io/en/stable/writing-tests.html [4] bats-core/bats-core#1006 containers#1551
Build failed. ✔️ unit-test SUCCESS in 5m 57s
Force-pushed from 91561a7 to 5ec71fb.
Build failed. ✔️ unit-test SUCCESS in 5m 34s
Force-pushed from 5ec71fb to 987f5e2.
Build failed. ✔️ unit-test SUCCESS in 5m 42s
The runtime environment tests have been timing out frequently on stable Fedora nodes. Instead of taking the shortcut of increasing the timeout, they were split by the Toolbx images they use, similar to what already happens for Fedora Rawhide nodes [1]. [1] Commit 987f5e2 containers@987f5e259289b4b3 containers#1551 containers#1571
The test suite has expanded to 415 system tests. These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local container/storage store,
create containers, and then delete everything to run the next test with
a clean slate. This makes the system tests slow.
Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2]. So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.
One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes. This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.
First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers. The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.
The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled. Therefore, this group of tests was further split for Rawhide
nodes by the Toolbx images they use. Apart from reducing the number of
tests in each group, this should also reduce the amount of time spent
downloading the images.
The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0. So, there's nothing to
worry about.
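As a rough illustration of how such a tag-based split can look, the sketch below writes two small Bats files and prints the command a node might run for one group. The `# bats file_tags=` and `# bats test_tags=` comments and the `--filter-tags` option are Bats features available from 1.8.0. The tag names (`commands-options`, `runtime-environment`, `fedora`), the file names and both helper functions are assumptions made for this example, not necessarily what this pull request uses.

```shell
#!/usr/bin/env bash
# Sketch only: tag names, file names and helpers below are hypothetical.

# Write two example Bats files into the directory given as $1.
write_tagged_tests() {
    local dir="$1"

    # Every test in this file gets the (assumed) 'commands-options' tag.
    cat >"$dir/101-example-commands.bats" <<'EOF'
# bats file_tags=commands-options

@test "example: an unknown flag is rejected" {
  run ls --no-such-flag
  [ "$status" -ne 0 ]
}
EOF

    # File-level tag for the group, plus a per-test tag for the image.
    cat >"$dir/201-example-runtime.bats" <<'EOF'
# bats file_tags=runtime-environment

# bats test_tags=fedora
@test "example: runtime check for one image" {
  run true
  [ "$status" -eq 0 ]
}
EOF
}

# Print the command a node would run for one group. A comma-separated
# list passed to --filter-tags selects tests that carry all listed tags.
bats_command_for_group() {
    local tags="$1"
    local dir="$2"
    printf 'bats --filter-tags %s %s\n' "$tags" "$dir"
}
```

With these helpers, one node could run `bats --filter-tags commands-options <dir>` while another runs `bats --filter-tags runtime-environment,fedora <dir>`, so that each node only executes its own share of the suite.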
At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4]. Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().
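A minimal sketch of that idea follows, assuming that TOOLBX_TEST_SYSTEM_TAGS holds the same comma-separated tag list that is passed to Bats, and that an unset or empty value means the whole suite is being run. Only the variable name comes from this commit message. The helper `suite_needs_tag`, the tag names and the echoed messages are hypothetical stand-ins for whatever the real setup_suite() does.

```shell
#!/usr/bin/env bash
# Sketch only: the format of TOOLBX_TEST_SYSTEM_TAGS is an assumption.

# Succeed if the tag in $1 is among the comma-separated tags in
# TOOLBX_TEST_SYSTEM_TAGS, or if the variable is unset or empty.
suite_needs_tag() {
    local wanted="$1"
    local tags="${TOOLBX_TEST_SYSTEM_TAGS:-}"

    if [ -z "$tags" ]; then
        return 0
    fi

    # Surround both sides with commas so that only whole tags match.
    case ",$tags," in
        *",$wanted,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Stand-in for the real setup_suite(): skip expensive preparation
# when the selected tags show that no test will need it.
setup_suite() {
    if suite_needs_tag "runtime-environment"; then
        echo "caching images for runtime tests"
    else
        echo "skipping image cache"
    fi
}
```

A node running only the command and option tests would export `TOOLBX_TEST_SYSTEM_TAGS=commands-options` before invoking Bats, and this sketch would then skip the image caching step.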
[1] https://zuul-ci.org/docs/zuul/latest/tenants.html
[2] Commit 83f28c52e47c2d44
    #1548
[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html
[4] bats-core/bats-core#1006