.zuul, playbooks, test/system: Optimize the CI on Fedora nodes by debarshiray · Pull Request #1551 · containers/toolbox

debarshiray · 2024-09-26T20:41:32Z

The test suite has expanded to 415 system tests. These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate. This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2]. So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes. This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.

First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers. The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.

The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled. Therefore, this group of tests was further split for Rawhide
nodes by the Toolbx images they use. Apart from reducing the number of
tests in each group, this should also reduce the amount of time spent
downloading the images.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0. So, there's nothing to
worry about.
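
As an illustration of the tagging mechanism, here is a minimal sketch
rather than the actual Toolbx test files. The tag names, the test names
and the `write_demo_bats` helper are made up for this example; only the
`# bats test_tags=` comment syntax and the `--filter-tags` option come
from Bats itself.

```shell
#!/usr/bin/env bash
# Sketch only: write a tiny Bats file containing two tagged tests.
# The tag names and test bodies are hypothetical placeholders.

write_demo_bats() {
  local target="$1"

  cat >"$target" <<'EOF'
# bats test_tags=commands-options
@test "help: exits successfully" {
  true
}

# bats test_tags=runtime-environment
@test "environment: exits successfully" {
  true
}
EOF
}

demo_dir="$(mktemp -d)"
demo_file="$demo_dir/demo.bats"
write_demo_bats "$demo_file"

# With Bats >= 1.8.0 installed, each CI node could then pick its group:
#   bats --filter-tags commands-options "$demo_file"
#   bats --filter-tags runtime-environment "$demo_file"
echo "wrote $demo_file"
```

Each node runs the same test files and differs only in the tag passed
to `--filter-tags`, so no test has to be moved between files to change
the grouping.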

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4]. Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().
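
The idea can be sketched roughly as follows. This is an assumption
about the general shape, not the code from this pull request: the tag
names, the image names and the `images_for_tags` helper are
hypothetical, and the `echo` stands in for the real download step. Only
`setup_suite()` and `TOOLBX_TEST_SYSTEM_TAGS` are named in the text
above.

```shell
#!/usr/bin/env bash
# Sketch only: decide which images to cache from the tags the CI job
# exports. Tag names, image names and this helper are hypothetical.

# Print the images needed for a comma-separated list of tags.
images_for_tags() {
  local tags="$1"
  local tag
  local -a list=()

  IFS=',' read -r -a list <<<"$tags"
  for tag in "${list[@]}"; do
    case "$tag" in
      arch-fedora) printf '%s\n' "arch-toolbox:latest" "fedora-toolbox:40" ;;
      ubuntu)      printf '%s\n' "ubuntu-toolbox:24.04" ;;
      *)           ;;  # tags that need no distro image
    esac
  done
}

setup_suite() {
  local image

  # Unset or empty means no filter: fall back to caching everything.
  local tags="${TOOLBX_TEST_SYSTEM_TAGS:-arch-fedora,ubuntu}"

  while IFS= read -r image; do
    echo "would cache: $image"  # stand-in for the real download step
  done < <(images_for_tags "$tags")
}
```

A job that only runs the hypothetical `ubuntu` group would export
`TOOLBX_TEST_SYSTEM_TAGS=ubuntu` alongside its `--filter-tags` option,
and `setup_suite()` would then skip the other images.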

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c52e47c2d44
    #1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

[4] bats-core/bats-core#1006

softwarefactory-project-zuul · 2024-09-26T23:43:37Z

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/dd3db45aedd843888fe2ac05107b3d06

✔️ unit-test SUCCESS in 5m 45s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 30s
✔️ unit-test-restricted SUCCESS in 5m 51s
❌ system-test-fedora-rawhide TIMED_OUT in 3h 00m 17s
✔️ system-test-fedora-41 SUCCESS in 2h 11m 40s
✔️ system-test-fedora-40 SUCCESS in 2h 16m 09s
✔️ system-test-fedora-39 SUCCESS in 2h 17m 23s

debarshiray · 2024-09-27T00:25:02Z

recheck

softwarefactory-project-zuul · 2024-09-27T04:06:18Z

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/d2d272f2f1ef4c8a91f1bd457bc74906

✔️ unit-test SUCCESS in 5m 45s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 27s
✔️ unit-test-restricted SUCCESS in 5m 45s
❌ system-test-fedora-rawhide TIMED_OUT in 3h 00m 25s
✔️ system-test-fedora-41 SUCCESS in 2h 10m 46s
❌ system-test-fedora-40 TIMED_OUT in 2h 30m 35s
✔️ system-test-fedora-39 SUCCESS in 2h 20m 09s

softwarefactory-project-zuul · 2024-09-27T12:03:12Z

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/d92dec050587422293cc37f199796a33

✔️ unit-test SUCCESS in 5m 50s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 14s
✔️ unit-test-restricted SUCCESS in 5m 45s
❌ system-test-fedora-rawhide TIMED_OUT in 3h 00m 24s
✔️ system-test-fedora-41 SUCCESS in 2h 04m 20s
✔️ system-test-fedora-40 SUCCESS in 2h 10m 14s
✔️ system-test-fedora-39 SUCCESS in 2h 06m 13s

softwarefactory-project-zuul · 2024-09-27T15:23:09Z

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/c4cf55c66b6b4b3cbf13922642f4e861

✔️ unit-test SUCCESS in 4m 54s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 02s
✔️ unit-test-restricted SUCCESS in 4m 46s
❌ system-test-fedora-rawhide TIMED_OUT in 3h 00m 24s
✔️ system-test-fedora-41 SUCCESS in 2h 04m 52s
✔️ system-test-fedora-40 SUCCESS in 2h 08m 25s
✔️ system-test-fedora-39 SUCCESS in 2h 07m 20s

softwarefactory-project-zuul · 2024-09-27T19:18:03Z

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/76ee9193324f48aa802848d1272d29df

✔️ unit-test SUCCESS in 5m 57s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 12s
✔️ unit-test-restricted SUCCESS in 5m 38s
✔️ system-test-fedora-rawhide-commands-and-options SUCCESS in 1h 28m 19s
❌ system-test-fedora-rawhide-runtime-environment TIMED_OUT in 2h 00m 22s
✔️ system-test-fedora-41-commands-and-options SUCCESS in 49m 45s
✔️ system-test-fedora-41-runtime-environment SUCCESS in 1h 18m 19s
✔️ system-test-fedora-40-commands-and-options SUCCESS in 48m 54s
✔️ system-test-fedora-40-runtime-environment SUCCESS in 1h 25m 34s
✔️ system-test-fedora-39-commands-and-options SUCCESS in 46m 54s
✔️ system-test-fedora-39-runtime-environment SUCCESS in 1h 14m 24s

softwarefactory-project-zuul · 2024-09-27T23:53:49Z

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/a61e7842cb16452a9314f4f74f93fc9c

softwarefactory-project-zuul · 2024-09-28T02:04:49Z

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/ae41db94c2a94fd19347de7002e266e2

Currently, the runtime environment tests have been frequently timing out on stable Fedora nodes. Instead of taking the shortcut of increasing the timeout, they were split by the Toolbx images they use, similar to what already happens for Fedora Rawhide nodes [1]. [1] Commit 987f5e2 containers@987f5e259289b4b3 containers#1551 containers#1571

debarshiray requested a review from martymichal as a code owner September 26, 2024 20:41

playbooks/system-test: Remove Bats' timing information

cb98871

They haven't been of any use lately, and they do add some extra noise to each line in the CI logs. containers#1551

debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from 2702126 to cb98871 Compare September 26, 2024 20:42

test/system: Simplify line count checks by using Bats >= 1.10.0

e435704

Commit 87eaeea already added a dependency on Bats >= 1.10.0, which is present on Fedora >= 39. Therefore, it should be exploited wherever possible to simplify things. containers#1551

debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from edd16ae to e435704 Compare September 27, 2024 12:22

debarshiray changed the title ~~[WIP] tag~~ .zuul, playbooks, test/system: Make the CI take less time on Fedora Sep 27, 2024

debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from 91561a7 to 5ec71fb Compare September 27, 2024 21:52

debarshiray changed the title ~~.zuul, playbooks, test/system: Make the CI take less time on Fedora~~ .zuul, playbooks, test/system: Optimize the CI on Fedora nodes Sep 27, 2024

debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from 5ec71fb to 987f5e2 Compare September 28, 2024 00:01

debarshiray merged commit 987f5e2 into containers:main Sep 28, 2024

debarshiray deleted the wip/rishi/zuul-test-system-split-into-two branch September 28, 2024 11:07

debarshiray mentioned this pull request Oct 23, 2024

.zuul, playbooks: Optimize the CI on stable Fedora nodes #1571

Merged

debarshiray merged 3 commits into containers:main from debarshiray:wip/rishi/zuul-test-system-split-into-two
