.zuul, playbooks, test/system: Optimize the CI on Fedora nodes#1551

Merged
debarshiray merged 3 commits intocontainers:mainfrom
debarshiray:wip/rishi/zuul-test-system-split-into-two
Sep 28, 2024


@debarshiray debarshiray commented Sep 26, 2024

The test suite has expanded to 415 system tests. These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate. This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2]. So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes. This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.

First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers. The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.

The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled. Therefore, this group of tests was further split for Rawhide
nodes by the Toolbx images they use. Apart from reducing the number of
tests in each group, this should also reduce the time spent downloading
the images.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3]. Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0. So, there's nothing to
worry about.
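
As a rough sketch of how a tag-based split can be driven, the snippet
below maps a group name to a `bats --filter-tags` invocation. The group
names (`commands-options`, `runtime-environment`), the helper name and
the `test/system` default path are assumptions made for illustration,
and are not necessarily what this change uses. Only the
`# bats file_tags=` directive and the `--filter-tags` option are taken
from the Bats documentation.

```shell
#!/usr/bin/env bash
# Sketch only: the tag names and the test directory are illustrative.
#
# A test file opts into a group with a directive such as:
#   # bats file_tags=commands-options
# and a run is limited to that group with `bats --filter-tags`.

# Print the Bats command line that would run a single group of tests.
bats_command_for_group() {
    local group="$1"
    local test_dir="${2:-test/system}"

    case "$group" in
        commands-options | runtime-environment)
            printf 'bats --filter-tags %s %s\n' "$group" "$test_dir"
            ;;
        *)
            # Refuse unknown groups instead of silently running nothing.
            printf 'unknown test group: %s\n' "$group" >&2
            return 1
            ;;
    esac
}

bats_command_for_group commands-options
bats_command_for_group runtime-environment
```

Each CI node would then run only the command printed for its own group,
so no single job has to get through the whole suite within the timeout.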

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4]. Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().
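
A minimal sketch of that idea follows, assuming that
TOOLBX_TEST_SYSTEM_TAGS holds the same comma-separated tag list that is
passed to Bats. The image group names (`arch-fedora`, `ubuntu`) and the
helper functions are hypothetical, and the `echo` stands in for
whatever setup_suite() really does to prepare the image cache.

```shell
#!/usr/bin/env bash
# Sketch only: the group names and helpers below are hypothetical.

# Return success if the comma-separated list in $1 contains the tag $2.
tag_list_contains() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the image groups that setup_suite() would have to cache.
images_to_cache() {
    local tags="${TOOLBX_TEST_SYSTEM_TAGS:-}"
    local image

    for image in arch-fedora ubuntu; do
        # With no tags every test runs, so every image group is needed.
        if [ -z "$tags" ] || tag_list_contains "$tags" "$image"; then
            printf '%s\n' "$image"
        fi
    done
}

setup_suite() {
    local image

    for image in $(images_to_cache); do
        # Stand-in for the real download and caching step.
        echo "caching images for: $image"
    done
}

TOOLBX_TEST_SYSTEM_TAGS=runtime-environment,ubuntu setup_suite
```

Leaving the variable unset keeps the old behaviour of preparing every
image, so running the whole suite locally is unaffected.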

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    83f28c52e47c2d44
    #1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

[4] bats-core/bats-core#1006

They haven't been of any use lately, and they do add some extra noise to
each line in the CI logs.

containers#1551
@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from 2702126 to cb98871 Compare September 26, 2024 20:42
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/dd3db45aedd843888fe2ac05107b3d06

✔️ unit-test SUCCESS in 5m 45s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 30s
✔️ unit-test-restricted SUCCESS in 5m 51s
system-test-fedora-rawhide TIMED_OUT in 3h 00m 17s
✔️ system-test-fedora-41 SUCCESS in 2h 11m 40s
✔️ system-test-fedora-40 SUCCESS in 2h 16m 09s
✔️ system-test-fedora-39 SUCCESS in 2h 17m 23s

@debarshiray

recheck

@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/d2d272f2f1ef4c8a91f1bd457bc74906

✔️ unit-test SUCCESS in 5m 45s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 27s
✔️ unit-test-restricted SUCCESS in 5m 45s
system-test-fedora-rawhide TIMED_OUT in 3h 00m 25s
✔️ system-test-fedora-41 SUCCESS in 2h 10m 46s
system-test-fedora-40 TIMED_OUT in 2h 30m 35s
✔️ system-test-fedora-39 SUCCESS in 2h 20m 09s

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
Commit 87eaeea already added a dependency on Bats >= 1.10.0,
which is present on Fedora >= 39.  Therefore, it should be exploited
wherever possible to simplify things.

containers#1551
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/d92dec050587422293cc37f199796a33

✔️ unit-test SUCCESS in 5m 50s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 14s
✔️ unit-test-restricted SUCCESS in 5m 45s
system-test-fedora-rawhide TIMED_OUT in 3h 00m 24s
✔️ system-test-fedora-41 SUCCESS in 2h 04m 20s
✔️ system-test-fedora-40 SUCCESS in 2h 10m 14s
✔️ system-test-fedora-39 SUCCESS in 2h 06m 13s

@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from edd16ae to e435704 Compare September 27, 2024 12:22
@debarshiray debarshiray changed the title from "[WIP] tag" to ".zuul, playbooks, test/system: Make the CI take less time on Fedora" Sep 27, 2024
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
The test suite has expanded to 415 system tests.  These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate.  This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2].  So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes.  This has been implemented by splitting
the system tests into two groups, which are run separately by different
nodes.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3].  Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0.  So, there's nothing to
worry about.

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    containers@83f28c52e47c2d44
    containers#1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

containers#1551
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/c4cf55c66b6b4b3cbf13922642f4e861

✔️ unit-test SUCCESS in 4m 54s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 02s
✔️ unit-test-restricted SUCCESS in 4m 46s
system-test-fedora-rawhide TIMED_OUT in 3h 00m 24s
✔️ system-test-fedora-41 SUCCESS in 2h 04m 52s
✔️ system-test-fedora-40 SUCCESS in 2h 08m 25s
✔️ system-test-fedora-39 SUCCESS in 2h 07m 20s

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
The test suite has expanded to 415 system tests.  These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate.  This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2].  So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes.  This has been implemented by splitting
the system tests into two groups, which are run separately by different
nodes.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3].  Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0.  So, there's nothing to
worry about.

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite(), teardown_suite() and the tests themselves [4].
Therefore, the TOOLBX_TEST_SYSTEM_TAGS environment variable was used to
optimize the contents of setup_suite().

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    containers@83f28c52e47c2d44
    containers#1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

[4] bats-core/bats-core#1006

containers#1551
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
containers#1551
@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from 91561a7 to 5ec71fb Compare September 27, 2024 21:52
@debarshiray debarshiray changed the title from ".zuul, playbooks, test/system: Make the CI take less time on Fedora" to ".zuul, playbooks, test/system: Optimize the CI on Fedora nodes" Sep 27, 2024
@debarshiray debarshiray force-pushed the wip/rishi/zuul-test-system-split-into-two branch from 5ec71fb to 987f5e2 Compare September 28, 2024 00:01
@debarshiray debarshiray merged commit 987f5e2 into containers:main Sep 28, 2024
@debarshiray debarshiray deleted the wip/rishi/zuul-test-system-split-into-two branch September 28, 2024 11:07
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Oct 23, 2024
Currently, the runtime environment tests have been frequently timing out
on stable Fedora nodes.  Instead of taking the shortcut of increasing
the timeout, they were split by the Toolbx images they use, similar to
what already happens for Fedora Rawhide nodes [1].

[1] Commit 987f5e2
    containers@987f5e259289b4b3
    containers#1551

containers#1571