
build: Clarify which linker flags are supported and use the same ones as NVIDIA Container Toolkit #1548

Merged
debarshiray merged 2 commits into containers:main from
debarshiray:wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols
Sep 26, 2024

Conversation

@debarshiray
Member

No description provided.

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 25, 2024
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag, because of how the NVIDIA Container Runtime
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [1,2].

The NVIDIA Container Runtime stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it.  At build time, it
links against undefined symbols declared through a CUDA API definition
embedded directly in the CGO code or a copy of nvml.h.  It relies upon
lazily deferring function call resolution to the point when dlopen(3)
is able to load the shared libraries at runtime, instead of doing it
when toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
  $ nm /path/to/toolbox | grep ' subid_init'

... with those from:
  $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
          U nvmlGpuInstanceGetComputeInstanceProfileInfoV
  $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
          U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
  $ toolbox
  toolbox: symbol lookup error: toolbox: undefined symbol:
      nvmlGpuInstanceGetComputeInstanceProfileInfoV

Fallout from 6e848b2

[1] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[2] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

containers#1548
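The lazy-binding behaviour described in this commit message can be reproduced with a small self-contained experiment. Everything below is illustrative, not taken from the Toolbx build: answer() stands in for an NVML symbol, and the file names and flag combination are assumptions. A program references a function that no library provides at link time, and dlopen(3) with RTLD_GLOBAL supplies it just before the first call, the way the NVIDIA stack supplies the NVML symbols.

```shell
# Illustrative experiment (not the Toolbx build itself): answer() plays
# the role of an NVML symbol that only exists in a dlopen(3)ed library.
cat > libimpl.c << 'EOF'
int answer(void) { return 42; }
EOF

cat > main.c << 'EOF'
#include <dlfcn.h>
#include <stdio.h>

int answer(void);  /* undefined at link time, like the NVML symbols */

int main(void) {
    /* Load the provider first, like loading libnvidia-ml.so.1 at runtime. */
    if (!dlopen("./libimpl.so", RTLD_NOW | RTLD_GLOBAL))
        return 1;
    printf("%d\n", answer());  /* resolved lazily, after dlopen(3) */
    return 0;
}
EOF

gcc -shared -fPIC -o libimpl.so libimpl.c

# '-z lazy' defers resolving answer() until it is first called:
gcc -o lazy main.c -Wl,-z,lazy -Wl,--unresolved-symbols=ignore-all -ldl
./lazy

# '-z now' forces resolution at startup, before dlopen(3) can run,
# reproducing the "symbol lookup error ... undefined symbol" failure:
gcc -o now main.c -Wl,-z,now -Wl,--unresolved-symbols=ignore-all -ldl
./now || echo "startup failed as expected"
```

With '-z lazy' the PLT entry for answer() is resolved on first call, after dlopen(3) has already added libimpl.so to the global scope; with '-z now' the dynamic linker tries to resolve it before main() runs and aborts.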
@debarshiray debarshiray force-pushed the wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols branch from a130caa to 230434f on September 25, 2024 19:45
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 25, 2024
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it.  At build time, it
links against undefined symbols declared through a CUDA API definition
embedded directly in the CGO code or a copy of nvml.h.  It relies upon
lazily deferring function call resolution to the point when dlopen(3) is
able to load the shared libraries at runtime, instead of doing it when
toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
  $ nm /path/to/toolbox | grep ' subid_init'

... with those from:
  $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
          U nvmlGpuInstanceGetComputeInstanceProfileInfoV
  $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
          U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
  $ toolbox
  toolbox: symbol lookup error: toolbox: undefined symbol:
      nvmlGpuInstanceGetComputeInstanceProfileInfoV

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
    NVIDIA/go-nvml#18
    NVIDIA/nvidia-container-toolkit#49

[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

containers#1548
@debarshiray debarshiray force-pushed the wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols branch from 230434f to c46765f on September 25, 2024 20:09
@debarshiray debarshiray changed the title from "[WIP] ..." to "[WIP] build: Clarify which linker flags are supported and use the same ones as NVIDIA Container Toolkit" on Sep 25, 2024
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/4f4221a7ca024d38868e6c28346b7d89

✔️ unit-test SUCCESS in 5m 37s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 28s
✔️ unit-test-restricted SUCCESS in 5m 29s
system-test-fedora-rawhide TIMED_OUT in 2h 10m 24s
✔️ system-test-fedora-40 SUCCESS in 1h 49m 49s
✔️ system-test-fedora-39 SUCCESS in 1h 52m 42s

@debarshiray
Member Author

recheck

@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/559440e69326447c8508c004ec8fd218

✔️ unit-test SUCCESS in 5m 36s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 06s
✔️ unit-test-restricted SUCCESS in 5m 28s
system-test-fedora-rawhide TIMED_OUT in 2h 10m 24s
✔️ system-test-fedora-40 SUCCESS in 1h 51m 17s
✔️ system-test-fedora-39 SUCCESS in 1h 54m 12s

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 26, 2024
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it.  At build time, it
links against undefined symbols declared through a CUDA API definition
embedded directly in the CGO code or a copy of nvml.h.  It relies upon
lazily deferring function call resolution to the point when dlopen(3) is
able to load the shared libraries at runtime, instead of doing it when
toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
  $ nm /path/to/toolbox | grep ' subid_init'

... with those from:
  $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
          U nvmlGpuInstanceGetComputeInstanceProfileInfoV
  $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
          U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
  $ toolbox
  toolbox: symbol lookup error: toolbox: undefined symbol:
      nvmlGpuInstanceGetComputeInstanceProfileInfoV

Lately, the CI has frequently been timing out on Fedora Rawhide nodes.
So, increase the timeout from 2 hours 10 minutes to 2 hours 30 minutes
to avoid that.

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
    NVIDIA/go-nvml#18
    NVIDIA/nvidia-container-toolkit#49

[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

containers#1548
@debarshiray debarshiray force-pushed the wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols branch from c46765f to fc31561 on September 26, 2024 10:52
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/2504608d71a74df9961618fbe5a2c9c2

✔️ unit-test SUCCESS in 5m 34s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 12s
✔️ unit-test-restricted SUCCESS in 5m 34s
system-test-fedora-rawhide TIMED_OUT in 2h 30m 23s
✔️ system-test-fedora-40 SUCCESS in 1h 56m 13s
system-test-fedora-39 TIMED_OUT in 2h 00m 25s

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 26, 2024
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it.  At build time, it
links against undefined symbols declared through a CUDA API definition
embedded directly in the CGO code or a copy of nvml.h.  It relies upon
lazily deferring function call resolution to the point when dlopen(3) is
able to load the shared libraries at runtime, instead of doing it when
toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
  $ nm /path/to/toolbox | grep ' subid_init'

... with those from:
  $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
          U nvmlGpuInstanceGetComputeInstanceProfileInfoV
  $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
          U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
  $ toolbox
  toolbox: symbol lookup error: toolbox: undefined symbol:
      nvmlGpuInstanceGetComputeInstanceProfileInfoV

With the recent expansion of the test suite, it's necessary to increase
the timeout for the Fedora nodes to prevent the CI from timing out.

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
    NVIDIA/go-nvml#18
    NVIDIA/nvidia-container-toolkit#49

[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

containers#1548
@debarshiray debarshiray force-pushed the wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols branch from fc31561 to f46c905 on September 26, 2024 13:38
@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/4ba48eab7d2949fca035c538a1c2c79a

✔️ unit-test SUCCESS in 5m 36s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 47s
✔️ unit-test-restricted SUCCESS in 5m 42s
system-test-fedora-rawhide TIMED_OUT in 3h 00m 19s
✔️ system-test-fedora-40 SUCCESS in 1h 53m 37s
✔️ system-test-fedora-39 SUCCESS in 1h 57m 57s

debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 26, 2024
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it.  At build time, it
links against undefined symbols declared through a CUDA API definition
embedded directly in the CGO code or a copy of nvml.h.  It relies upon
lazily deferring function call resolution to the point when dlopen(3) is
able to load the shared libraries at runtime, instead of doing it when
toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
  $ nm /path/to/toolbox | grep ' subid_init'

... with those from:
  $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
          U nvmlGpuInstanceGetComputeInstanceProfileInfoV
  $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
          U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
  $ toolbox
  toolbox: symbol lookup error: toolbox: undefined symbol:
      nvmlGpuInstanceGetComputeInstanceProfileInfoV

With the recent expansion of the test suite, it's necessary to increase
the timeout for the Fedora nodes to prevent the CI from timing out.

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
    NVIDIA/go-nvml#18
    NVIDIA/nvidia-container-toolkit#49

[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

containers#1548
@debarshiray debarshiray force-pushed the wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols branch from f46c905 to 504ab8d on September 26, 2024 16:51
@softwarefactory-project-zuul

Zuul encountered a syntax error while parsing its
configuration in the repo containers/toolbox on branch main. The
problem was:

The job "system-test-fedora-40" exceeds tenant max-job-timeout 10800.

The problem appears in the "system-test-fedora-40" job stanza:

  - job:
      name: system-test-fedora-40
      description: Run Toolbx's system tests in Fedora 40
      timeout: 12600
      nodeset:
        nodes:
          - name: fedora-40
            label: cloud-fedora-40
      pre-run: playbooks/setup-env.yaml
      ...

in "containers/toolbox/.zuul.yaml@main", line 62
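One way to stay within the tenant limit reported in this error is to cap the job timeout at the maximum. This is only a sketch of the relevant .zuul.yaml fragment; the approach that ultimately resolved the timeouts was splitting the tests across nodes, as the later commits describe:

```yaml
- job:
    name: system-test-fedora-40
    description: Run Toolbx's system tests in Fedora 40
    timeout: 10800  # tenant max-job-timeout; anything larger is a config error
    ...
```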

The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it.  At build time, it
links against undefined symbols declared through a CUDA API definition
embedded directly in the CGO code or a copy of nvml.h.  It relies upon
lazily deferring function call resolution to the point when dlopen(3) is
able to load the shared libraries at runtime, instead of doing it when
toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
  $ nm /path/to/toolbox | grep ' subid_init'

... with those from:
  $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
          U nvmlGpuInstanceGetComputeInstanceProfileInfoV
  $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
          U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
  $ toolbox
  toolbox: symbol lookup error: toolbox: undefined symbol:
      nvmlGpuInstanceGetComputeInstanceProfileInfoV

With the recent expansion of the test suite, it's necessary to increase
the timeout for the Fedora nodes to prevent the CI from timing out.

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
    NVIDIA/go-nvml#18
    NVIDIA/nvidia-container-toolkit#49

[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

containers#1548
The previous commit explains how the NVIDIA Container Toolkit is
sensitive to some linker flags.  Therefore, use the same linker flags
that are used by NVIDIA Container Toolkit to build the nvidia-cdi-hook,
nvidia-ctk, etc. binaries, because they use the same Go APIs that
toolbox(1) does [1].  It's better to use the same build configuration to
prevent subtle bugs from creeping in.

[1] NVIDIA Container Toolkit commit 772cf77dcc2347ce
    NVIDIA/nvidia-container-toolkit@772cf77dcc2347ce
    NVIDIA/nvidia-container-toolkit#333

containers#1548
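The matching build-wrapper invocation can be sketched roughly as follows. The package path and the exact flag values here are placeholders inferred from the branch name and the linked NVIDIA Container Toolkit commit, not quoted from either project's build scripts:

```
# Placeholder sketch: flag values and the package path are assumptions.
go build \
    -ldflags '-extldflags "-Wl,-z,lazy -Wl,--export-dynamic -Wl,--unresolved-symbols=ignore-in-object-files"' \
    -o toolbox ./src
```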
@debarshiray debarshiray force-pushed the wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols branch from 504ab8d to 66280a6 on September 26, 2024 16:54
@debarshiray
Member Author

We will need to do something to cut down the amount of time it takes to go through all the tests. However, let's not block this pull request for that.

@softwarefactory-project-zuul

Build failed.
https://softwarefactory-project.io/zuul/t/local/buildset/7131559f909b44beaa8fc94624851dfb

✔️ unit-test SUCCESS in 5m 36s
✔️ unit-test-migration-path-for-coreos-toolbox SUCCESS in 3m 37s
✔️ unit-test-restricted SUCCESS in 5m 32s
system-test-fedora-rawhide TIMED_OUT in 3h 00m 21s
✔️ system-test-fedora-40 SUCCESS in 1h 55m 12s
✔️ system-test-fedora-39 SUCCESS in 2h 01m 59s

@debarshiray debarshiray merged commit 66280a6 into containers:main Sep 26, 2024
@debarshiray debarshiray deleted the wip/rishi/src-go-build-wrapper-extldflags-export-dynamic-unresolved-symbols branch September 26, 2024 19:58
@debarshiray debarshiray changed the title from "[WIP] build: Clarify which linker flags are supported and use the same ones as NVIDIA Container Toolkit" to "build: Clarify which linker flags are supported and use the same ones as NVIDIA Container Toolkit" on Sep 26, 2024
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
The test suite has expanded to 415 system tests.  These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate.  This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2].  So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes.  This has been implemented by splitting
the system tests into two groups, which are run separately by different
nodes.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3].  Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0.  So, there's nothing to
worry about.

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    containers@83f28c52e47c2d44
    containers#1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

containers#1551
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
The test suite has expanded to 415 system tests.  These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate.  This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2].  So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes.  This has been implemented by splitting
the system tests into two groups, which are run separately by different
nodes.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3].  Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0.  So, there's nothing to
worry about.

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite(), teardown_suite() and the tests themselves [4].
Therefore, the TOOLBX_TEST_SYSTEM_TAGS environment variable was used to
optimize the contents of setup_suite().

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    containers@83f28c52e47c2d44
    containers#1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

[4] bats-core/bats-core#1006

containers#1551
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
The test suite has expanded to 415 system tests.  These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate.  This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2].  So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes.  This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.

First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers.  The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.

The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled.  Therefore, this group of tests was
further split for Rawhide nodes by the Toolbx images they use.  Apart
from reducing the number of tests in each group, this should also reduce
the amount of time spent in downloading the images.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3].  Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0.  So, there's nothing to
worry about.

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4].  Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    containers@83f28c52e47c2d44
    containers#1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

[4] bats-core/bats-core#1006

containers#1551
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Sep 27, 2024
The test suite has expanded to 415 system tests.  These tests can be
very I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate.  This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours or 10800 seconds for jobs [1], and this is what Software
Factory uses [2].  So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes.  This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.

First, the tests were grouped into those that test commands and options
accepted by the toolbox(1) binary, and those that test the runtime
environment within the Toolbx containers.  The first group has more
tests, but runs faster, because many of them test error handling and
don't do much I/O.

The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled.  Therefore, this group of tests was
further split for Rawhide nodes by the Toolbx images they use.  Apart
from reducing the number of tests in each group, this should also reduce
the amount of time spent in downloading the images.

The split has been implemented with Bats' tagging system that is
available from Bats 1.8.0 [3].  Fortunately, commit 87eaeea
already added a dependency on Bats >= 1.10.0.  So, there's nothing to
worry about.

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4].  Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html

[2] Commit 83f28c5
    containers@83f28c52e47c2d44
    containers#1548

[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html

[4] bats-core/bats-core#1006

containers#1551
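The TOOLBX_TEST_SYSTEM_TAGS workaround described in these commit messages can be sketched roughly like this. The tag names, the caching decisions, and the paths shown are hypothetical assumptions, not the actual test suite code:

```shell
# Illustrative sketch (not the actual test suite code): the variable
# tells setup_suite() which test group is about to run, because Bats
# doesn't expose the active --filter-tags to setup_suite() itself.
setup_suite() {
    case "${TOOLBX_TEST_SYSTEM_TAGS:-}" in
        commands-options)
            # Error-handling tests mostly need the default image only.
            echo "caching default image"
            ;;
        *)
            # Runtime environment tests exercise every Toolbx image.
            echo "caching all images"
            ;;
    esac
}

# A CI job would export the variable to match its Bats invocation, e.g.:
#   TOOLBX_TEST_SYSTEM_TAGS=commands-options \
#       bats --filter-tags commands-options test/system
TOOLBX_TEST_SYSTEM_TAGS=commands-options setup_suite
```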
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Jan 26, 2026
This keeps the timeout for the Fedora nodes synchronized with the main
branch.

containers#1548
containers#1741
(cherry picked from commit 83f28c5)
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Jan 27, 2026
This keeps the timeout for the Fedora nodes synchronized with the main
branch.

containers#1548
containers#1741
(backported from commit 83f28c5)
debarshiray added a commit to debarshiray/toolbox that referenced this pull request Feb 2, 2026
This keeps the timeout for the Fedora nodes synchronized with the main
branch.

containers#1548
containers#1741
containers#1750
(backported from commit 83f28c5)
(cherry picked from commit c09ef38)