build: Clarify which linker flags are supported and use the same ones as NVIDIA Container Toolkit #1548
Conversation
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as
an external linker flag [1], because of how the NVIDIA Container Toolkit
stack uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at
runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the
address of a symbol at runtime before using it. It links against
undefined symbols whose declarations are available at build-time through
a CUDA API definition embedded directly in the CGO code or a copy of
nvml.h. It relies upon lazily deferring function call resolution to the
point when dlopen(3) is able to load the shared libraries at runtime,
instead of doing it when toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:

$ nm /path/to/toolbox | grep ' subid_init'

... with that from:

$ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
U nvmlGpuInstanceGetComputeInstanceProfileInfoV
$ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:

$ toolbox
toolbox: symbol lookup error: toolbox: undefined symbol:
nvmlGpuInstanceGetComputeInstanceProfileInfoV

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
NVIDIA/go-nvml#18
NVIDIA/nvidia-container-toolkit#49
[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda
[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

containers#1548
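The contrast between the two loading styles can be reproduced with a toy program. Below is a minimal sketch, assuming a glibc system with a C compiler, using libm's cos(3) in place of the NVIDIA symbols; all names here are illustrative, not taken from the Toolbx sources:

```shell
# Resolve a symbol at runtime with dlopen(3)/dlsym(3), the way Toolbx
# loads libsubid.so.  Because the symbol is looked up explicitly, it
# never appears as an undefined (U) symbol in the output of nm(1), so
# '-z now' cannot break it.
cat > demo.c <<'EOF'
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load the library at runtime; there is no link-time dependency. */
    void *handle = dlopen("libm.so.6", RTLD_LAZY);
    if (handle == NULL)
        return 1;

    /* Obtain the address of the symbol explicitly. */
    double (*cos_fn)(double) = (double (*)(double))dlsym(handle, "cos");
    if (cos_fn == NULL)
        return 1;

    printf("%.0f\n", cos_fn(0.0));
    dlclose(handle);
    return 0;
}
EOF
cc -o demo demo.c -ldl
./demo
nm demo | grep ' U cos$' || echo "no undefined cos"
```

Had demo.c instead declared `extern double cos(double);` and called it directly without linking against libm, the symbol would show up as `U cos` in nm(1), and the binary would only work if lazy binding deferred resolution until a dlopen(3) call had mapped the library, which is the situation the NVIDIA bindings are in.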
Build failed. ✔️ unit-test SUCCESS in 5m 37s
recheck
Build failed. ✔️ unit-test SUCCESS in 5m 36s
Currently, the CI has been frequently timing out on Fedora Rawhide
nodes. So, increase the timeout from 2 hours 10 minutes to 2 hours 30
minutes to avoid that.
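In Zuul, job timeouts are expressed in seconds, so such a bump is a one-line change. A hypothetical .zuul.yaml sketch, with an illustrative job name, might look like:

```yaml
# 2 hours 10 minutes = 7800 seconds; 2 hours 30 minutes = 9000 seconds.
# The value must stay below the tenant's max-job-timeout setting.
- job:
    name: system-test-fedora-rawhide
    timeout: 9000  # was 7800
```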
Build failed. ✔️ unit-test SUCCESS in 5m 34s
Build failed. ✔️ unit-test SUCCESS in 5m 36s
Zuul encountered a syntax error while parsing its configuration. The job "system-test-fedora-40" exceeds tenant max-job-timeout 10800. The problem appears in the "system-test-fedora-40" job stanza: job: in "containers/toolbox/.zuul.yaml@main", line 62
The previous commit explains how the NVIDIA Container Toolkit is
sensitive to some linker flags. Therefore, use the same linker flags
that the NVIDIA Container Toolkit uses to build its nvidia-cdi-hook,
nvidia-ctk, etc. binaries, because they use the same Go APIs that
toolbox(1) does [1]. It's better to use the same build configuration to
prevent subtle bugs from creeping in.
[1] NVIDIA Container Toolkit commit 772cf77dcc2347ce
NVIDIA/nvidia-container-toolkit@772cf77dcc2347ce
NVIDIA/nvidia-container-toolkit#333
containers#1548
We will need to do something to cut down the amount of time it takes to go through all the tests. However, let's not block this pull request for that.
Build failed. ✔️ unit-test SUCCESS in 5m 36s
The test suite has expanded to 415 system tests. These tests can be very
I/O intensive, because many of them copy OCI images from the test
suite's image cache directory to its local containers/storage store,
create containers, and then delete everything to run the next test with
a clean slate. This makes the system tests slow.

Unfortunately, Zuul's max-job-timeout setting defaults to an upper limit
of 3 hours, or 10800 seconds, for jobs [1], and this is what Software
Factory uses [2]. So, there comes a point beyond which the CI can't be
prevented from timing out by increasing the timeout.

One way of scaling past this maximum time limit is to run the tests in
parallel across multiple nodes. This has been implemented by splitting
the system tests into different groups, which are run separately by
different nodes.

First, the tests were grouped into those that test the commands and
options accepted by the toolbox(1) binary, and those that test the
runtime environment within the Toolbx containers. The first group has
more tests, but runs faster, because many of them test error handling
and don't do much I/O.

The runtime environment tests take especially long on Fedora Rawhide
nodes, which are often slower than the stable Fedora nodes, possibly
because Rawhide uses Linux kernels that are built with debugging
enabled. Therefore, this group of tests was further split for Rawhide
nodes by the Toolbx images they use. Apart from reducing the number of
tests in each group, this should also reduce the amount of time spent
downloading the images.

The split has been implemented with Bats' tagging system, which is
available since Bats 1.8.0 [3]. Fortunately, commit 87eaeea already
added a dependency on Bats >= 1.10.0. So, there's nothing to worry
about.

At the moment, Bats doesn't expose the tags being used to run the test
suite to setup_suite() and teardown_suite() [4]. Therefore, the
TOOLBX_TEST_SYSTEM_TAGS environment variable was used to optimize the
contents of setup_suite().

[1] https://zuul-ci.org/docs/zuul/latest/tenants.html
[2] Commit 83f28c5
containers@83f28c52e47c2d44
containers#1548
[3] https://bats-core.readthedocs.io/en/stable/writing-tests.html
[4] bats-core/bats-core#1006

containers#1551
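The tag-based split described above can be sketched as follows. The tag names and messages are illustrative, not the ones used by the Toolbx test suite; only the TOOLBX_TEST_SYSTEM_TAGS variable name comes from the commit message:

```shell
# A test file opts into a group with a file-level tag comment
# (Bats >= 1.8.0):
#
#     # bats file_tags=commands-options
#
# ... and a node then runs only its group with:
#
#     bats --filter-tags commands-options ./test/system
#
# Since Bats doesn't expose the active tags to setup_suite(), an
# environment variable can carry them, letting setup_suite() skip
# work that the selected group will never need.
setup_suite() {
    case "${TOOLBX_TEST_SYSTEM_TAGS:-}" in
        *runtime-environment*)
            echo "preparing image cache for runtime environment tests"
            ;;
        *)
            echo "preparing minimal image cache"
            ;;
    esac
}

TOOLBX_TEST_SYSTEM_TAGS=commands-options setup_suite
```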
This keeps the timeout for the Fedora nodes synchronized with the main branch. containers#1548 containers#1741 containers#1750 (backported from commit 83f28c5) (cherry picked from commit c09ef38)