Commit 83f28c5

build: Notify distributors that the '-z now' linker flag is unsupported
The '-z now' flag, which is the opposite of '-z lazy', is unsupported as an
external linker flag [1], because of how the NVIDIA Container Toolkit stack
uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at runtime [2,3].

The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the address
of a symbol at runtime before using it. It links against undefined symbols
at build-time available through a CUDA API definition embedded directly in
the CGO code or a copy of nvml.h. It relies upon lazily deferring function
call resolution to the point when dlopen(3) is able to load the shared
libraries at runtime, instead of doing it when toolbox(1) is started.

This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
libsubid.so at runtime.

Compare the output of:
    $ nm /path/to/toolbox | grep ' subid_init'

... with those from:
    $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
                     U nvmlGpuInstanceGetComputeInstanceProfileInfoV
    $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
                     U nvmlDeviceGetAccountingPids

Using '-z now' as an external linker flag forces the dynamic linker to
resolve all symbols when toolbox(1) is started, and leads to:
    $ toolbox
    toolbox: symbol lookup error: toolbox: undefined symbol:
        nvmlGpuInstanceGetComputeInstanceProfileInfoV

With the recent expansion of the test suite, it's necessary to increase the
timeout for the Fedora nodes to prevent the CI from timing out.

Fallout from 6e848b2

[1] NVIDIA Container Toolkit commit 1407ace94ab7c150
    NVIDIA/nvidia-container-toolkit@1407ace94ab7c150
    NVIDIA/go-nvml#18
    NVIDIA/nvidia-container-toolkit#49

[2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda

[3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
    https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml

#1548
1 parent dd23baa commit 83f28c5
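
The nm(1) comparison described in the commit message can be scripted. What follows is only a minimal sketch and not part of this commit: is_undefined_symbol is a hypothetical helper name, and it assumes the default nm output format, in which an undefined symbol is listed without an address as "U <name>". It reads nm output on standard input and succeeds only if the named symbol is undefined.

```shell
#!/bin/sh
# Hypothetical helper, not part of Toolbx: reads `nm` output on stdin and
# succeeds only if the given symbol is listed as undefined (type 'U').
is_undefined_symbol() {
    symbol="$1"
    # Undefined symbols carry no address, so after whitespace splitting the
    # fields are: U <name>. Defined symbols have three fields and never match.
    awk -v sym="$symbol" '
        $1 == "U" && $2 == sym { found = 1 }
        END { exit !found }
    '
}
```

It could be used, for example, as: nm /path/to/toolbox | is_undefined_symbol nvmlDeviceGetAccountingPids && echo "left undefined at build-time". A symbol that is looked up with dlsym(3), such as subid_init, does not appear in the nm output at all, so the helper fails for it.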

2 files changed: 45 additions and 3 deletions
.zuul.yaml

Lines changed: 3 additions & 3 deletions
@@ -51,7 +51,7 @@
 - job:
     name: system-test-fedora-rawhide
     description: Run Toolbx's system tests in Fedora Rawhide
-    timeout: 7800
+    timeout: 10800
     nodeset:
       nodes:
         - name: fedora-rawhide
@@ -62,7 +62,7 @@
 - job:
     name: system-test-fedora-40
     description: Run Toolbx's system tests in Fedora 40
-    timeout: 7200
+    timeout: 9000
     nodeset:
       nodes:
         - name: fedora-40
@@ -73,7 +73,7 @@
 - job:
     name: system-test-fedora-39
     description: Run Toolbx's system tests in Fedora 39
-    timeout: 7200
+    timeout: 9000
     nodeset:
       nodes:
         - name: fedora-39

src/go-build-wrapper

Lines changed: 42 additions & 0 deletions
@@ -70,6 +70,48 @@ fi

 dynamic_linker="/run/host$dynamic_linker_canonical_dirname/$dynamic_linker_basename"

+# Note for distributors:
+#
+# The '-z now' flag, which is the opposite of '-z lazy', is unsupported as an
+# external linker flag [1], because of how the NVIDIA Container Toolkit stack
+# uses dlopen(3) to load libcuda.so.1 and libnvidia-ml.so.1 at runtime [2,3].
+#
+# The NVIDIA Container Toolkit stack doesn't use dlsym(3) to obtain the address
+# of a symbol at runtime before using it. It links against undefined symbols
+# at build-time available through a CUDA API definition embedded directly in
+# the CGO code or a copy of nvml.h. It relies upon lazily deferring function
+# call resolution to the point when dlopen(3) is able to load the shared
+# libraries at runtime, instead of doing it when toolbox(1) is started.
+#
+# This is unlike how Toolbx itself uses dlopen(3) and dlsym(3) to load
+# libsubid.so at runtime.
+#
+# Compare the output of:
+#     $ nm /path/to/toolbox | grep ' subid_init'
+#
+# ... with those from:
+#     $ nm /path/to/toolbox | grep ' nvmlGpuInstanceGetComputeInstanceProfileInfoV'
+#                      U nvmlGpuInstanceGetComputeInstanceProfileInfoV
+#     $ nm /path/to/toolbox | grep ' nvmlDeviceGetAccountingPids'
+#                      U nvmlDeviceGetAccountingPids
+#
+# Using '-z now' as an external linker flag forces the dynamic linker to
+# resolve all symbols when toolbox(1) is started, and leads to:
+#     $ toolbox
+#     toolbox: symbol lookup error: toolbox: undefined symbol:
+#         nvmlGpuInstanceGetComputeInstanceProfileInfoV
+#
+# [1] NVIDIA Container Toolkit commit 1407ace94ab7c150
+#     https://github.com/NVIDIA/nvidia-container-toolkit/commit/1407ace94ab7c150
+#     https://github.com/NVIDIA/go-nvml/issues/18
+#     https://github.com/NVIDIA/nvidia-container-toolkit/issues/49
+#
+# [2] https://github.com/NVIDIA/nvidia-container-toolkit/tree/main/internal/cuda
+#
+# [3] https://github.com/NVIDIA/go-nvml/blob/main/README.md
+#     https://github.com/NVIDIA/go-nvml/tree/main/pkg/dl
+#     https://github.com/NVIDIA/go-nvml/tree/main/pkg/nvml
+
 # shellcheck disable=SC2086
 go build \
         $tags \
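
Distributors who want to confirm that their build flags did not inject '-z now' could inspect the dynamic section of the resulting binary. What follows is only a minimal sketch and not part of this commit: has_bind_now is a hypothetical helper name, and it assumes the usual readelf(1) rendering, where immediate binding shows up as a BIND_NOW entry, as BIND_NOW under FLAGS, or as NOW under FLAGS_1. It reads the output of 'readelf -d' on standard input.

```shell
#!/bin/sh
# Hypothetical helper, not part of Toolbx: reads `readelf -d` output on stdin
# and succeeds only if the dynamic section requests immediate binding.
has_bind_now() {
    # Assumed renderings of '-z now' in `readelf -d` output:
    #   (BIND_NOW)
    #   (FLAGS)    BIND_NOW
    #   (FLAGS_1)  Flags: NOW ...
    grep -E -q '\(BIND_NOW\)|\(FLAGS\).*BIND_NOW|\(FLAGS_1\).*[[:space:]]NOW([[:space:]]|$)'
}
```

It could be used, for example, as: readelf -d /path/to/toolbox | has_bind_now && echo "warning: toolbox was linked with -z now" >&2. Reading from standard input keeps the helper independent of where the binary lives and makes it easy to exercise with canned output.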
