Redirect log message to stderr in nvidia runtime wrapper script #1400
Merged
cdesiniotis merged 1 commit into NVIDIA:main (Nov 3, 2025)
Conversation
This change is required to make our nvidia runtime wrapper compliant with the OCI runtime spec. All OCI-compliant runtimes must support the operations documented at https://github.com/opencontainers/runtime-spec/blob/v1.2.1/runtime.md#operations. Before this change, our nvidia runtime wrapper was not producing the expected output when the query state operation (`state <container-id>`) was invoked AND the nvidia kernel modules happened to not be loaded. In this case, we were emitting an extra log message, which caused the stdout of this command to not adhere to the schema defined in the OCI runtime spec. Redirecting the log message to stderr makes us compliant.

This issue was discovered when deploying GPU Operator 25.10.0 on nodes using cri-o. GPU Operator 25.10.0 is the first release that installs nvidia runtime handlers with cri-o by default, as opposed to installing an OCI hook file. When performing a GPU driver upgrade, pods in the gpu-operator namespace would be in the `Init:RunContainerError` state for several minutes until the new driver finished installing -- note that no nvidia driver modules are loaded during this span. When inspecting the cri-o logs, we observed the following error message:

```
level=warning msg="Error updating the container status \"16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028\": failed to decode container status for 16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028: skipThreeBytes: expect ull, error found in #2 byte of ...|nvidia drive|..., bigger context ...|nvidia driver modules are not yet loaded, invoking /|..." id=a4b48041-edc4-48c2-8d75-4ad03cb3d8e1 name=/runtime.v1.RuntimeService/CreateContainer
```

This error message indicates cri-o failed to get the status of the container because it could not decode the JSON returned by the runtime handler.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
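The fix boils down to pointing the wrapper's diagnostics at stderr so stdout carries only the JSON the container runtime expects. The sketch below is illustrative, not the actual wrapper source: the `log` helper, the message text, and the JSON payload are assumptions used to show the `>&2` redirection pattern.

```shell
#!/bin/sh
# Hypothetical sketch of the wrapper's logging fix; names and messages
# are illustrative, not the real nvidia runtime wrapper code.

log() {
    # Send diagnostics to stderr so stdout stays reserved for the
    # OCI runtime's JSON output (e.g. the `state <container-id>` reply).
    echo "$*" >&2
}

log "nvidia driver modules are not yet loaded, invoking low-level runtime"

# Only well-formed JSON is written to stdout, so callers such as cri-o
# can decode the container state without tripping over log text.
echo '{"ociVersion":"1.2.1","id":"example","status":"created"}'
```

Without the `>&2`, the log line and the JSON would be interleaved on stdout, which is exactly the malformed payload cri-o failed to decode in the error above.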
Force-pushed from 6f7fb59 to 61f9bde
Member:
/cherry-pick release-1.18

tariq1890 approved these changes on Nov 1, 2025

🤖 Backport PR created for