Use user from OCI config per default #2
Closed
saschagrunert wants to merge 1 commit into NVIDIA:master from
Conversation
We have to use the user from the OCI configuration to have the right set of user permissions inside the container.

Signed-off-by: Sascha Grunert <sgrunert@suse.com>
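The idea can be sketched in Go as follows. This is a hypothetical illustration, not the toolkit's actual code: the struct and function names (`ociSpec`, `userFromConfig`) are made up for this sketch, though the JSON field names (`process.user.uid`/`gid`) follow the OCI runtime spec's config.json schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal subset of the OCI runtime spec's config.json. The JSON field
// names follow the opencontainers runtime-spec; the Go types are a
// hypothetical sketch rather than the toolkit's actual types.
type ociUser struct {
	UID            uint32   `json:"uid"`
	GID            uint32   `json:"gid"`
	AdditionalGids []uint32 `json:"additionalGids,omitempty"`
}

type ociProcess struct {
	User ociUser `json:"user"`
}

type ociSpec struct {
	Process *ociProcess `json:"process"`
}

// userFromConfig extracts the uid/gid the container process should run
// as, falling back to root (0/0) only when no process entry is present.
func userFromConfig(raw []byte) (uint32, uint32, error) {
	var spec ociSpec
	if err := json.Unmarshal(raw, &spec); err != nil {
		return 0, 0, err
	}
	if spec.Process == nil {
		return 0, 0, nil
	}
	return spec.Process.User.UID, spec.Process.User.GID, nil
}

func main() {
	raw := []byte(`{"process":{"user":{"uid":1000,"gid":1000}}}`)
	uid, gid, err := userFromConfig(raw)
	if err != nil {
		panic(err)
	}
	// Commands run inside the container should use this identity
	// instead of unconditionally running as root.
	fmt.Printf("uid=%d gid=%d\n", uid, gid)
}
```

Honoring these fields matters because a container whose config requests an unprivileged user must not see files or processes created as root by the runtime's tooling.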
Contributor
Hello @saschagrunert! Thanks for your contribution! All the GitHub repos are mirrors of the repos on GitLab; do you mind making your contribution here: https://gitlab.com/nvidia/container-toolkit/container-toolkit Thanks!
Author
Yes sure 👍 Thank you for the hint
cdesiniotis added a commit to cdesiniotis/nvidia-container-toolkit that referenced this pull request on Oct 30, 2025
This change is required to make our nvidia runtime wrapper compliant with the OCI runtime spec. All OCI-compliant runtimes must support the operations documented at https://github.com/opencontainers/runtime-spec/blob/v1.2.1/runtime.md#operations.

Before this change, our nvidia runtime wrapper was not producing the expected output when the query state operation (`state <container-id>`) was invoked AND the nvidia kernel modules happened to not be loaded. In this case, we were emitting an extra log message which caused the stdout of this command to not adhere to the schema defined in the OCI runtime spec. Redirecting the log message to stderr makes us compliant.

This issue was discovered when deploying GPU Operator 25.10.0 on nodes using cri-o. GPU Operator 25.10.0 is the first release that installs nvidia runtime handlers with cri-o by default, as opposed to installing an OCI hook file. When performing a GPU driver upgrade, pods in the gpu-operator namespace would be in the `Init:RunContainerError` state for several minutes until the new driver finished installing -- note that no nvidia driver modules are loaded during this span. When inspecting the cri-o logs, we observed the following error message:

```
level=warning msg="Error updating the container status \"16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028\": failed to decode container status for 16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028: skipThreeBytes: expect ull, error found in #2 byte of ...|nvidia drive|..., bigger context ...|nvidia driver modules are not yet loaded, invoking /|..." id=a4b48041-edc4-48c2-8d75-4ad03cb3d8e1 name=/runtime.v1.RuntimeService/CreateContainer
```

This error message indicates cri-o failed to get the status of the container because it could not decode the JSON returned by the runtime handler.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com>
github-actions bot pushed a commit that referenced this pull request on Nov 3, 2025
This change is required to make our nvidia runtime wrapper compliant with the OCI runtime spec. All OCI-compliant runtimes must support the operations documented at https://github.com/opencontainers/runtime-spec/blob/v1.2.1/runtime.md#operations.

Before this change, our nvidia runtime wrapper was not producing the expected output when the query state operation (`state <container-id>`) was invoked AND the nvidia kernel modules happened to not be loaded. In this case, we were emitting an extra log message which caused the stdout of this command to not adhere to the schema defined in the OCI runtime spec. Redirecting the log message to stderr makes us compliant.

This issue was discovered when deploying GPU Operator 25.10.0 on nodes using cri-o. GPU Operator 25.10.0 is the first release that installs nvidia runtime handlers with cri-o by default, as opposed to installing an OCI hook file. When performing a GPU driver upgrade, pods in the gpu-operator namespace would be in the `Init:RunContainerError` state for several minutes until the new driver finished installing -- note that no nvidia driver modules are loaded during this span. When inspecting the cri-o logs, we observed the following error message:

```
level=warning msg="Error updating the container status \"16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028\": failed to decode container status for 16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028: skipThreeBytes: expect ull, error found in #2 byte of ...|nvidia drive|..., bigger context ...|nvidia driver modules are not yet loaded, invoking /|..." id=a4b48041-edc4-48c2-8d75-4ad03cb3d8e1 name=/runtime.v1.RuntimeService/CreateContainer
```

This error message indicates cri-o failed to get the status of the container because it could not decode the JSON returned by the runtime handler.

Signed-off-by: Christopher Desiniotis <cdesiniotis@nvidia.com> (cherry picked from commit 61f9bde)