With most recent Ubuntu packages upgrade, enroot container load fails #232

@itzsimpl

We have a DGX H100 system running Slurm with the latest Enroot/Pyxis. Since the most recent upgrade of the NVIDIA kernel, nvidia-container-toolkit, and other Ubuntu packages, Enroot fails to load containers.

Log from apt upgrade

Start-Date: 2025-05-17  11:06:31
Commandline: apt upgrade -y
Requested-By: ubuntu (1000)
Install: linux-image-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-tools-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-modules-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-nvidia-tools-5.15.0-1078:amd64 (5.15.0-1078.79, automatic), linux-modules-extra-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-headers-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-modules-nvidia-fs-5.15.0-1078-nvidia:amd64 (5.15.0-1078.79, automatic), linux-nvidia-headers-5.15.0-1078:amd64 (5.15.0-1078.79, automatic)
Upgrade: linux-tools-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), openjdk-11-jre:amd64 (11.0.26+4-1ubuntu1~22.04, 11.0.27+6~us1-0ubuntu1~22.04), linux-image-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), python2.7-minimal:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), linux-tools-common:amd64 (5.15.0-139.149, 5.15.0-140.150), openjdk-11-jre-headless:amd64 (11.0.26+4-1ubuntu1~22.04, 11.0.27+6~us1-0ubuntu1~22.04), libldap-common:amd64 (2.5.18+dfsg-0ubuntu0.22.04.3, 2.5.19+dfsg-0ubuntu0.22.04.1), libnvidia-container1:amd64 (1.17.6-1, 1.17.7-1), libldap-2.5-0:amd64 (2.5.18+dfsg-0ubuntu0.22.04.3, 2.5.19+dfsg-0ubuntu0.22.04.1), linux-crashdump:amd64 (5.15.0.139.135, 5.15.0.140.135), linux-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), linux-headers-nvidia:amd64 (5.15.0.1077.77, 5.15.0.1078.78), open-vm-tools:amd64 (2:12.3.5-3~ubuntu0.22.04.1, 2:12.3.5-3~ubuntu0.22.04.2), libnvidia-container-tools:amd64 (1.17.6-1, 1.17.7-1), nvidia-container-toolkit:amd64 (1.17.6-1, 1.17.7-1), nvidia-container-toolkit-base:amd64 (1.17.6-1, 1.17.7-1), python2.7:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), libpython2.7-minimal:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), libpython2.7-stdlib:amd64 (2.7.18-13ubuntu1.5+esm3, 2.7.18-13ubuntu1.5+esm5), linux-libc-dev:amd64 (5.15.0-139.149, 5.15.0-140.150)
End-Date: 2025-05-17  11:09:05

Toolkit version after the upgrade

# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.17.7
commit: bae3e7842ebe26812d8bd6a9be6a14a83dc91d8f
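
The Upgrade: line above is long, so a small filter can make the container-stack changes easier to spot. This is only a sketch: `apt_upgraded` is a hypothetical helper name, and it assumes the entry format shown in the log above (`name:arch (old, new)` separated by commas).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read apt history text on stdin and print
# "package old -> new" for upgraded packages whose name matches $1.
apt_upgraded() {
  local pattern="$1"
  awk -v pat="$pattern" '
    /^Upgrade: / {
      sub(/^Upgrade: /, "")
      n = split($0, entries, /\), /)      # one entry per upgraded package
      for (i = 1; i <= n; i++) {
        e = entries[i]
        sub(/\)$/, "", e)                 # the last entry keeps its ")"
        split(e, parts, / \(/)            # parts[1]=name:arch, parts[2]="old, new"
        split(parts[1], na, ":")          # na[1]=package name
        split(parts[2], ver, ", ")        # ver[1]=old, ver[2]=new
        if (na[1] ~ pat) print na[1], ver[1], "->", ver[2]
      }
    }'
}

# Usage (the path is the standard apt history location):
#   apt_upgraded 'nvidia-container' < /var/log/apt/history.log
```

Applied to the log above, this should list the four container packages that moved from 1.17.6-1 to 1.17.7-1.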

The error is

# srun -c8 --mem 16G --gpus 1 --container-image nvcr.io/nvidia/nemo:25.04 --pty bash
pyxis: importing docker image: nvcr.io/nvidia/nemo:25.04
pyxis: imported docker image: nvcr.io/nvidia/nemo:25.04
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: operation not permitted
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: gn01: task 0: Exited with exit code 1
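
The `bpf_prog_query(BPF_CGROUP_DEVICE)` call in the error relates to the eBPF device filters used with cgroup v2, so it may be worth confirming which cgroup hierarchy the compute node mounts. A minimal sketch, assuming the common convention that `stat -fc %T /sys/fs/cgroup` prints `cgroup2fs` on a unified (v2) hierarchy and `tmpfs` on v1; `cgroup_mode` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the filesystem type of /sys/fs/cgroup
# to a cgroup hierarchy version.
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;       # unified hierarchy
    tmpfs)     echo "v1" ;;       # legacy (or hybrid) hierarchy
    *)         echo "unknown" ;;
  esac
}

# Report the hierarchy on the current node; falls back to "missing"
# if stat cannot read the mount point.
fs_type="$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo missing)"
echo "cgroup hierarchy: $(cgroup_mode "$fs_type") ($fs_type)"
```

Running it inside the same kind of Slurm job step that fails (without `--container-image`) shows what the job itself sees, which may differ from a login shell.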

Adding no_cgroups = true to /etc/nvidia-container-runtime/config.toml, as suggested in https://docs.nvidia.com/ai-enterprise/deployment/cpu-only/latest/runtimes.html#rootless-container-setup-optional and NVIDIA/libnvidia-container#154, does not help.
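
To double-check that the setting is actually present and uncommented, something like the following could be used. It is a sketch under assumptions: `cgroups_setting` is a hypothetical helper, and it accepts both spellings because, as far as I know, the toolkit's shipped config.toml spells the key `no-cgroups` (with a hyphen) under `[nvidia-container-cli]`. It is also possible that the Enroot hook calls `nvidia-container-cli` directly and does not consult this file at all.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print any active (uncommented) no-cgroups /
# no_cgroups assignment found in the given TOML file.
cgroups_setting() {
  local file="$1"
  [ -r "$file" ] || { echo "cannot read $file"; return 1; }
  grep -E '^[[:space:]]*no[-_]cgroups[[:space:]]*=' "$file" \
    || echo "not set (or commented out)"
}

# Usage:
#   cgroups_setting /etc/nvidia-container-runtime/config.toml
```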
