Ensure that IPC sockets are not mounted read-only#1593

Merged
elezar merged 1 commit into NVIDIA:main from faganihajizada:fix/remove-ro-from-ipc-mounts
Jan 20, 2026

@faganihajizada
Contributor

Problem

CDI spec generation mounts IPC sockets (nvidia-persistenced, nvidia-fabricmanager ...) with the ro (read-only) mount option. This breaks nested container runtimes like enroot/pyxis that need to bind-mount these sockets into containers.

Running a Slurm job with enroot/pyxis on K8s fails with:

pyxis: imported docker image: nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04
error: pyxis: container start failed with error code: 1
error: pyxis: printing enroot log file:
error: pyxis:     nvidia-container-cli: mount error: mount operation failed: 
                  /tmp/enroot/data/user-0/pyxis_16.0/run/nvidia-persistenced/socket: operation not permitted
error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

Root Cause

The ro option is inherited from the default mount options in mounts.go, but IPC sockets should not be read-only. This is also inconsistent with libnvidia-container, which does not use MS_RDONLY for IPC mounts (reference).

Fix

Define IPC-specific mount options in ipc.go that exclude ro, matching libnvidia-container behavior:

var ipcMountOptions = []string{
    "nosuid",
    "nodev",
    "rbind",
    "rprivate",
    "noexec",
}
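
For illustration only, the relationship between the default options and the IPC-specific list can be sketched as a filter that drops `ro` and appends `noexec`. The contents of `defaultMountOptions` and the `withoutOption` helper are assumptions made for this sketch; the actual change defines `ipcMountOptions` as the literal shown above.

```go
package main

import "fmt"

// defaultMountOptions is an ASSUMED stand-in for the default options in
// mounts.go; the real list may differ.
var defaultMountOptions = []string{"ro", "nosuid", "nodev", "rbind", "rprivate"}

// withoutOption returns a copy of opts with every occurrence of drop removed.
// Hypothetical helper, not part of the toolkit.
func withoutOption(opts []string, drop string) []string {
	out := make([]string, 0, len(opts))
	for _, o := range opts {
		if o != drop {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	// Drop "ro" and add "noexec" to arrive at the IPC-specific list.
	ipc := append(withoutOption(defaultMountOptions, "ro"), "noexec")
	fmt.Println(ipc)
}
```

A literal list, as used in the fix, avoids coupling the IPC options to later changes in the defaults.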

Testing

  • Unit test updated and passing
  • Manually verified on AWS EKS with host-installed NVIDIA drivers
  • Confirmed enroot container starts successfully after fix
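
A check along the lines of the manual verification could look like the following sketch. It assumes the generated CDI spec is available as JSON and only inspects the top-level `containerEdits.mounts` entries; the `/socket` suffix heuristic, the `readOnlySockets` helper, and the sample spec are hypothetical and not taken from the toolkit or its unit tests.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// mount mirrors the fields of a CDI mount entry that matter here.
type mount struct {
	HostPath      string   `json:"hostPath"`
	ContainerPath string   `json:"containerPath"`
	Options       []string `json:"options"`
}

// spec covers only the top-level container edits of a CDI spec.
type spec struct {
	ContainerEdits struct {
		Mounts []mount `json:"mounts"`
	} `json:"containerEdits"`
}

// readOnlySockets returns the container paths of mounts that end in
// "/socket" (an assumed heuristic) and still carry the "ro" option.
func readOnlySockets(data []byte) ([]string, error) {
	var s spec
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	var bad []string
	for _, m := range s.ContainerEdits.Mounts {
		if !strings.HasSuffix(m.ContainerPath, "/socket") {
			continue
		}
		for _, o := range m.Options {
			if o == "ro" {
				bad = append(bad, m.ContainerPath)
				break
			}
		}
	}
	return bad, nil
}

// sampleSpec is a hand-written fragment for illustration only.
const sampleSpec = `{
  "containerEdits": {
    "mounts": [
      {
        "hostPath": "/run/nvidia-persistenced/socket",
        "containerPath": "/run/nvidia-persistenced/socket",
        "options": ["nosuid", "nodev", "rbind", "rprivate", "noexec"]
      }
    ]
  }
}`

func main() {
	bad, err := readOnlySockets([]byte(sampleSpec))
	if err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	fmt.Println("read-only socket mounts:", len(bad))
}
```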

Additional Context

I tested two AWS EKS clusters with identical GPU Operator versions:

  • When GPU Operator manages drivers, nvidia-persistenced runs inside the driver container. The socket is part of the overlay filesystem, and everything works fine.
  • When using host-installed drivers (e.g., AWS AMI), CDI discovers the host socket and mounts it as tmpfs (ro), which is the failing case.
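
To tell the two cases apart from inside a container, one option is to look at the per-mount options in /proc/self/mountinfo. The sketch below parses a single mountinfo line; the `parseMountinfoLine` helper and the sample line are hypothetical and only illustrate the field layout (field 5 is the mount point, field 6 the per-mount options).

```go
package main

import (
	"fmt"
	"strings"
)

// parseMountinfoLine extracts the mount point and reports whether the
// per-mount options include "ro". ok is false for malformed lines.
func parseMountinfoLine(line string) (mountPoint string, readOnly bool, ok bool) {
	fields := strings.Fields(line)
	if len(fields) < 6 {
		return "", false, false
	}
	mountPoint = fields[4]
	for _, opt := range strings.Split(fields[5], ",") {
		if opt == "ro" {
			readOnly = true
		}
	}
	return mountPoint, readOnly, true
}

func main() {
	// Illustrative line only; not captured from a real cluster.
	line := "1234 1200 0:25 /nvidia-persistenced/socket /run/nvidia-persistenced/socket ro,nosuid,nodev,noexec - tmpfs tmpfs rw,mode=755"
	mp, ro, ok := parseMountinfoLine(line)
	fmt.Println(mp, ro, ok)
}
```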

This is critical to support Slurm on K8s (https://github.com/SlinkyProject/slurm-operator) with enroot/pyxis.

IPC sockets (nvidia-persistenced, nvidia-fabricmanager, nvidia-mps)
no longer include the "ro" mount option. This matches the behavior
of libnvidia-container and allows nested container runtimes like
enroot to bind-mount these sockets.

Signed-off-by: Fagani Hajizada <fhajizada@nvidia.com>
@copy-pr-bot

copy-pr-bot bot commented Jan 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Member

@elezar elezar left a comment

Looks good. Thanks.

@elezar
Member

elezar commented Jan 20, 2026

/cherry-pick release-1.18

@elezar elezar added this to the next-patch milestone Jan 20, 2026
@elezar elezar modified the milestones: next-patch, next-minor Jan 20, 2026
@coveralls

Pull Request Test Coverage Report for Build 21167707185

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 36.837%

Totals Coverage Status
Change from base Build 21132355090: 0.0%
Covered Lines: 5257
Relevant Lines: 14271

💛 - Coveralls

@elezar
Member

elezar commented Jan 20, 2026

/ok-to-test fb15d14

Collaborator

@ArangoGutierrez ArangoGutierrez left a comment

lgtm

@elezar elezar changed the title from "Remove ro mount option from IPC sockets" to "Ensure that IPC sockets are not mounted read-only" Jan 20, 2026
@elezar elezar merged commit f499dd7 into NVIDIA:main Jan 20, 2026
16 checks passed
@github-actions

🤖 Backport PR created for release-1.18: #1594
