[DGX Spark/Station][Install] Docker-driver GPU patch fails /proc/comm write — sandbox creation aborts with exit 1 #3511

@zNeill

Description

On DGX Spark (spark-6087), a fresh NemoClaw v0.0.41 install with GPU
auto-detection enabled fails at the GPU proof verification step. The
Docker-driver GPU patch flow (create sandbox → stop → recreate with
--gpus all) completes the recreation, but the subsequent
/proc/$$/task/$$/comm write probe inside the sandbox container returns
"Permission denied". The GPU proof check treats this as fatal and throws,
aborting the entire install with exit code 1. The nvidia-smi proof passes;
only the /proc/comm probe fails.
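
For readers unfamiliar with the probe, here is a minimal sketch of what a comm-write check of this kind does. It assumes the check is a plain shell redirection into the calling shell's own comm file; the actual NemoClaw implementation is not shown in this issue, and comm_path and probe_comm_write are hypothetical names. On Ubuntu, /bin/sh is dash, which reports a failed redirection as "cannot create ...: Permission denied" with status 2, which appears consistent with the error below.

```shell
#!/bin/sh
# Hypothetical sketch of a /proc comm-write probe, modeled on the
# "/proc/$$/task/$$/comm write" check described in this issue.
# comm_path and probe_comm_write are illustrative names, not NemoClaw code.

# Build the per-thread comm path for a PID. The main (leader) thread's
# TID equals the PID, as in the logged path /proc/182/task/182/comm.
comm_path() {
    printf '/proc/%s/task/%s/comm' "$1" "$1"
}

# Rename the current shell through /proc, then restore the old name.
# Call this directly, not inside $(...): $$ always expands to the
# top-level shell's PID, and the kernel rejects comm writes coming
# from a different thread group.
probe_comm_write() {
    path=$(comm_path "$$")
    # Save the current name so the probe leaves no lasting change.
    old=$(cat "$path") || return 1
    # printf avoids writing a trailing newline into the task name.
    if ! printf 'probe' > "$path"; then
        echo "comm write failed: $path" >&2
        return 1
    fi
    printf '%s' "$old" > "$path"
}
```

In an ordinary Linux shell, probe_comm_write should return 0. If the recreated sandbox's restrictions are the cause, running it inside that container would be expected to fail the same way the GPU proof did.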

A bare Docker test (docker run --rm --gpus all ubuntu sh -c "echo test >
/proc/1/task/1/comm") succeeds on the same host. This suggests the issue is
specific to how OpenShell sandbox containers are recreated during the GPU
patch flow: the recreated container may be missing security options
(apparmor=unconfined) or capabilities (SYS_PTRACE) that the normal create
path includes.

Environment
Device:        DGX Spark (spark-6087, NVIDIA GB10)
OS:            Ubuntu (kernel 6.11.0-1014-nvidia)
Architecture:  aarch64
Node.js:       v22.22.1
npm:           10.9.4
Docker:        28.3.3 (containerd v0.26.1, runc v1.2.5)
nvidia-ctk:    1.17.8
OpenShell CLI: 0.0.39
NemoClaw:      v0.0.41
OpenClaw:      N/A (install failed before sandbox ready)

Steps to Reproduce
1. On DGX Spark where nvidia-smi correctly reports GPU (e.g. "NVIDIA GB10"):
   export NEMOCLAW_NON_INTERACTIVE=1
   export NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE=1
   export NEMOCLAW_INSTALL_TAG=v0.0.41
2. Run: curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
3. Observe that onboarding proceeds through steps [1/8]-[6/8] and the sandbox image is built
4. GPU patch activates: "Docker-driver GPU patch active; creating sandbox
   first, then recreating the Docker container with GPU access"
5. GPU proof runs

Expected Result
GPU proof passes all three checks (nvidia-smi, /proc/comm write, cuInit)
and sandbox creation completes successfully.

Actual Result
✓ GPU proof passed: nvidia-smi when available
✗ GPU proof failed: /proc//task//comm write
  sh: 1: cannot create /proc/182/task/182/comm: Permission denied

Error: GPU proof failed: /proc//task//comm write (status 2):
  sh: 1: cannot create /proc/182/task/182/comm: Permission denied

Install exits with code 1. No sandbox is usable.

Logs
From CI pipeline 51235361 on runner spark-6087:

Recreating OpenShell Docker sandbox container with NVIDIA GPU access...
Error: sandbox 'my-assistant' is not ready (phase: Provisioning)
✓ Docker GPU mode selected: --gpus all
✓ GPU proof passed: nvidia-smi when available
✗ GPU proof failed: /proc//task//comm write
  sh: 1: cannot create /proc/182/task/182/comm: Permission denied

Comparison: bare Docker GPU container on same host:
  docker run --rm --gpus all ubuntu sh -c "echo test > /proc/1/task/1/comm"
  → exit=0 (succeeds)
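
One way to check the hypothesis from the description is to compare the security settings Docker recorded for the recreated sandbox container against a bare GPU container. This is a minimal sketch: show_security and compare_security are hypothetical helpers, and the sandbox container's name is not given in this issue (it would have to be looked up with docker ps). The AppArmorProfile, HostConfig.SecurityOpt, HostConfig.CapAdd and HostConfig.Privileged fields are part of standard docker inspect output.

```shell
#!/bin/sh
# Hypothetical helpers for comparing container security settings.
# Container names passed in are placeholders chosen by the reader.

# Print the AppArmor profile, security options, added capabilities and
# privileged flag recorded for one container, prefixed with its name.
show_security() {
    docker inspect --format \
        '{{.Name}} apparmor={{.AppArmorProfile}} securityopt={{json .HostConfig.SecurityOpt}} capadd={{json .HostConfig.CapAdd}} privileged={{.HostConfig.Privileged}}' \
        "$1"
}

# Print the settings of two containers one after the other.
compare_security() {
    show_security "$1" || return 1
    show_security "$2" || return 1
}
```

Because the bare test above uses --rm, its container is gone before it can be inspected; a long-lived equivalent such as docker run -d --gpus all --name bare-gpu ubuntu sleep infinity (the name bare-gpu is arbitrary) can be passed as the second argument, with the sandbox container as the first. A difference in the apparmor or capadd fields would support the theory in the description.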

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Install, NemoClaw_Sandbox, NemoClaw-SWQA-RelBlckr-Recommended

[NVB#6175942]

Metadata

Labels

- NV QA (Bugs found by the NVIDIA QA Team)
- UAT (Issues flagged for User Acceptance Testing)
- VDR (Linked to VDR finding)
- area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
- platform: container (Affects Docker, containerd, Podman, or images)
- platform: dgx-spark (Affects DGX Spark hardware or workflows)
