[NemoClaw][OpenShell][R39][Jetson] Nemoclaw install on Jetson fails as GPU passthrough mode is not supported #3710

@coder-glenn

Description

Problem

  • OpenShell k3s cluster ships nvidia-device-plugin with deviceListStrategy: cdi-cri as default
  • CDI strategy is only supported on NVML-based systems — Jetson/Tegra uses nvgpu (unified memory) driver, not NVML
  • Device plugin crashes in CrashLoopBackOff, GPU never registered with kubelet
  • k3s node has 0 allocatable GPUs — every NemoClaw sandbox creation fails
Observed
  • Step 1: Run nemoclaw onboard --yes-i-accept-third-party-software on Jetson Thor/Orin (R39.1)
  • Step 2: Gateway starts successfully, inference configured, sandbox image builds and uploads to gateway
  • Step 3: Sandbox creation fails with:
Error: status: FailedPrecondition, message: "GPU sandbox requested, but the
active gateway has no allocatable GPUs."
  • Step 4: Check device plugin pod:
kubectl get pods -n nvidia-device-plugin
NAME                         READY   STATUS             RESTARTS
nvidia-device-plugin-xxxxx   0/1     CrashLoopBackOff   8+
  • Step 5: Check device plugin logs:
kubectl logs -n nvidia-device-plugin nvidia-device-plugin-xxxxx
E main.go:185] error starting plugins: unable to validate flags:
CDI --device-list-strategy options are only supported on NVML-based systems
  • Step 6: Check node allocatable — GPU missing:
kubectl get nodes -o jsonpath="{.items[0].status.allocatable}"
{"cpu":"14","memory":"128828160Ki","pods":"110"}  # nvidia.com/gpu absent
  • Step 7: Check Helm chart values that cause the crash:
kubectl get helmchart nvidia-device-plugin -n kube-system -o yaml | grep valuesContent -A5
valuesContent: |-
  runtimeClassName: nvidia
  deviceListStrategy: cdi-cri   # <-- wrong for Jetson
  deviceIDStrategy: index
Expected Result
  • OpenShell should detect Jetson/Tegra platform and use deviceListStrategy: envvar
  • Device plugin should start successfully and register nvidia.com/gpu: 1 with kubelet
  • nemoclaw onboard should complete without manual patches on Jetson
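
The platform detection asked for above could look roughly like the sketch below. It is not OpenShell code. The function names and the two override variables (TEGRA_RELEASE_FILE, DT_COMPATIBLE_FILE) are hypothetical, and it assumes that a Jetson host exposes either /etc/nv_tegra_release or an "nvidia,tegra..." entry in /proc/device-tree/compatible.

```shell
# Hypothetical helpers, not part of OpenShell or NemoClaw.
# Assumption: Jetson/Tegra hosts have /etc/nv_tegra_release or a device-tree
# "compatible" entry containing "nvidia,tegra". Both paths can be overridden.
is_tegra() {
  local release_file="${TEGRA_RELEASE_FILE:-/etc/nv_tegra_release}"
  local compat_file="${DT_COMPATIBLE_FILE:-/proc/device-tree/compatible}"

  # The L4T release marker is the cheapest check.
  [ -f "$release_file" ] && return 0

  # The device-tree "compatible" property is NUL-separated, so convert the
  # NULs to newlines before matching.
  [ -r "$compat_file" ] && tr '\0' '\n' < "$compat_file" | grep -q 'nvidia,tegra' && return 0

  return 1
}

# Print the deviceListStrategy value to use for the device plugin chart:
# envvar on Tegra, the current default (cdi-cri) everywhere else.
select_device_list_strategy() {
  if is_tegra; then
    echo envvar
  else
    echo cdi-cri
  fi
}
```

Calling select_device_list_strategy on the gateway host would then give the value to put into the chart's deviceListStrategy field instead of the hardcoded cdi-cri.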
Root Cause
  • OpenShell Helm chart hardcodes deviceListStrategy: cdi-cri — works on x86+NVML, fails on Jetson/Tegra
  • Additionally, the Helm repo URL https://nvidia.github.io/k8s-device-plugin was dead (404), so on some versions the device plugin could not be installed at all (fixed in NemoClaw #241, "fix(gateway): replace dead nvidia-device-plugin Helm repo URL")
  • Jetson uses the nvgpu unified memory driver, so NVML is not available and the NVML-dependent CDI strategies of the device plugin cannot be used
  • The BSP does provide devices.csv and a CDI spec can be generated, but the device plugin fails flag validation before it ever uses them
Workaround (must be applied after every gateway creation)
  • After nemoclaw onboard starts gateway (step 2/8), run the following before sandbox creation:
# Step 1: Fix dead Helm repo URL (download chart from GitHub directly)
curl -sL "https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.18.2/nvidia-device-plugin-0.18.2.tgz" \
  -o /tmp/nvidia-device-plugin.tgz
CHART_B64=$(base64 -w0 /tmp/nvidia-device-plugin.tgz)
# Inner double quotes are escaped so the shell passes valid JSON to kubectl
openshell doctor exec -- kubectl patch helmchart nvidia-device-plugin \
  -n kube-system --type merge \
  --patch "{\"spec\":{\"chartContent\":\"${CHART_B64}\",\"repo\":\"\"}}"

# Step 2: Label node so DaemonSet schedules
openshell doctor exec -- kubectl label node openshell-nemoclaw nvidia.com/gpu.present=true --overwrite

# Step 3: Override device strategies for Tegra
openshell doctor exec -- kubectl set env daemonset/nvidia-device-plugin -n nvidia-device-plugin DEVICE_LIST_STRATEGY=envvar DEVICE_DISCOVERY_STRATEGY=tegra

# Step 4: Wait for pod restart and verify
sleep 20
openshell doctor exec -- kubectl get pods -n nvidia-device-plugin
openshell doctor exec -- kubectl get nodes -o jsonpath="{.items[0].status.allocatable}" | python3 -c "import sys,json; d=json.loads(sys.stdin.read()); print('GPU:', d.get('nvidia.com/gpu','NOT FOUND'))"
# Expected: GPU: 1

# Step 5: Retry sandbox creation
rm -f ~/.nemoclaw/onboard.lock
nemoclaw onboard --yes-i-accept-third-party-software --resume
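
The fixed sleep 20 in Step 4 can be too short on a slow pod restart. A polling loop is one possible alternative, sketched below. The function names are hypothetical. The sketch assumes that openshell doctor exec passes the jsonpath argument through to kubectl unchanged, and that the escaped key nvidia\.com/gpu resolves to the allocatable GPU count once the plugin has registered.

```shell
# Hypothetical helpers for illustration, not part of NemoClaw or OpenShell.

# Print the allocatable nvidia.com/gpu count of the first node (empty if absent).
# Assumption: "openshell doctor exec --" forwards the arguments to kubectl as-is.
fetch_gpu_count() {
  openshell doctor exec -- kubectl get nodes \
    -o jsonpath='{.items[0].status.allocatable.nvidia\.com/gpu}'
}

# Poll until at least one GPU is allocatable.
# Usage: wait_for_gpu [timeout_seconds] [interval_seconds]
wait_for_gpu() {
  local timeout="${1:-120}" interval="${2:-5}" elapsed=0 count
  while :; do
    count=$(fetch_gpu_count 2>/dev/null) || count=""
    case "$count" in
      ''|*[!0-9]*) ;;   # empty or non-numeric: GPU not registered yet
      *)
        if [ "$count" -ge 1 ]; then
          echo "GPU: $count"
          return 0
        fi
        ;;
    esac
    [ "$elapsed" -ge "$timeout" ] && break
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "GPU: NOT FOUND after ${timeout}s" >&2
  return 1
}
```

With this in place, Step 4 could run wait_for_gpu 120 5 and continue to Step 5 only when it returns 0.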

Impact

  • Blocks GPU sandbox creation on ALL Jetson/Tegra devices (Thor, Orin, Orin NX)
  • Every NemoClaw user on Jetson hits this with no documented workaround in official docs
  • Workaround must be reapplied every time the gateway is recreated
Build Details
  • AGX Thor Dev Kit — R39.1, GCID 45375976, kernel 6.8.12-1021-tegra, NemoClaw v0.0.17, OpenShell 0.0.36
  • AGX Orin Dev Kit — R39.1, GCID 45334168, kernel 6.8.12-1021-tegra, NemoClaw v0.0.17, OpenShell 0.0.36
Bug Details

Field        Value
Priority     Unprioritized
Action       Dev - Open - To fix
Disposition  Open issue
Module       Machine Learning - NemoClaw
Engineer     Aaron Erickson
Requester    Abhishek Fatale
Keyword      NemoClaw
Days Open    7

[NVB#6164762]
