Description
Problem
OpenShell k3s cluster ships nvidia-device-plugin with deviceListStrategy: cdi-cri as default
CDI strategy is only supported on NVML-based systems — Jetson/Tegra uses nvgpu (unified memory) driver, not NVML
Device plugin crashes in CrashLoopBackOff, GPU never registered with kubelet
k3s node has 0 allocatable GPUs — every NemoClaw sandbox creation fails
Observed
Step 1: Run nemoclaw onboard --yes-i-accept-third-party-software on Jetson Thor/Orin (R39.1)
Step 2: Gateway starts successfully, inference configured, sandbox image builds and uploads to gateway
Step 3: Sandbox creation fails with:
Error: status: FailedPrecondition, message: "GPU sandbox requested, but the
active gateway has no allocatable GPUs."
Step 4: Check device plugin pod:
kubectl get pods -n nvidia-device-plugin
NAME READY STATUS RESTARTS
nvidia-device-plugin-xxxxx 0/1 CrashLoopBackOff 8+
Step 5: Check device plugin logs:
kubectl logs -n nvidia-device-plugin nvidia-device-plugin-xxxxx
E main.go:185] error starting plugins: unable to validate flags:
CDI --device-list-strategy options are only supported on NVML-based systems
Step 6: Check node allocatable — GPU missing:
kubectl get nodes -o jsonpath="{.items[0].status.allocatable}"
{"cpu":"14","memory":"128828160Ki","pods":"110"} # nvidia.com/gpu absent
Step 7: Check Helm chart values that cause the crash:
kubectl get helmchart nvidia-device-plugin -n kube-system -o yaml | grep valuesContent -A5
valuesContent: |-
runtimeClassName: nvidia
deviceListStrategy: cdi-cri <-- wrong for Jetson
deviceIDStrategy: index
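The checks in steps 5 and 6 can be automated. The following is a minimal sketch, assuming the device plugin log and the allocatable JSON look like the samples above. The function names are hypothetical and are not part of NemoClaw or OpenShell.

```python
import json

# Error text copied from the device plugin log shown in step 5.
CDI_ERROR = "CDI --device-list-strategy options are only supported on NVML-based systems"


def has_cdi_nvml_error(log_text: str) -> bool:
    """Return True if the log contains the CDI/NVML flag validation error."""
    # Collapse whitespace so the match still works when the log line is wrapped.
    return CDI_ERROR in " ".join(log_text.split())


def allocatable_gpus(allocatable_json: str) -> int:
    """Return the nvidia.com/gpu count from status.allocatable JSON, or 0 if absent."""
    allocatable = json.loads(allocatable_json)
    return int(allocatable.get("nvidia.com/gpu", 0))


if __name__ == "__main__":
    # Sample taken from step 6; the nvidia.com/gpu key is missing.
    sample = '{"cpu":"14","memory":"128828160Ki","pods":"110"}'
    print("allocatable GPUs:", allocatable_gpus(sample))
```

Feeding it the output of the kubectl commands from steps 5 and 6 would confirm both symptoms at once: the validation error in the log and a GPU count of 0.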
Expected Result
OpenShell should detect Jetson/Tegra platform and use deviceListStrategy: envvar
Device plugin should start successfully and register nvidia.com/gpu: 1 with kubelet
nemoclaw onboard should complete without manual patches on Jetson
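One way the expected behavior could be sketched is shown below. This is an illustration only, not OpenShell's actual implementation: the function names are hypothetical, and using /etc/nv_tegra_release as the Jetson marker is an assumption about L4T-based root filesystems. The value keys and strategies are the ones shown in step 7 and in the expected result.

```python
from pathlib import Path

# Assumption: L4T/JetPack images ship this file. A real detector might also
# check the device tree or other platform markers.
TEGRA_RELEASE_FILE = Path("/etc/nv_tegra_release")


def is_tegra(release_file: Path = TEGRA_RELEASE_FILE) -> bool:
    """Guess whether the host is a Jetson/Tegra system."""
    return release_file.is_file()


def device_plugin_values(tegra: bool) -> dict:
    """Build hypothetical Helm values for nvidia-device-plugin."""
    values = {"runtimeClassName": "nvidia", "deviceIDStrategy": "index"}
    # cdi-cri needs NVML, so fall back to envvar on Tegra.
    values["deviceListStrategy"] = "envvar" if tegra else "cdi-cri"
    return values


if __name__ == "__main__":
    print(device_plugin_values(is_tegra()))
```

Keeping detection separate from value selection means the selection logic can be tested on any machine, including x86 CI hosts.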
Root Cause
OpenShell Helm chart hardcodes deviceListStrategy: cdi-cri — works on x86+NVML, fails on Jetson/Tegra
- Additionally, the Helm repo URL https://nvidia.github.io/k8s-device-plugin was dead (404), so the device plugin could not be installed at all on some versions (fixed in NemoClaw #241, "fix(gateway): replace dead nvidia-device-plugin Helm repo URL")
- Jetson uses the nvgpu unified memory driver, so CDI and NVML are not applicable
- The BSP does provide devices.csv and a CDI spec can be generated, but the device plugin fails before using them
Workaround (must be applied after every gateway creation)
After nemoclaw onboard starts the gateway (step 2/8), run the following before sandbox creation:
Step 1: Fix dead Helm repo URL (download chart from GitHub directly)
curl -sL "https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.18.2/nvidia-device-plugin-0.18.2.tgz" -o /tmp/nvidia-device-plugin.tgz
CHART_B64=$(base64 -w0 /tmp/nvidia-device-plugin.tgz)
openshell doctor exec -- kubectl patch helmchart nvidia-device-plugin -n kube-system --type merge --patch "{\"spec\":{\"chartContent\":\"${CHART_B64}\",\"repo\":\"\"}}"
Step 2: Label node so DaemonSet schedules
openshell doctor exec -- kubectl label node openshell-nemoclaw nvidia.com/gpu.present=true --overwrite
Step 3: Override device strategies for Tegra
openshell doctor exec -- kubectl set env daemonset/nvidia-device-plugin -n nvidia-device-plugin DEVICE_LIST_STRATEGY=envvar DEVICE_DISCOVERY_STRATEGY=tegra
Step 4: Wait for pod restart and verify
sleep 20
openshell doctor exec -- kubectl get pods -n nvidia-device-plugin
openshell doctor exec -- kubectl get nodes -o jsonpath="{.items[0].status.allocatable}" | python3 -c "import sys,json; d=json.loads(sys.stdin.read()); print('GPU:', d.get('nvidia.com/gpu','NOT FOUND'))"
Expected: GPU: 1
Step 5: Retry sandbox creation
rm -f ~/.nemoclaw/onboard.lock
nemoclaw onboard --yes-i-accept-third-party-software --resume
Impact
- Blocks GPU sandbox creation on ALL Jetson/Tegra devices (Thor, Orin, Orin NX)
- Every NemoClaw user on Jetson hits this with no documented workaround in official docs
- Workaround must be reapplied every time the gateway is recreated
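Because the workaround has to be reapplied after every gateway recreation, it may be worth scripting. The sketch below only builds the three kubectl invocations from the workaround (patch, label, set env) and prints them as a dry run. The function names are hypothetical, and the node name openshell-nemoclaw is taken from the workaround and may differ on other setups. Building the merge patch with json.dumps avoids hand-escaping quotes in the shell.

```python
import json
import shlex
from typing import List

# Every command is run through the gateway, as in the manual workaround.
PREFIX = ["openshell", "doctor", "exec", "--", "kubectl"]


def chart_patch(chart_b64: str) -> str:
    """JSON merge patch that embeds the chart and clears the dead repo URL."""
    return json.dumps({"spec": {"chartContent": chart_b64, "repo": ""}})


def workaround_commands(chart_b64: str, node: str = "openshell-nemoclaw") -> List[List[str]]:
    """Return the workaround commands as argument lists, ready for subprocess.run."""
    return [
        PREFIX + ["patch", "helmchart", "nvidia-device-plugin", "-n", "kube-system",
                  "--type", "merge", "--patch", chart_patch(chart_b64)],
        PREFIX + ["label", "node", node, "nvidia.com/gpu.present=true", "--overwrite"],
        PREFIX + ["set", "env", "daemonset/nvidia-device-plugin",
                  "-n", "nvidia-device-plugin",
                  "DEVICE_LIST_STRATEGY=envvar", "DEVICE_DISCOVERY_STRATEGY=tegra"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in workaround_commands("<base64-encoded chart>"):
        print(shlex.join(cmd))
```

To apply the commands for real, each argument list could be passed to subprocess.run(cmd, check=True) after base64-encoding the downloaded chart archive.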
Build Details
- AGX Thor Dev Kit: R39.1, GCID 45375976, kernel 6.8.12-1021-tegra, NemoClaw v0.0.17, OpenShell 0.0.36
- AGX Orin Dev Kit: R39.1, GCID 45334168, kernel 6.8.12-1021-tegra, NemoClaw v0.0.17, OpenShell 0.0.36
Related
Bug Details
[NVB#6164762]