[NemoClaw][OpenShell][R39][Jetson] NemoClaw install on Jetson fails as GPU passthrough mode is not supported

## Description

Problem
<ul><li>OpenShell k3s cluster ships nvidia-device-plugin with <code>deviceListStrategy: cdi-cri</code> as default</li><li>CDI strategy is only supported on NVML-based systems — Jetson/Tegra uses nvgpu (unified memory) driver, not NVML</li><li>Device plugin crashes in CrashLoopBackOff, GPU never registered with kubelet</li><li>k3s node has 0 allocatable GPUs — every NemoClaw sandbox creation fails</li></ul>
Observed
<ul><li>Step 1: Run <code>nemoclaw onboard --yes-i-accept-third-party-software</code> on Jetson Thor/Orin (R39.1)</li><li>Step 2: Gateway starts successfully, inference configured, sandbox image builds and uploads to gateway</li><li>Step 3: Sandbox creation fails with:</li></ul>
<pre>Error: status: FailedPrecondition, message: "GPU sandbox requested, but the
active gateway has no allocatable GPUs."
</pre><ul><li>Step 4: Check device plugin pod:</li></ul>
<pre>kubectl get pods -n nvidia-device-plugin
NAME READY STATUS RESTARTS
nvidia-device-plugin-xxxxx 0/1 CrashLoopBackOff 8+
</pre><ul><li>Step 5: Check device plugin logs:</li></ul>
<pre>kubectl logs -n nvidia-device-plugin nvidia-device-plugin-xxxxx
E main.go:185] error starting plugins: unable to validate flags:
CDI --device-list-strategy options are only supported on NVML-based systems
</pre><ul><li>Step 6: Check node allocatable — GPU missing:</li></ul>
<pre>kubectl get nodes -o jsonpath="{.items[0].status.allocatable}"
{"cpu":"14","memory":"128828160Ki","pods":"110"} # nvidia.com/gpu absent
</pre><ul><li>Step 7: Check Helm chart values that cause the crash:</li></ul>
<pre>kubectl get helmchart nvidia-device-plugin -n kube-system -o yaml | grep valuesContent -A5
valuesContent: |-
 runtimeClassName: nvidia
 deviceListStrategy: cdi-cri   # wrong for Jetson
 deviceIDStrategy: index
</pre>Expected Result
<ul><li>OpenShell should detect Jetson/Tegra platform and use <code>deviceListStrategy: envvar</code></li><li>Device plugin should start successfully and register <code>nvidia.com/gpu: 1</code> with kubelet</li><li><code>nemoclaw onboard</code> should complete without manual patches on Jetson</li></ul>
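
The expected platform detection could look roughly like the sketch below. It is not OpenShell's implementation. The function name `select_device_list_strategy` is hypothetical, and the check assumes that `/etc/nv_tegra_release` is present on Jetson (L4T) hosts and absent on NVML-based systems, which should be confirmed for R39.

```shell
# Sketch only: pick a deviceListStrategy value from the host platform.
# Assumption: the marker file /etc/nv_tegra_release identifies a Jetson/Tegra host.
# The optional first argument overrides the marker path (useful for testing).
select_device_list_strategy() {
  if [ -e "${1:-/etc/nv_tegra_release}" ]; then
    # Tegra has no NVML, so the device plugin rejects CDI strategies.
    echo "envvar"
  else
    # Current OpenShell default, valid on NVML-based systems.
    echo "cdi-cri"
  fi
}
```

The printed value would then be written into the `deviceListStrategy` field of the Helm chart's `valuesContent` instead of the hardcoded `cdi-cri`.
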
Root Cause
<ul><li>OpenShell Helm chart hardcodes <code>deviceListStrategy: cdi-cri</code> — works on x86+NVML, fails on Jetson/Tegra</li><li>Additionally, the Helm repo URL <code>https://nvidia.github.io/k8s-device-plugin</code> was dead (404) — device plugin could not be installed at all on some versions (fixed in NemoClaw #241)</li><li>Jetson uses <code>nvgpu</code> unified memory driver — CDI and NVML not applicable</li><li>BSP does provide <code>devices.csv</code> and CDI spec can be generated — but device plugin fails before using them</li></ul>
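
To confirm that a failing device is hitting this root cause and not a different GPU detection problem, the two signatures from the Observed section can be checked mechanically. This is a minimal sketch: the helper names `is_cdi_nvml_failure` and `has_allocatable_gpu` are hypothetical, and both read text on stdin.

```shell
# Sketch only: helpers that recognize the failure signatures quoted in this report.

# Succeeds when the device plugin log on stdin contains the CDI/NVML validation error.
is_cdi_nvml_failure() {
  grep -qF "only supported on NVML-based systems"
}

# Succeeds when the node allocatable JSON on stdin advertises nvidia.com/gpu.
has_allocatable_gpu() {
  grep -qF '"nvidia.com/gpu"'
}
```

For example, `openshell doctor exec -- kubectl logs -n nvidia-device-plugin nvidia-device-plugin-xxxxx | is_cdi_nvml_failure` would return success on an affected gateway (with the real pod name substituted), assuming `openshell doctor exec` forwards the command's stdout.
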
Workaround (must be applied after every gateway creation)
<ul><li>After <code>nemoclaw onboard</code> starts gateway (step 2/8), run the following before sandbox creation:</li></ul>
<pre># Step 1: Fix dead Helm repo URL (download chart from GitHub directly)
curl -sL "https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.18.2/nvidia-device-plugin-0.18.2.tgz" -o /tmp/nvidia-device-plugin.tgz
CHART_B64=$(base64 -w0 /tmp/nvidia-device-plugin.tgz)
openshell doctor exec -- kubectl patch helmchart nvidia-device-plugin -n kube-system --type merge --patch "{\"spec\":{\"chartContent\":\"${CHART_B64}\",\"repo\":\"\"}}"

# Step 2: Label node so DaemonSet schedules
openshell doctor exec -- kubectl label node openshell-nemoclaw nvidia.com/gpu.present=true --overwrite

# Step 3: Override device strategies for Tegra
openshell doctor exec -- kubectl set env daemonset/nvidia-device-plugin -n nvidia-device-plugin DEVICE_LIST_STRATEGY=envvar DEVICE_DISCOVERY_STRATEGY=tegra

# Step 4: Wait for pod restart and verify
sleep 20
openshell doctor exec -- kubectl get pods -n nvidia-device-plugin
openshell doctor exec -- kubectl get nodes -o jsonpath="{.items[0].status.allocatable}" | python3 -c "import sys,json; d=json.loads(sys.stdin.read()); print('GPU:', d.get('nvidia.com/gpu','NOT FOUND'))"
# Expected: GPU: 1

# Step 5: Retry sandbox creation
rm -f ~/.nemoclaw/onboard.lock
nemoclaw onboard --yes-i-accept-third-party-software --resume
</pre>Impact
<ul><li>Blocks GPU sandbox creation on ALL Jetson/Tegra devices (Thor, Orin, Orin NX)</li><li>Every NemoClaw user on Jetson hits this with no documented workaround in official docs</li><li>Workaround must be reapplied every time the gateway is recreated</li></ul>
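
Because the workaround has to be reapplied on every gateway recreation, anyone scripting it may prefer polling over the fixed `sleep 20` in Step 4. The helper below is a sketch under that assumption. The name `wait_until` is hypothetical and it is not part of NemoClaw or OpenShell.

```shell
# Sketch only: retry a command until it succeeds or a timeout expires.
# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
wait_until() {
  wu_timeout="$1"
  wu_interval="$2"
  shift 2
  wu_deadline=$(( $(date +%s) + wu_timeout ))
  until "$@"; do
    # Give up once the deadline has passed.
    [ "$(date +%s)" -ge "$wu_deadline" ] && return 1
    sleep "$wu_interval"
  done
  return 0
}
```

One possible use is `wait_until 120 5 sh -c 'openshell doctor exec -- kubectl get nodes -o jsonpath="{.items[0].status.allocatable}" | grep -qF nvidia.com/gpu'`, which returns as soon as the node reports the GPU and fails after two minutes otherwise.
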
Build Details
<ul><li>AGX Thor Dev Kit — R39.1, GCID 45375976, kernel 6.8.12-1021-tegra, NemoClaw v0.0.17, OpenShell 0.0.36</li><li>AGX Orin Dev Kit — R39.1, GCID 45334168, kernel 6.8.12-1021-tegra, NemoClaw v0.0.17, OpenShell 0.0.36</li></ul>
Related
<ul><li>GitHub NemoClaw #404: Jetson Orin Nano GPU detection (open PR #405)</li><li>GitHub NemoClaw #241: Dead Helm repo URL (closed)</li><li>NVBug 6034621: NO GPU Detected when onboarding</li><li>NVBug 6164633: Docker Hub 429 blocks NemoClaw onboard</li><li>NVBug 6164625: Previous draft (superseded by this bug)</li></ul>

## Bug Details

| Field | Value |
|-------|-------|
| Priority | Unprioritized |
| Action | Dev - Open - To fix |
| Disposition | Open issue |
| Module | Machine Learning - NemoClaw |
| Engineer | Aaron Erickson |
| Requester | Abhishek Fatale |
| Keyword | NemoClaw |
| Days Open | 7 |

---
[NVB#6164762]
