# fix(gateway): replace dead nvidia-device-plugin Helm repo URL

## Summary

The NVIDIA k8s-device-plugin Helm chart repository
(`https://nvidia.github.io/k8s-device-plugin`) no longer serves a chart
index: requests return HTTP 404. This URL is hardcoded in the k3s addon
manifest baked
into the `ghcr.io/nvidia/openshell/cluster` gateway image. As a result,
every GPU-enabled gateway startup fails to install the device plugin,
making GPUs un-allocatable and causing all GPU sandbox creation attempts
to fail.

---

## Affected file

```
ghcr.io/nvidia/openshell/cluster:0.0.8
  └── /var/lib/rancher/k3s/server/manifests/nvidia-device-plugin-helmchart.yaml
```

---

## Root cause

The NVIDIA k8s-device-plugin project previously hosted its Helm chart
index on GitHub Pages at `https://nvidia.github.io/k8s-device-plugin`.
That endpoint now returns HTTP 404. The k3s Helm controller inside the
gateway container tries to reach it on every gateway start; the
`helm-install-nvidia-device-plugin` job pod fails at `helm repo add` and
is retried with back-off without ever succeeding. The device plugin
DaemonSet is never deployed, so `nvidia.com/gpu` resources are never
advertised to the scheduler.

---

## Affected scenarios

| Scenario | Affected? | Why |
|---|---|---|
| `nemoclaw onboard` — first run | **Yes** | Gateway starts fresh; Helm install fails immediately |
| `nemoclaw onboard` — re-run / recreate sandbox | **Yes** | Onboard always restarts the gateway, resetting the cluster state |
| Any system restart that causes the gateway container to restart | **Yes** | Same k3s reinitialisation, same failure |
| Non-GPU sandboxes | No | `nvidia-device-plugin-helmchart.yaml` is only copied into manifests when `GPU_ENABLED=true` |
| Air-gapped / offline deployments | **Yes** (additionally) | Even if the URL were live, an offline environment would have no route to GitHub Pages |
| Cloud instances with H100/H200/A100/etc. | **Yes** | Any instance class that passes the GPU preflight check will request a GPU sandbox and hit this |

---

## Symptom

```
Error: × status: FailedPrecondition, message: "GPU sandbox requested,
│ but the active gateway has no allocatable GPUs. Please refer to
│ documentation and use `openshell doctor` commands to inspect GPU
│ support and gateway configuration."
```

Confirmed via:

```
$ openshell doctor exec -- kubectl get pods -n kube-system | grep nvidia
helm-install-nvidia-device-plugin-xxxxx   0/1   Error   5 (2m ago)   4m

$ openshell doctor exec -- kubectl logs -n kube-system \
    helm-install-nvidia-device-plugin-xxxxx
...
+ helm repo add nvidia-device-plugin https://nvidia.github.io/k8s-device-plugin
Error: looks like "https://nvidia.github.io/k8s-device-plugin" is not a valid
chart repository or cannot be reached: failed to fetch
https://nvidia.github.io/k8s-device-plugin/index.yaml : 404 Not Found
```

---

## Proposed fix

Replace `spec.repo` + `spec.chart` with a direct absolute URL pointing
to the chart tgz on the NVIDIA k8s-device-plugin GitHub release. k3s's
Helm controller accepts an absolute HTTPS URL in `spec.chart` and skips
the `helm repo add` step entirely.

**Before** (`nvidia-device-plugin-helmchart.yaml`):

```yaml
spec:
  repo: https://nvidia.github.io/k8s-device-plugin
  chart: nvidia-device-plugin
  version: "0.18.2"
```

**After**:

```yaml
spec:
  chart: https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.18.2/nvidia-device-plugin-0.18.2.tgz
  # repo and version removed — not used with absolute chart URL
```
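
The release URL embeds the chart version twice (once in the tag, once in the file name), so a version bump can easily update one and miss the other. As a rough sketch, both can be derived from a single value. The helper names `chart_url` and `chart_reachable` are hypothetical and not part of the gateway image; the URL pattern is the one shown in the manifest above.

```bash
# Hypothetical helper: build the release-asset URL from one version string,
# so the tag (v0.18.2) and the file name (…-0.18.2.tgz) cannot drift apart.
chart_url() {
  local version="$1"
  printf 'https://github.com/NVIDIA/k8s-device-plugin/releases/download/v%s/nvidia-device-plugin-%s.tgz\n' \
    "$version" "$version"
}

# Hypothetical pre-build check: succeeds only if the asset answers a HEAD
# request (following redirects) without an HTTP error. Requires network access.
chart_reachable() {
  curl -fsSIL -o /dev/null "$(chart_url "$1")"
}

# Example (not run here):
#   chart_reachable 0.18.2 && echo "chart asset is reachable"
```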

### Alternative: embed the chart in the image (`spec.chartContent`)

If internet access at gateway startup cannot be guaranteed (e.g.
air-gapped deployments), bundle the chart tgz inside the container image
and use `spec.chartContent` (base64-encoded tgz). This eliminates all
outbound Helm fetching:

```dockerfile
COPY nvidia-device-plugin-0.18.2.tgz /opt/charts/
```

```yaml
spec:
  chartContent: <base64 of nvidia-device-plugin-0.18.2.tgz>
```

This is the approach used in the workaround script below, and it works
reliably regardless of network conditions.
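
The base64 value has to be generated from the bundled archive at some point, for example during the image build. The following is a minimal sketch of that step, not the image's actual build logic. The function name `render_helmchart` is hypothetical, the resource name and namespace are taken from the `kubectl patch` command in the workaround, and any other `spec` fields from the existing manifest (values, target namespace, and so on) are omitted here and would need to be carried over.

```bash
# Hypothetical build-time helper: print a HelmChart manifest that embeds
# the chart archive given as $1 via spec.chartContent.
render_helmchart() {
  local chart_tgz="$1"
  [ -r "$chart_tgz" ] || { echo "cannot read $chart_tgz" >&2; return 1; }

  # base64 -w0 (GNU coreutils) emits one unwrapped line, which keeps the
  # value a valid single-line YAML scalar.
  local chart_b64
  chart_b64=$(base64 -w0 "$chart_tgz") || return 1

  cat <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  chartContent: ${chart_b64}
EOF
}

# Example (not run here):
#   render_helmchart /opt/charts/nvidia-device-plugin-0.18.2.tgz \
#     > nvidia-device-plugin-helmchart.yaml
```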

---

## Workaround (until the image is updated)

Apply after `nemoclaw onboard` step 2 (gateway ready), before step 3
(sandbox creation):

```bash
# Pinned chart version and local download path
CHART_VERSION="0.18.2"
CHART_TGZ="/tmp/nvidia-device-plugin-${CHART_VERSION}.tgz"

# -f makes curl fail on an HTTP error instead of saving the error page
curl -fsSL \
  "https://github.com/NVIDIA/k8s-device-plugin/releases/download/v${CHART_VERSION}/nvidia-device-plugin-${CHART_VERSION}.tgz" \
  -o "$CHART_TGZ"

# base64 -w0 (GNU coreutils) emits the archive as a single unwrapped line
CHART_B64=$(base64 -w0 "$CHART_TGZ")

# Embed the chart in the HelmChart resource and clear the dead repo URL
openshell doctor exec -- kubectl patch helmchart nvidia-device-plugin \
  -n kube-system \
  --type merge \
  --patch "{\"spec\":{\"chartContent\":\"${CHART_B64}\",\"repo\":\"\"}}"
```

Alternatively, use the wrapper script `nemoclaw-onboard.sh` (provided
separately), which applies this patch automatically in the background
during `nemoclaw onboard`.
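
Nothing in the manual steps checks that the downloaded file is actually a chart archive, so a truncated or otherwise bad download would be base64-encoded and patched in as-is. A small sanity check before the `kubectl patch` call could look like the sketch below. The function name `is_chart_archive` is hypothetical, and the check only assumes the usual Helm package layout of a single top-level directory containing `Chart.yaml`.

```bash
# Hypothetical sanity check: succeed only if $1 is a gzip-compressed tar
# archive that contains <top-level-dir>/Chart.yaml.
is_chart_archive() {
  local chart_tgz="$1"
  # If tar cannot read the file it lists nothing, so grep fails as well.
  tar -tzf "$chart_tgz" 2>/dev/null | grep -qE '^[^/]+/Chart\.yaml$'
}

# Example (not run here), placed between the download and the patch:
#   is_chart_archive "$CHART_TGZ" || { echo "bad chart download" >&2; exit 1; }
```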

---

## Test plan

- [ ] Fresh `nemoclaw onboard` on an NVIDIA GPU instance reaches step 3
  without "no allocatable GPUs" error
- [ ] `openshell doctor exec -- kubectl get pods -n nvidia-device-plugin`
  shows device plugin and GFD pods `Running`
- [ ] `openshell doctor exec -- kubectl get nodes -o json` shows
  `nvidia.com/gpu: 1` (or N) under `status.allocatable`
- [ ] GPU sandbox creation succeeds end-to-end
- [ ] Non-GPU onboard path is unaffected
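
The allocatable-GPU check can be scripted rather than read off the node JSON by eye. The sketch below sums per-node counts fed to it on stdin, one per line. The function name `sum_allocatable_gpus` is hypothetical, and the `jsonpath` expression in the usage comment is an assumption about how the counts would be extracted, relying on kubectl's backslash escaping for the dot in `nvidia.com/gpu`.

```bash
# Hypothetical helper: read one allocatable-GPU count per line on stdin
# (blank lines are nodes without GPUs) and print the cluster-wide total.
sum_allocatable_gpus() {
  awk 'NF { total += $1 } END { print total + 0 }'
}

# Example (not run here): a total of 0 means the device plugin is still
# not advertising nvidia.com/gpu.
#   openshell doctor exec -- kubectl get nodes \
#     -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
#     | sum_allocatable_gpus
```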

---

## Environment where bug was confirmed

- Instance: NVIDIA H200 (143 GB VRAM), CUDA 12.8, Ubuntu 24.04
- `nemoclaw` version: 0.0.7 (openshell)
- Gateway image: `ghcr.io/nvidia/openshell/cluster:0.0.8`
- Date confirmed: 2026-03-17