
# fix(gateway): replace dead nvidia-device-plugin Helm repo URL #241

@perlowjanv


Summary

The NVIDIA k8s-device-plugin Helm chart repository
(https://nvidia.github.io/k8s-device-plugin) no longer serves a chart
index: requests for index.yaml return HTTP 404. This URL is hardcoded
in the k3s addon manifest baked into the
ghcr.io/nvidia/openshell/cluster gateway image. As a result, every
GPU-enabled gateway startup fails to install the device plugin, so no
GPUs become allocatable and every GPU sandbox creation attempt fails.


Affected file

ghcr.io/nvidia/openshell/cluster:0.0.8
  └── /var/lib/rancher/k3s/server/manifests/nvidia-device-plugin-helmchart.yaml

Root cause

The NVIDIA k8s-device-plugin project previously hosted its Helm chart
index on GitHub Pages at https://nvidia.github.io/k8s-device-plugin.
That endpoint now returns 404. The k3s Helm controller inside the
gateway container tries to reach it on every gateway start, and the
resulting helm-install pod fails and is restarted in a loop (Error /
CrashLoopBackOff). The device plugin DaemonSet is never deployed, so
nvidia.com/gpu resources are never advertised to the scheduler.


Affected scenarios

| Scenario | Affected? | Why |
|---|---|---|
| nemoclaw onboard (first run) | Yes | Gateway starts fresh; Helm install fails immediately |
| nemoclaw onboard (re-run / recreate sandbox) | Yes | Onboard always restarts the gateway, resetting the cluster state |
| Any system restart that causes the gateway container to restart | Yes | Same k3s reinitialisation, same failure |
| Non-GPU sandboxes | No | nvidia-device-plugin-helmchart.yaml is only copied into manifests when GPU_ENABLED=true |
| Air-gapped / offline deployments | Yes (additionally) | Even if the URL were live, an offline environment would have no route to GitHub Pages |
| Cloud instances with H100/H200/A100/etc. | Yes | Any instance class that passes the GPU preflight check will request a GPU sandbox and hit this |

Symptom

Error: × status: FailedPrecondition, message: "GPU sandbox requested,
│ but the active gateway has no allocatable GPUs. Please refer to
│ documentation and use `openshell doctor` commands to inspect GPU
│ support and gateway configuration."

Confirmed via:

$ openshell doctor exec -- kubectl get pods -n kube-system | grep nvidia
helm-install-nvidia-device-plugin-xxxxx   0/1   Error   5 (2m ago)   4m

$ openshell doctor exec -- kubectl logs -n kube-system \
    helm-install-nvidia-device-plugin-xxxxx
...
+ helm repo add nvidia-device-plugin https://nvidia.github.io/k8s-device-plugin
Error: looks like "https://nvidia.github.io/k8s-device-plugin" is not a valid
chart repository or cannot be reached: failed to fetch
https://nvidia.github.io/k8s-device-plugin/index.yaml : 404 Not Found
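
The same failure can be checked from outside the cluster by requesting the repository index directly. The following is a minimal sketch, assuming curl is installed; the function names are illustrative and are not part of openshell or nemoclaw.

```shell
#!/bin/sh
# Hypothetical helpers (not part of openshell/nemoclaw).

# Build the index URL that Helm fetches for a chart repository.
helm_index_url() {
  # Strip one trailing slash, if present, then append index.yaml.
  printf '%s/index.yaml\n' "${1%/}"
}

# Print the HTTP status code returned for the repository index.
# curl prints 000 when the host cannot be reached at all.
helm_index_status() {
  curl -sS -o /dev/null -L --max-time 10 -w '%{http_code}\n' \
    "$(helm_index_url "$1")" 2>/dev/null
}
```

While the repository is down, `helm_index_status https://nvidia.github.io/k8s-device-plugin` would be expected to print 404, matching the log above.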

Proposed fix

Replace spec.repo + spec.chart with a direct absolute URL pointing
to the chart tgz on the NVIDIA k8s-device-plugin GitHub release. k3s's
Helm controller accepts an absolute HTTPS URL in spec.chart and skips
the helm repo add step entirely.

Before (nvidia-device-plugin-helmchart.yaml):

spec:
  repo: https://nvidia.github.io/k8s-device-plugin
  chart: nvidia-device-plugin
  version: "0.18.2"

After:

spec:
  chart: https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.18.2/nvidia-device-plugin-0.18.2.tgz
  # repo and version removed — not used with absolute chart URL

Alternative: embed the chart in the image (spec.chartContent)

If internet access at gateway startup cannot be guaranteed (e.g.
air-gapped deployments), bundle the chart tgz inside the container image
and use spec.chartContent (base64-encoded tgz). This eliminates all
outbound Helm fetching:

# Dockerfile
COPY nvidia-device-plugin-0.18.2.tgz /opt/charts/

# nvidia-device-plugin-helmchart.yaml
spec:
  chartContent: <base64 of nvidia-device-plugin-0.18.2.tgz>

This is the approach used in the workaround script below, and it works
reliably regardless of network conditions.
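
One way the manifest could be rendered at image build time is sketched below. It assumes the resource is named nvidia-device-plugin in kube-system (as in the kubectl commands in this report) and omits any other fields the real manifest carries; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical build-time helper: render a HelmChart manifest with the chart
# tarball inlined as base64.
# Assumptions: resource name and namespace are taken from the kubectl commands
# in this report; other fields of the real manifest (targetNamespace, values,
# ...) are not reproduced here.
render_helmchart_manifest() {
  chart_tgz="$1"
  # Encode without line wraps; tr keeps this portable across base64 variants.
  chart_b64=$(base64 < "$chart_tgz" | tr -d '\n')
  cat <<EOF
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  chartContent: ${chart_b64}
EOF
}
```

The output could be redirected to /var/lib/rancher/k3s/server/manifests/nvidia-device-plugin-helmchart.yaml during the image build, so the gateway never needs to fetch the chart at startup.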


Workaround (until the image is updated)

Apply after nemoclaw onboard step 2 (gateway ready), before step 3
(sandbox creation):

CHART_VERSION="0.18.2"
CHART_TGZ="/tmp/nvidia-device-plugin-${CHART_VERSION}.tgz"

curl -sL \
  "https://github.com/NVIDIA/k8s-device-plugin/releases/download/v${CHART_VERSION}/nvidia-device-plugin-${CHART_VERSION}.tgz" \
  -o "$CHART_TGZ"

CHART_B64=$(base64 -w0 "$CHART_TGZ")

openshell doctor exec -- kubectl patch helmchart nvidia-device-plugin \
  -n kube-system \
  --type merge \
  --patch "{\"spec\":{\"chartContent\":\"${CHART_B64}\",\"repo\":\"\"}}"

Or use the wrapper script nemoclaw-onboard.sh (provided separately)
which applies this patch automatically in the background during
nemoclaw onboard.
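
Since the patch can only succeed once the gateway has created the HelmChart resource, a background wrapper needs some form of retry. The sketch below shows one possible shape; it is not the contents of nemoclaw-onboard.sh, and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; the actual nemoclaw-onboard.sh may differ.

# Build the merge-patch body. Base64 output only contains [A-Za-z0-9+/=],
# so it can be embedded in JSON without further escaping.
build_chart_patch() {
  printf '{"spec":{"chartContent":"%s","repo":""}}\n' "$1"
}

# Retry a command until it succeeds or the attempt budget runs out.
# usage: retry ATTEMPTS DELAY_SECONDS command [args...]
retry() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (attempt count and delay are arbitrary):
#   retry 30 5 openshell doctor exec -- kubectl patch helmchart \
#     nvidia-device-plugin -n kube-system --type merge \
#     --patch "$(build_chart_patch "$CHART_B64")"
```

Building the JSON with printf avoids the nested quote escaping in the manual command and makes the patch body easy to inspect before it is applied.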


Test plan

  • Fresh nemoclaw onboard on an NVIDIA GPU instance reaches step 3
    without "no allocatable GPUs" error
  • openshell doctor exec -- kubectl get pods -n nvidia-device-plugin
    shows device plugin and GFD pods Running
  • openshell doctor exec -- kubectl get nodes -o json shows
    nvidia.com/gpu: 1 (or N) under status.allocatable
  • GPU sandbox creation succeeds end-to-end
  • Non-GPU onboard path is unaffected
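
The allocatable-GPU check in the plan above can be scripted rather than read by eye. A minimal sketch, assuming the per-node counts are extracted with kubectl's JSONPath output; the function name is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: sum per-node GPU counts read from stdin
# (whitespace-separated integers). Prints 0 for empty input.
total_allocatable_gpus() {
  total=0
  for n in $(cat); do
    total=$((total + n))
  done
  printf '%s\n' "$total"
}

# Example: nodes that do not advertise nvidia.com/gpu contribute nothing.
#   openshell doctor exec -- kubectl get nodes \
#     -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}' \
#     | total_allocatable_gpus
```

A result of 0 corresponds to the "no allocatable GPUs" error; any positive value means the device plugin has registered GPUs with the scheduler.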

Environment where bug was confirmed

  • Instance: NVIDIA H200 (143 GB VRAM), CUDA 12.8, Ubuntu 24.04
  • nemoclaw version: 0.0.7 (openshell)
  • Gateway image: ghcr.io/nvidia/openshell/cluster:0.0.8
  • Date confirmed: 2026-03-17

    Labels

    area: install, area: onboarding
