On JetPack R39 (L4T 39.1) hosts, nemoclaw onboard completes the sandbox image build and gateway upload but fails at [7/8] Setting up OpenClaw inside sandbox with:
× status: FailedPrecondition, message: "sandbox is not ready"
Command failed (exit 1): openshell sandbox connect <sandbox-name>
Root cause: scripts/setup-jetson.sh early-returns on any L4T release ≥ 39 with the message "this version does not require any host setup", skipping the modprobe br_netfilter + sysctl net.bridge.bridge-nf-call-iptables=1 lines that live in configure_jetson_host(). Without br_netfilter loaded, Linux iptables
does not process traffic traversing a bridge interface, so the kube-proxy NAT rules written by k3s (running inside the OpenShell gateway container) never match pod→ClusterIP packets. Sandbox agent pods then fail to resolve openshell.openshell.svc.cluster.local via the cluster DNS service, crash on startup
(Temporary failure in name resolution), and nemoclaw deletes the never-Ready sandbox.
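A quick way to tell whether a host is in the state described above is to look at the bridge sysctl directly. The following is a minimal sketch, not part of NemoClaw: the function name and the PROC_ROOT override are hypothetical, and the check relies on the fact that the net.bridge.* sysctls only exist while br_netfilter is loaded (or built into the kernel).

```shell
# Sketch: report whether bridged traffic is being handed to iptables.
# PROC_ROOT is a hypothetical override so the function can be pointed at a
# scratch directory; leave it unset to inspect the real /proc.
check_bridge_netfilter() {
  local proc_root="${PROC_ROOT:-/proc}"
  local knob="$proc_root/sys/net/bridge/bridge-nf-call-iptables"
  local value

  # No sysctl file means the module is not loaded.
  if [ ! -r "$knob" ]; then
    echo "br_netfilter: not loaded"
    return 1
  fi

  value="$(cat "$knob")"
  if [ "$value" = "1" ]; then
    echo "bridge-nf-call-iptables: enabled"
    return 0
  fi

  echo "bridge-nf-call-iptables: disabled (value=$value)"
  return 2
}
```

On an affected R39 host, running check_bridge_netfilter before nemoclaw onboard would be expected to report the module as not loaded.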
Environment
Jetson Orin AGX
JetPack R39 (L4T 39.1.0), kernel 6.8.12-1018-tegra (and later kernels; both tested)
nemoclaw onboard with any inference provider and model (repro'd with "Other OpenAI-compatible endpoint").
Expected: onboard runs to completion, sandbox reaches Ready, dashboard comes up.
Actual: builds + uploads image fine. At [7/8] fails with "sandbox is not ready". kubectl into the gateway's k3s (below) shows the sandbox pod's main container in CrashLoopBackOff with DNS errors.
Diagnostic Chain
Sandbox pod agent container crash-loops on cluster DNS
$ sudo docker exec openshell-cluster-nemoclaw kubectl describe pod my-assistant -n openshell | tail -30
...
agent:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Restart Count: 156 # over 13 hours
$ sudo docker exec openshell-cluster-nemoclaw kubectl logs my-assistant -n openshell -c agent --previous --tail=20
openshell: log push connect failed: failed to connect to OpenShell server
Error: × failed to connect to OpenShell server
├─▶ transport error
├─▶ dns error
├─▶ dns error
╰─▶ failed to lookup address information: Temporary failure in name resolution
The agent's OPENSHELL_ENDPOINT is https://openshell.openshell.svc.cluster.local:8080, which the agent resolves via the cluster DNS service.
Service IP unreachable from new pods, pod IP works
Running nslookup from a fresh test pod in the same namespace, targeting CoreDNS explicitly:
# via CoreDNS POD IP (bypasses kube-proxy / service NAT):
$ kubectl run dns-pod --image=busybox -n openshell --rm -it --command -- \
nslookup openshell.openshell.svc.cluster.local 10.42.0.4
Server: 10.42.0.4
Name: openshell.openshell.svc.cluster.local
Address: 10.43.225.49 # ← works
# via CoreDNS SERVICE IP (normal pod DNS path, via kube-proxy NAT):
$ kubectl run dns-svc --image=busybox -n openshell --rm -it --command -- \
nslookup openshell.openshell.svc.cluster.local 10.43.0.10
;; connection timed out; no servers could be reached
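The reasoning behind the two probes can be written down as a small decision table. This is only an illustrative sketch: the function name and messages are made up here, and it assumes you have already reduced each probe to an exit status (0 meaning the lookup succeeded), however you choose to capture that.

```shell
# Sketch: interpret the pod-IP probe and the service-IP probe together.
# $1 = status of the lookup against the CoreDNS pod IP
# $2 = status of the lookup against the CoreDNS service (ClusterIP)
classify_dns_path() {
  local pod_rc="$1"
  local svc_rc="$2"

  if [ "$pod_rc" -eq 0 ] && [ "$svc_rc" -eq 0 ]; then
    echo "ok: CoreDNS reachable via pod IP and service IP"
  elif [ "$pod_rc" -eq 0 ]; then
    # CoreDNS itself is healthy; only the kube-proxy NAT path is broken.
    echo "service-nat: CoreDNS answers on its pod IP but not via the ClusterIP"
  elif [ "$svc_rc" -eq 0 ]; then
    echo "inconclusive: service IP works but the pod IP probe failed"
  else
    echo "dns-down: CoreDNS unreachable on both paths"
  fi
}
```

The output in this report corresponds to classify_dns_path 0 1, the case that points at service NAT rather than at CoreDNS.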
Key observations from the onboard log:
[2/8]–[6/8] succeed. The sandbox image builds and uploads (724 MiB).
The first symptom of the underlying bug appears during the "Setting up sandbox DNS proxy" step inside [6/8]: container not found ("agent"). kubectl can't exec into the agent container because it has been in CrashLoopBackOff since the pod's init container finished (the workspace-init initContainer succeeds; the main agent container crashes on startup because it can't resolve openshell.openshell.svc.cluster.local).
[7/8] then fails with the external "sandbox is not ready" from openshell sandbox connect.
Root Cause (in code)
scripts/setup-jetson.sh:34-37:
if (( release >= 39 )); then
  info "Jetson detected (L4T $l4t_version) — this version does not require any host setup" >&2
  return 0
fi
Because get_jetpack_version() returns empty for R39, main() exits before calling configure_jetson_host(), where the common modprobe br_netfilter / sysctl / /etc/modules-load.d/nemoclaw.conf lines live (lines 127–132 of the same file). The JP6 and JP7-R38 branches reach those lines via configure_jetson_host(); R39 doesn't.
The assumption that R39 requires no host setup appears to be incorrect once k3s inside the OpenShell gateway container is involved — the module + sysctl are needed regardless of the JetPack version for in-gateway ClusterIP service routing to work.
Proposed Fix
Two small edits to scripts/setup-jetson.sh:
In get_jetpack_version(): drop the >= 39 early-return, add a 39.*) printf "%s" "jp7-r39" ;; case.
In configure_jetson_host(): add a jp7-r39) ;; no-op branch (no iptables-legacy / daemon.json changes needed, same as R38) so the common modprobe + sysctl + persistence lines at the bottom of the function run.
Happy to send a PR with this.
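For illustration, the two edits might take roughly the following shape. This is a hypothetical sketch, not the real contents of scripts/setup-jetson.sh: only the 39.* case and the jp7-r39 no-op branch come from the proposal, while the function names, the jp6 / jp7-r38 labels, and the 36.* / 38.* patterns are assumed. The common steps are printed rather than executed so the sketch runs without root.

```shell
# Hypothetical sketch of the proposed control flow.

get_jetpack_version_sketch() {
  case "$1" in
    36.*) printf "%s" "jp6" ;;      # assumed label
    38.*) printf "%s" "jp7-r38" ;;  # assumed label
    39.*) printf "%s" "jp7-r39" ;;  # replaces the ">= 39" early return
    *)    printf "%s" "" ;;         # unknown release
  esac
}

configure_jetson_host_sketch() {
  case "$1" in
    jp6)     echo "# jp6-only steps (iptables-legacy / daemon.json) would run here" ;;
    jp7-r38) ;;  # nothing release-specific
    jp7-r39) ;;  # new no-op branch, same as R38
    *)       return 1 ;;
  esac

  # Common lines that every supported release should reach.
  echo "modprobe br_netfilter"
  echo "sysctl -w net.bridge.bridge-nf-call-iptables=1"
  echo "echo br_netfilter > /etc/modules-load.d/nemoclaw.conf"
}
```

With this shape, configure_jetson_host_sketch "$(get_jetpack_version_sketch 39.1.0)" falls through the no-op branch and reaches the common lines, which is the behaviour the fix is after.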
Workaround (for users hit by this today)
Run once on the host, then rerun nemoclaw onboard:
sudo modprobe br_netfilter
echo "br_netfilter" | sudo tee /etc/modules-load.d/nemoclaw.conf
sudo tee /etc/sysctl.d/99-nemoclaw.conf <<EOF
net.bridge.bridge-nf-call-iptables=1
net.bridge.bridge-nf-call-ip6tables=1
EOF
sudo sysctl --system
Then if a previous partial sandbox exists: openshell sandbox delete <name> and rerun onboard.
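To confirm that the workaround will survive a reboot, the two persistence files can be checked for the expected entries. The sketch below is an assumption-laden helper, not something NemoClaw ships: the function name and the ROOT_DIR override are hypothetical, and the file paths are the ones used in the workaround above.

```shell
# Sketch: verify the persistence files written by the workaround.
# ROOT_DIR is a hypothetical prefix for testing against a scratch tree;
# leave it unset to inspect the real /etc.
verify_nemoclaw_persistence() {
  local root="${ROOT_DIR:-}"
  local modules_conf="$root/etc/modules-load.d/nemoclaw.conf"
  local sysctl_conf="$root/etc/sysctl.d/99-nemoclaw.conf"
  local status=0
  local key

  # The module name must appear on a line of its own.
  if ! grep -qx "br_netfilter" "$modules_conf" 2>/dev/null; then
    echo "missing: br_netfilter in $modules_conf"
    status=1
  fi

  # Both bridge sysctls must be set to 1.
  for key in net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables; do
    if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*1[[:space:]]*$" "$sysctl_conf" 2>/dev/null; then
      echo "missing: ${key}=1 in $sysctl_conf"
      status=1
    fi
  done

  if [ "$status" -eq 0 ]; then
    echo "persistence config looks complete"
  fi
  return "$status"
}
```

This only inspects files on disk; it does not tell you whether the module is loaded in the running kernel.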
Environment (continued)
ghcr.io/nvidia/openshell/cluster:0.0.32
main (tested on several recent tags up through the unreleased post-0.0.23 commits)
Reproduction
curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
nemoclaw onboard with any inference provider and model (repro'd with "Other OpenAI-compatible endpoint").
kube-proxy's iptables rules are present, but packets aren't hitting them for pod→service traffic. This is the classic br_netfilter-not-loaded signature.
br_netfilter is not loaded on the host, and loading the module immediately fixes the problem: after openshell sandbox delete + nemoclaw onboard, the full flow completes through [8/8].
One further observation from the full nemoclaw onboard log (clean-slate reinstall):
[1/8] passes, including the new container-DNS preflight from #2101 ([Jetson][Orin] Sandbox image build fails on RUN “npm ci && npm run build: npm” error Exit handler never called), which reports ✓ Container DNS resolution works at the host-bridge layer.