Skip to content

setup-jetson.sh skips br_netfilter on JetPack R39, breaking k3s service DNS and onboard [7/8] #2418

@yanyunl1991

Description

@yanyunl1991

Summary

On JetPack R39 (L4T 39.1) hosts, nemoclaw onboard completes the sandbox image build and gateway upload but fails at [7/8] Setting up OpenClaw inside sandbox with:

× status: FailedPrecondition, message: "sandbox is not ready"
Command failed (exit 1): openshell sandbox connect <sandbox-name>                                                                                                                                                                                                                                                              

Root cause: scripts/setup-jetson.sh early-returns on any L4T release ≥ 39 with the message "this version does not require any host setup", skipping the modprobe br_netfilter + sysctl net.bridge.bridge-nf-call-iptables=1 lines that live in configure_jetson_host(). Without br_netfilter loaded, Linux iptables
does not process traffic traversing a bridge interface, so the kube-proxy NAT rules written by k3s (running inside the OpenShell gateway container) never match pod→ClusterIP packets. Sandbox agent pods then fail to resolve openshell.openshell.svc.cluster.local via the cluster DNS service, crash on startup
(Temporary failure in name resolution), and nemoclaw deletes the never-Ready sandbox.

Environment

  • Jetson Orin AGX
  • JetPack R39 (L4T 39.1.0), kernel 6.8.12-1018-tegra (and later kernels; both tested)
  • Ubuntu 24.04.4 LTS
  • Docker 29.4.0
  • OpenShell CLI 0.0.32, gateway image ghcr.io/nvidia/openshell/cluster:0.0.32
  • NemoClaw from main (tested on several recent tags up through the unreleased post-0.0.23 commits)

Reproduction

  1. Fresh JetPack R39 Jetson (no prior NemoClaw install).
  2. curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash
  3. nemoclaw onboard with any inference provider and model (repro'd with "Other OpenAI-compatible endpoint").

Expected: onboard runs to completion, sandbox reaches Ready, dashboard comes up.

Actual: builds + uploads image fine. At [7/8] fails with "sandbox is not ready". kubectl into the gateway's k3s (below) shows the sandbox pod's main container in CrashLoopBackOff with DNS errors.

Diagnostic Chain

Sandbox pod agent container crash-loops on cluster DNS

$ sudo docker exec openshell-cluster-nemoclaw kubectl describe pod my-assistant -n openshell | tail -30
...                                                                                                                                                                                                                                                                                                                            
  agent:                                               
    State:          Waiting                                                                                                                                                                                                                                                                                                    
      Reason:       CrashLoopBackOff                   
    Last State:     Terminated                                                                                                                                                                                                                                                                                                 
      Reason:       Error                                                                                                                                                                                                                                                                                                      
      Exit Code:    1
    Restart Count:  156   # over 13 hours                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                               
$ sudo docker exec openshell-cluster-nemoclaw kubectl logs my-assistant -n openshell -c agent --previous --tail=20
openshell: log push connect failed: failed to connect to OpenShell server                                                                                                                                                                                                                                                      
Error:   × failed to connect to OpenShell server                                                                                                                                                                                                                                                                               
  ├─▶ transport error                                                                                                                                                                                                                                                                                                          
  ├─▶ dns error                                                                                                                                                                                                                                                                                                                
  ├─▶ dns error                                        
  ╰─▶ failed to lookup address information: Temporary failure in name resolution                                                                                                                                                                                                                                               

The agent's OPENSHELL_ENDPOINT is https://openshell.openshell.svc.cluster.local:8080 — resolves this via the cluster DNS service.

Service IP unreachable from new pods, pod IP works

Running nslookup from a fresh test pod in the same namespace, targeting CoreDNS explicitly:

# via CoreDNS POD IP (bypasses kube-proxy / service NAT):
$ kubectl run dns-pod --image=busybox -n openshell --rm -it --command -- \
    nslookup openshell.openshell.svc.cluster.local 10.42.0.4                                                                                                                                                                                                                                                                   
Server:     10.42.0.4                                                                                                                                                                                                                                                                                                          
Name:       openshell.openshell.svc.cluster.local                                                                                                                                                                                                                                                                              
Address:    10.43.225.49                # ← works                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                               
# via CoreDNS SERVICE IP (normal pod DNS path, via kube-proxy NAT):                                                                                                                                                                                                                                                            
$ kubectl run dns-svc --image=busybox -n openshell --rm -it --command -- \                                                                                                                                                                                                                                                     
    nslookup openshell.openshell.svc.cluster.local 10.43.0.10
;; connection timed out; no servers could be reached                                                                                                                                                                                                                                                                           

kube-proxy's iptables rules are present:

$ iptables -t nat -L KUBE-SERVICES | grep kube-dns     
KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere  10.43.0.10  /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain                                                                                                                                                                                                             
KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere  10.43.0.10  /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain                                                                                                                                                                                                         

…but packets aren't hitting them for pod→service traffic. Classic br_netfilter-not-loaded signature.

br_netfilter not loaded on host

$ lsmod | grep br_netfilter                                                                                                                                                                                                                                                                                                    
(empty)
                                                                                                                                                                                                                                                                                                                               
$ cat /proc/sys/net/bridge/bridge-nf-call-iptables                                                                                                                                                                                                                                                                             
cat: /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory

Loading the module immediately fixes it

$ sudo modprobe br_netfilter                                                                                                                                                                                                                                                                                                   
$ sudo sysctl -w net.bridge.bridge-nf-call-iptables=1  
$ sudo sysctl -w net.bridge.bridge-nf-call-ip6tables=1                                                                                                                                                                                                                                                                         
 
# retry the previously-failing query:                                                                                                                                                                                                                                                                                          
$ kubectl run dns-svc-retry --image=busybox -n openshell --rm -it --command -- \                                                                                                                                                                                                                                               
    nslookup openshell.openshell.svc.cluster.local 10.43.0.10                                                                                                                                                                                                                                                                  
Server:     10.43.0.10                                                                                                                                                                                                                                                                                                         
Name:       openshell.openshell.svc.cluster.local                                                                                                                                                                                                                                                                              
Address:    10.43.225.49                # ← now works                                                                                                                                                                                                                                                                          

After openshell sandbox delete + nemoclaw onboard the full flow completes through [8/8].

Full nemoclaw onboard log

Click to expand — complete onboard output (clean-slate reinstall) showing [7/8] failure after [1/8]–[6/8] succeed
$ node bin/nemoclaw.js onboard                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                               
  Third-Party Software Notice - NemoClaw Installer
  ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
  [notice accepted]                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                               
  NemoClaw Onboarding                                                                                                                                                                                                                                                                                                          
  ===================                                                                                                                                                                                                                                                                                                          
 
  [1/8] Preflight checks                                                                                                                                                                                                                                                                                                       
  ──────────────────────────────────────────────────   
  ✓ Docker is running                                                                                                                                                                                                                                                                                                          
  ✓ Container DNS resolution works
  ✓ Container runtime: docker                                                                                                                                                                                                                                                                                                  
  openshell CLI not found. Installing...                                                                                                                                                                                                                                                                                       
  ✓ openshell CLI: openshell 0.0.32
  ✓ Port 8080 available (OpenShell gateway)                                                                                                                                                                                                                                                                                    
  ✓ Port 18789 available (NemoClaw dashboard)                                                                                                                                                                                                                                                                                  
  ✓ NVIDIA GPU detected: 1 GPU(s), 62878 MB VRAM                                                                                                                                                                                                                                                                               
  ✓ Memory OK: 62878 MB RAM + 2047 MB swap                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                               
  [2/8] Starting OpenShell gateway                                                                                                                                                                                                                                                                                             
  ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
  Using pinned OpenShell gateway image: ghcr.io/nvidia/openshell/cluster:0.0.32                                                                                                                                                                                                                                                
  Starting gateway cluster...
  [… ~80s startup …]                                                                                                                                                                                                                                                                                                           
  Installing OpenShell components...                   
  Starting OpenShell gateway pod...                                                                                                                                                                                                                                                                                            
  ✓ Gateway is healthy                                 
  ✓ Active gateway set to 'nemoclaw'                                                                                                                                                                                                                                                                                           
 
  [3/8] Configuring inference (NIM)                                                                                                                                                                                                                                                                                            
  ──────────────────────────────────────────────────   
  [selected "Other OpenAI-compatible endpoint", model aws/anthropic/bedrock-claude-sonnet-4-6]
                                                                                                                                                                                                                                                                                                                               
  [4/8] Setting up inference provider
  ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
  ✓ Inference route set: compatible-endpoint / aws/anthropic/bedrock-claude-sonnet-4-6                                                                                                                                                                                                                                         
 
  [5/8] Messaging channels                                                                                                                                                                                                                                                                                                     
  ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
  Skipping messaging channels.
                                                                                                                                                                                                                                                                                                                               
  [6/8] Creating sandbox                               
  ──────────────────────────────────────────────────
  Sandbox name [my-assistant]:
  Creating sandbox 'my-assistant'...                                                                                                                                                                                                                                                                                           
  Pinning base image to sha256:edb8718c15e9...
  Building sandbox image...                                                                                                                                                                                                                                                                                                    
  Building image openshell/sandbox-from:1777005133 from /tmp/nemoclaw-build-ih6Mbu/Dockerfile                                                                                                                                                                                                                                  
  Step 1/52 : ARG BASE_IMAGE=ghcr.io/nvidia/nemoclaw/sandbox-base@sha256:edb8718c15e9…
  Step 2/52 : FROM node:22-slim@sha256:4f77a690…                                                                                                                                                                                                                                                                               
  [… all 52 Dockerfile steps complete successfully …]                                                                                                                                                                                                                                                                          
  Step 52/52 : CMD ["/bin/bash"]                                                                                                                                                                                                                                                                                               
  Built image openshell/sandbox-from:1777005133                                                                                                                                                                                                                                                                                
  Uploading image into OpenShell gateway...            
  Pushing image openshell/sandbox-from:1777005133 into gateway "nemoclaw"                                                                                                                                                                                                                                                      
  [progress] Exported 100 MiB                                                                                                                                                                                                                                                                                                  
  [progress] Exported 200 MiB
  [progress] Exported 300 MiB                                                                                                                                                                                                                                                                                                  
  [progress] Exported 400 MiB                          
  [progress] Exported 500 MiB                                                                                                                                                                                                                                                                                                  
  [progress] Exported 600 MiB
  [progress] Exported 700 MiB                                                                                                                                                                                                                                                                                                  
  [progress] Exported 724 MiB                          
  [progress] Uploaded to gateway                                                                                                                                                                                                                                                                                               
  Still uploading image into OpenShell gateway... (730s elapsed)
  [… ~50s more …]                                                                                                                                                                                                                                                                                                              
  Image openshell/sandbox-from:1777005133 is available in the gateway.
  Waiting for sandbox to become ready...                                                                                                                                                                                                                                                                                       
  Sandbox reported Ready before create stream exited; continuing.
  Waiting for sandbox to become ready...                                                                                                                                                                                                                                                                                       
  Waiting for NemoClaw dashboard to become ready...    
  Dashboard taking longer than expected to start. Continuing...                                                                                                                                                                                                                                                                
! No active forward found for port 18789               
! Port 18789 forward did not start — port may be in use by another process.                                                                                                                                                                                                                                                    
  Check: docker ps --format 'table {{.Names}}\t{{.Ports}}' | grep 18789
  Free the port, then reconnect: nemoclaw my-assistant connect                                                                                                                                                                                                                                                                 
  Setting up sandbox DNS proxy...                      
Setting up DNS proxy in pod 'my-assistant' (10.200.0.1:53 -> 10.42.0.4)...                                                                                                                                                                                                                                                     
Defaulted container "agent" out of: agent, workspace-init (init)                                                                                                                                                                                                                                                               
error: Internal error occurred: unable to upgrade connection: container not found ("agent")                                                                                                                                                                                                                                    
  ✓ Sandbox 'my-assistant' created                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                               
  [7/8] Setting up OpenClaw inside sandbox                                                                                                                                                                                                                                                                                     
  ──────────────────────────────────────────────────
Error:   × status: FailedPrecondition, message: "sandbox is not ready", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 24 Apr 2026 04:48:09 GMT"} }                                                                                                                          
                                                                                                                                                                                                                                                                                                                               
  Command failed (exit 1): /home/ubuntu/.local/bin/openshell sandbox connect my-assistant
  This error originated from the OpenShell runtime layer.                                                                                                                                                                                                                                                                      
  Docs: https://github.com/NVIDIA/OpenShell                                                                                                                                                                                                                                                                                    

Key observations from this log:

  • [1/8] passes (including the new container-DNS preflight from [Jetson][Orin]Sandbox image build fails on RUN “npm ci && npm run build: npm” error Exit handler never called #2101 which reports ✓ Container DNS resolution works at the host-bridge layer).
  • [2/8][6/8] succeed. The sandbox image builds and uploads (724 MiB).
  • The first symptom of the underlying bug appears during the Setting up sandbox DNS proxy step inside [6/8]:
    container not found ("agent") — kubectl can't exec into the agent container because it's been CrashLoopBackOff'd since the pod's init container finished (the workspace-init initContainer succeeds; the main agent container crashes on startup because it can't resolve openshell.openshell.svc.cluster.local).
  • [7/8] then fails with the external "sandbox is not ready" from openshell sandbox connect.

Root Cause (in code)

scripts/setup-jetson.sh:34-37:

if ((release >= 39)); then                                                                                                                                                                                                                                                                                                     
  info "Jetson detected (L4T $l4t_version) — this version does not require any host setup" >&2
  return 0                                                                                                                                                                                                                                                                                                                     
fi

Because get_jetpack_version() returns empty for R39, main() exits before calling configure_jetson_host(), where the common modprobe br_netfilter / sysctl / /etc/modules-load.d/nemoclaw.conf lines live (lines 127–132 of the same file). The JP6 and JP7-R38 branches reach those lines via
configure_jetson_host(); R39 doesn't.

The assumption that R39 requires no host setup appears to be incorrect once k3s inside the OpenShell gateway container is involved — the module + sysctl are needed regardless of the JetPack version for in-gateway ClusterIP service routing to work.

Proposed Fix

Two small edits to scripts/setup-jetson.sh:

  1. In get_jetpack_version(): drop the >= 39 early-return, add a 39.*) printf "%s" "jp7-r39" ;; case.
  2. In configure_jetson_host(): add a jp7-r39) ;; no-op branch (no iptables-legacy / daemon.json changes needed, same as R38) so the common modprobe + sysctl + persistence lines at the bottom of the function run.

Happy to send a PR with this.

Workaround (for users hit by this today)

Run once on the host, then rerun nemoclaw onboard:

sudo modprobe br_netfilter                             
echo "br_netfilter" | sudo tee /etc/modules-load.d/nemoclaw.conf
sudo tee /etc/sysctl.d/99-nemoclaw.conf <<EOF
net.bridge.bridge-nf-call-iptables=1                                                                                                                                                                                                                                                                                           
net.bridge.bridge-nf-call-ip6tables=1
EOF                                                                                                                                                                                                                                                                                                                            
sudo sysctl --system                                   

Then if a previous partial sandbox exists: openshell sandbox delete <name> and rerun onboard.

Related

Metadata

Metadata

Assignees

Labels

area: cliCommand line interface, flags, terminal UX, or outputarea: installInstall, setup, prerequisites, or uninstall flowarea: onboardingOnboarding FSM, provider setup, sandbox launch, or first-run flowneeds: triageAwaiting maintainer classificationplatform: jetsonAffects Jetson AGX Thor or Orin

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions