setup-jetson.sh skips br_netfilter on JetPack R39, breaking k3s service DNS and onboard [7/8]

  ## Summary                                                                                                                                                                                                                                                                                                                     
   
  On JetPack R39 (L4T 39.1) hosts, `nemoclaw onboard` completes the sandbox image build and gateway upload but fails at **[7/8] Setting up OpenClaw inside sandbox** with:                                                                                                                                                       
                                                         
  ```                                                                                                                                                                                                                                                                                                                            
  × status: FailedPrecondition, message: "sandbox is not ready"
  Command failed (exit 1): openshell sandbox connect <sandbox-name>                                                                                                                                                                                                                                                              
  ```
                                                                                                                                                                                                                                                                                                                                 
  Root cause: `scripts/setup-jetson.sh` early-returns on any L4T release ≥ 39 with the message "this version does not require any host setup", skipping the `modprobe br_netfilter` + `sysctl net.bridge.bridge-nf-call-iptables=1` lines that live in `configure_jetson_host()`. Without `br_netfilter` loaded, Linux iptables does **not** process traffic traversing a bridge interface, so the kube-proxy NAT rules written by k3s (running inside the OpenShell gateway container) never match pod→ClusterIP packets. Sandbox agent pods then fail to resolve `openshell.openshell.svc.cluster.local` via the cluster DNS service, crash on startup (`Temporary failure in name resolution`), and nemoclaw deletes the never-Ready sandbox.
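
One possible shape for the fix, as a minimal sketch: run the bridge netfilter setup on every release, before any version-based early return. The names `run`, `ensure_bridge_netfilter`, and `setup_host`, the `DRY_RUN` switch, and the way the L4T major version is passed in are all assumptions made for illustration; they are not taken from the actual `scripts/setup-jetson.sh`.

```shell
# Hypothetical helper: prints the command when DRY_RUN=1, otherwise runs it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: the two host settings named in the root cause above.
ensure_bridge_netfilter() {
  run modprobe br_netfilter
  run sysctl -w net.bridge.bridge-nf-call-iptables=1
}

# Hypothetical entry point; $1 is the L4T major release (for example 36 or 39).
setup_host() {
  local l4t_major="$1"

  # Runs unconditionally, so the early return below can no longer skip it.
  ensure_bridge_netfilter

  if [ "$l4t_major" -ge 39 ]; then
    echo "no further host setup required for L4T $l4t_major"
    return 0
  fi

  echo "running remaining legacy host setup for L4T $l4t_major"
}
```

With `DRY_RUN=1`, calling `setup_host 39` prints the `modprobe` and `sysctl` commands before the early-return message, which makes the ordering easy to check without root.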
                                                         
  ## Environment

  - Jetson Orin AGX
  - JetPack R39 (L4T 39.1.0), kernel `6.8.12-1018-tegra` (and later kernels; both tested)
  - Ubuntu 24.04.4 LTS                                                                                                                                                                                                                                                                                                           
  - Docker 29.4.0
  - OpenShell CLI 0.0.32, gateway image `ghcr.io/nvidia/openshell/cluster:0.0.32`                                                                                                                                                                                                                                                
  - NemoClaw from `main` (tested on several recent tags up through the unreleased post-0.0.23 commits)                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                 
  ## Reproduction                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                 
  1. Fresh JetPack R39 Jetson (no prior NemoClaw install).                                                                                                                                                                                                                                                                       
  2. `curl -fsSL https://www.nvidia.com/nemoclaw.sh | bash`
  3. `nemoclaw onboard` with any inference provider and model (repro'd with "Other OpenAI-compatible endpoint").                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                 
  Expected: onboard runs to completion, sandbox reaches Ready, dashboard comes up.                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                                 
  Actual: the image builds and uploads fine, but [7/8] fails with `"sandbox is not ready"`. Running `kubectl` inside the gateway's k3s (below) shows the sandbox pod's main container in `CrashLoopBackOff` with DNS errors.
                                                         
  ## Diagnostic Chain                                                                                                                                                                                                                                                                                                            
                                                         
  ### Sandbox pod agent container crash-loops on cluster DNS                                                                                                                                                                                                                                                                     
   
  ```                                                                                                                                                                                                                                                                                                                            
  $ sudo docker exec openshell-cluster-nemoclaw kubectl describe pod my-assistant -n openshell | tail -30
  ...                                                                                                                                                                                                                                                                                                                            
    agent:                                               
      State:          Waiting                                                                                                                                                                                                                                                                                                    
        Reason:       CrashLoopBackOff                   
      Last State:     Terminated                                                                                                                                                                                                                                                                                                 
        Reason:       Error                                                                                                                                                                                                                                                                                                      
        Exit Code:    1
      Restart Count:  156   # over 13 hours                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                 
  $ sudo docker exec openshell-cluster-nemoclaw kubectl logs my-assistant -n openshell -c agent --previous --tail=20
  openshell: log push connect failed: failed to connect to OpenShell server                                                                                                                                                                                                                                                      
  Error:   × failed to connect to OpenShell server                                                                                                                                                                                                                                                                               
    ├─▶ transport error                                                                                                                                                                                                                                                                                                          
    ├─▶ dns error                                                                                                                                                                                                                                                                                                                
    ├─▶ dns error                                        
    ╰─▶ failed to lookup address information: Temporary failure in name resolution                                                                                                                                                                                                                                               
  ```                                                                                                                                                                                                                                                                                                                            
   
  The agent's `OPENSHELL_ENDPOINT` is `https://openshell.openshell.svc.cluster.local:8080`, and the agent resolves this hostname via the cluster DNS service.
                                                         
  ### Service IP unreachable from new pods, pod IP works                                                                                                                                                                                                                                                                         
                                                         
  Running `nslookup` from a fresh test pod in the same namespace, targeting CoreDNS explicitly:                                                                                                                                                                                                                                  
                                                         
  ```                                                                                                                                                                                                                                                                                                                            
  # via CoreDNS POD IP (bypasses kube-proxy / service NAT):
  $ kubectl run dns-pod --image=busybox -n openshell --rm -it --command -- \
      nslookup openshell.openshell.svc.cluster.local 10.42.0.4                                                                                                                                                                                                                                                                   
  Server:     10.42.0.4                                                                                                                                                                                                                                                                                                          
  Name:       openshell.openshell.svc.cluster.local                                                                                                                                                                                                                                                                              
  Address:    10.43.225.49                # ← works                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                 
  # via CoreDNS SERVICE IP (normal pod DNS path, via kube-proxy NAT):                                                                                                                                                                                                                                                            
  $ kubectl run dns-svc --image=busybox -n openshell --rm -it --command -- \                                                                                                                                                                                                                                                     
      nslookup openshell.openshell.svc.cluster.local 10.43.0.10
  ;; connection timed out; no servers could be reached                                                                                                                                                                                                                                                                           
  ```                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                 
  kube-proxy's iptables rules are present:                                                                                                                                                                                                                                                                                       
                                                         
  ```                                                                                                                                                                                                                                                                                                                            
  $ iptables -t nat -L KUBE-SERVICES | grep kube-dns     
  KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  anywhere  10.43.0.10  /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain                                                                                                                                                                                                             
  KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  anywhere  10.43.0.10  /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain                                                                                                                                                                                                         
  ```                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                 
  …but they are not taking effect for pod→service traffic. This is the classic signature of `br_netfilter` not being loaded.
                                                         
  ### `br_netfilter` not loaded on host                                                                                                                                                                                                                                                                                          
                                                         
  ```
  $ lsmod | grep br_netfilter                                                                                                                                                                                                                                                                                                    
  (empty)
                                                                                                                                                                                                                                                                                                                                 
  $ cat /proc/sys/net/bridge/bridge-nf-call-iptables                                                                                                                                                                                                                                                                             
  cat: /proc/sys/net/bridge/bridge-nf-call-iptables: No such file or directory
  ```                                                                                                                                                                                                                                                                                                                            
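
The two checks above can be folded into one helper, sketched here. The function name `check_bridge_netfilter` and the `PROC_ROOT` override (which exists only so the check can be exercised against a fake `/proc` tree) are illustrative and not part of NemoClaw or OpenShell.

```shell
# Hypothetical helper: reports whether bridged traffic is handed to iptables.
# PROC_ROOT defaults to /proc; overriding it is only useful for testing.
check_bridge_netfilter() {
  local proc_root="${PROC_ROOT:-/proc}"
  local knob="$proc_root/sys/net/bridge/bridge-nf-call-iptables"

  # The sysctl file only exists once br_netfilter is loaded (or built in).
  if [ ! -r "$knob" ]; then
    echo "br_netfilter not loaded"
    return 1
  fi

  # The module can be loaded while the setting is still switched off.
  if [ "$(cat "$knob")" != "1" ]; then
    echo "bridge-nf-call-iptables disabled"
    return 1
  fi

  echo "ok"
}
```

The non-zero return code on failure lets a preflight step call the function directly in an `if` and stop before the sandbox is created.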
                                                         
  ### Loading the module immediately fixes it                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                 
  ```
  $ sudo modprobe br_netfilter                                                                                                                                                                                                                                                                                                   
  $ sudo sysctl -w net.bridge.bridge-nf-call-iptables=1  
  $ sudo sysctl -w net.bridge.bridge-nf-call-ip6tables=1                                                                                                                                                                                                                                                                         
   
  # retry the previously-failing query:                                                                                                                                                                                                                                                                                          
  $ kubectl run dns-svc-retry --image=busybox -n openshell --rm -it --command -- \                                                                                                                                                                                                                                               
      nslookup openshell.openshell.svc.cluster.local 10.43.0.10                                                                                                                                                                                                                                                                  
  Server:     10.43.0.10                                                                                                                                                                                                                                                                                                         
  Name:       openshell.openshell.svc.cluster.local                                                                                                                                                                                                                                                                              
  Address:    10.43.225.49                # ← now works                                                                                                                                                                                                                                                                          
  ```                                                                                                                                                                                                                                                                                                                            
   
  After `openshell sandbox delete` followed by `nemoclaw onboard`, the full flow completes through [8/8].
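
`modprobe` and `sysctl -w` only change the running kernel, so this workaround is lost on reboot. A minimal sketch for persisting it through the standard `modules-load.d` and `sysctl.d` directories follows. The function name `persist_bridge_netfilter`, the file name `99-nemoclaw-bridge.conf`, and the `ROOT` prefix (present only so the function can be tried against a scratch directory) are assumptions, not taken from the NemoClaw scripts.

```shell
# Hypothetical helper: writes boot-time configuration for br_netfilter.
# ROOT defaults to "" (the real filesystem, which needs root privileges);
# point it at a scratch directory to preview the files instead.
persist_bridge_netfilter() {
  local root="${ROOT:-}"
  mkdir -p "$root/etc/modules-load.d" "$root/etc/sysctl.d"

  # Modules listed under modules-load.d are loaded at boot by systemd.
  printf '%s\n' 'br_netfilter' \
    > "$root/etc/modules-load.d/99-nemoclaw-bridge.conf"

  # Settings under sysctl.d are applied at boot, or on demand with `sysctl --system`.
  printf '%s\n' \
    'net.bridge.bridge-nf-call-iptables = 1' \
    'net.bridge.bridge-nf-call-ip6tables = 1' \
    > "$root/etc/sysctl.d/99-nemoclaw-bridge.conf"
}
```

Running it with `ROOT` set to a temporary directory writes both files under that directory's `etc/` for inspection. On a real host the function needs root, and `sudo sysctl --system` applies the settings without a reboot once the module is loaded.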
                                                                                                                                                                                                                                                                                                                                 
  ## Full `nemoclaw onboard` log
                                                                                                                                                                                                                                                                                                                                 
  <details>                                              
  <summary>Click to expand — complete onboard output (clean-slate reinstall) showing [7/8] failure after [1/8]–[6/8] succeed</summary>                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                                 
  ```
  $ node bin/nemoclaw.js onboard                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                 
    Third-Party Software Notice - NemoClaw Installer
    ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
    [notice accepted]                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                 
    NemoClaw Onboarding                                                                                                                                                                                                                                                                                                          
    ===================                                                                                                                                                                                                                                                                                                          
   
    [1/8] Preflight checks                                                                                                                                                                                                                                                                                                       
    ──────────────────────────────────────────────────   
    ✓ Docker is running                                                                                                                                                                                                                                                                                                          
    ✓ Container DNS resolution works
    ✓ Container runtime: docker                                                                                                                                                                                                                                                                                                  
    openshell CLI not found. Installing...                                                                                                                                                                                                                                                                                       
    ✓ openshell CLI: openshell 0.0.32
    ✓ Port 8080 available (OpenShell gateway)                                                                                                                                                                                                                                                                                    
    ✓ Port 18789 available (NemoClaw dashboard)                                                                                                                                                                                                                                                                                  
    ✓ NVIDIA GPU detected: 1 GPU(s), 62878 MB VRAM                                                                                                                                                                                                                                                                               
    ✓ Memory OK: 62878 MB RAM + 2047 MB swap                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                 
    [2/8] Starting OpenShell gateway                                                                                                                                                                                                                                                                                             
    ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
    Using pinned OpenShell gateway image: ghcr.io/nvidia/openshell/cluster:0.0.32                                                                                                                                                                                                                                                
    Starting gateway cluster...
    [… ~80s startup …]                                                                                                                                                                                                                                                                                                           
    Installing OpenShell components...                   
    Starting OpenShell gateway pod...                                                                                                                                                                                                                                                                                            
    ✓ Gateway is healthy                                 
    ✓ Active gateway set to 'nemoclaw'                                                                                                                                                                                                                                                                                           
   
    [3/8] Configuring inference (NIM)                                                                                                                                                                                                                                                                                            
    ──────────────────────────────────────────────────   
    [selected "Other OpenAI-compatible endpoint", model aws/anthropic/bedrock-claude-sonnet-4-6]
                                                                                                                                                                                                                                                                                                                                 
    [4/8] Setting up inference provider
    ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
    ✓ Inference route set: compatible-endpoint / aws/anthropic/bedrock-claude-sonnet-4-6                                                                                                                                                                                                                                         
   
    [5/8] Messaging channels                                                                                                                                                                                                                                                                                                     
    ──────────────────────────────────────────────────                                                                                                                                                                                                                                                                           
    Skipping messaging channels.
                                                                                                                                                                                                                                                                                                                                 
    [6/8] Creating sandbox                               
    ──────────────────────────────────────────────────
    Sandbox name [my-assistant]:
    Creating sandbox 'my-assistant'...                                                                                                                                                                                                                                                                                           
    Pinning base image to sha256:edb8718c15e9...
    Building sandbox image...                                                                                                                                                                                                                                                                                                    
    Building image openshell/sandbox-from:1777005133 from /tmp/nemoclaw-build-ih6Mbu/Dockerfile                                                                                                                                                                                                                                  
    Step 1/52 : ARG BASE_IMAGE=ghcr.io/nvidia/nemoclaw/sandbox-base@sha256:edb8718c15e9…
    Step 2/52 : FROM node:22-slim@sha256:4f77a690…                                                                                                                                                                                                                                                                               
    [… all 52 Dockerfile steps complete successfully …]                                                                                                                                                                                                                                                                          
    Step 52/52 : CMD ["/bin/bash"]                                                                                                                                                                                                                                                                                               
    Built image openshell/sandbox-from:1777005133                                                                                                                                                                                                                                                                                
    Uploading image into OpenShell gateway...            
    Pushing image openshell/sandbox-from:1777005133 into gateway "nemoclaw"                                                                                                                                                                                                                                                      
    [progress] Exported 100 MiB                                                                                                                                                                                                                                                                                                  
    [progress] Exported 200 MiB
    [progress] Exported 300 MiB                                                                                                                                                                                                                                                                                                  
    [progress] Exported 400 MiB                          
    [progress] Exported 500 MiB                                                                                                                                                                                                                                                                                                  
    [progress] Exported 600 MiB
    [progress] Exported 700 MiB                                                                                                                                                                                                                                                                                                  
    [progress] Exported 724 MiB                          
    [progress] Uploaded to gateway                                                                                                                                                                                                                                                                                               
    Still uploading image into OpenShell gateway... (730s elapsed)
    [… ~50s more …]                                                                                                                                                                                                                                                                                                              
    Image openshell/sandbox-from:1777005133 is available in the gateway.
    Waiting for sandbox to become ready...                                                                                                                                                                                                                                                                                       
    Sandbox reported Ready before create stream exited; continuing.
    Waiting for sandbox to become ready...                                                                                                                                                                                                                                                                                       
    Waiting for NemoClaw dashboard to become ready...    
    Dashboard taking longer than expected to start. Continuing...                                                                                                                                                                                                                                                                
  ! No active forward found for port 18789               
  ! Port 18789 forward did not start — port may be in use by another process.                                                                                                                                                                                                                                                    
    Check: docker ps --format 'table {{.Names}}\t{{.Ports}}' | grep 18789
    Free the port, then reconnect: nemoclaw my-assistant connect                                                                                                                                                                                                                                                                 
    Setting up sandbox DNS proxy...                      
  Setting up DNS proxy in pod 'my-assistant' (10.200.0.1:53 -> 10.42.0.4)...                                                                                                                                                                                                                                                     
  Defaulted container "agent" out of: agent, workspace-init (init)                                                                                                                                                                                                                                                               
  error: Internal error occurred: unable to upgrade connection: container not found ("agent")                                                                                                                                                                                                                                    
    ✓ Sandbox 'my-assistant' created                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                 
    [7/8] Setting up OpenClaw inside sandbox                                                                                                                                                                                                                                                                                     
    ──────────────────────────────────────────────────
  Error:   × status: FailedPrecondition, message: "sandbox is not ready", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 24 Apr 2026 04:48:09 GMT"} }                                                                                                                          
                                                                                                                                                                                                                                                                                                                                 
    Command failed (exit 1): /home/ubuntu/.local/bin/openshell sandbox connect my-assistant
    This error originated from the OpenShell runtime layer.                                                                                                                                                                                                                                                                      
    Docs: https://github.com/NVIDIA/OpenShell                                                                                                                                                                                                                                                                                    
  ```
                                                                                                                                                                                                                                                                                                                                 
  Key observations from this log:                        

  - `[1/8]` passes (including the new container-DNS preflight from #2101 which reports `✓ Container DNS resolution works` at the host-bridge layer).
  - `[2/8]`–`[6/8]` succeed. The sandbox image builds and uploads (724 MiB).                                                                                                                                                                                                                                                     
  - The first symptom of the underlying bug appears during the **Setting up sandbox DNS proxy** step inside `[6/8]`:                                                                                                                                                                                                             
    `container not found ("agent")`. kubectl can't exec into the `agent` container because it has been in CrashLoopBackOff since the pod's init container finished: the `workspace-init` initContainer succeeds, but the main `agent` container crashes on startup because it can't resolve `openshell.openshell.svc.cluster.local`.
  - `[7/8]` then fails with the external `"sandbox is not ready"` from `openshell sandbox connect`.                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                 
  </details>                                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                                 
  ## Root Cause (in code)                                

  `scripts/setup-jetson.sh:34-37`:
                                                                                                                                                                                                                                                                                                                                 
  ```bash
  if ((release >= 39)); then                                                                                                                                                                                                                                                                                                     
    info "Jetson detected (L4T $l4t_version) — this version does not require any host setup" >&2
    return 0                                                                                                                                                                                                                                                                                                                     
  fi
  ```                                                                                                                                                                                                                                                                                                                            
                                                         
  Because `get_jetpack_version()` returns empty for R39, `main()` exits before calling `configure_jetson_host()`, where the common `modprobe br_netfilter` / `sysctl` / `/etc/modules-load.d/nemoclaw.conf` lines live (lines 127–132 of the same file). The JP6 and JP7-R38 branches reach those lines via
  `configure_jetson_host()`; R39 doesn't.                                                                                                                                                                                                                                                                                        
   
  The assumption that R39 requires no host setup appears to be incorrect once k3s inside the OpenShell gateway container is involved — the module + sysctl are needed regardless of the JetPack version for in-gateway ClusterIP service routing to work.                                                                        
                                                         
  ## Proposed Fix                                                                                                                                                                                                                                                                                                                
                                                         
  Two small edits to `scripts/setup-jetson.sh`:                                                                                                                                                                                                                                                                                  
                                                         
  1. In `get_jetpack_version()`: drop the `>= 39` early-return, add a `39.*) printf "%s" "jp7-r39" ;;` case.                                                                                                                                                                                                                     
  2. In `configure_jetson_host()`: add a `jp7-r39) ;;` no-op branch (no iptables-legacy / daemon.json changes needed, same as R38) so the common `modprobe` + sysctl + persistence lines at the bottom of the function run.
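
  A rough sketch of what the two edits amount to, written as standalone helpers rather than a patch. Only the `jp7-r39` label and the `39.*` case come from the proposal above; the function names `jetpack_label_for` and `needs_bridge_netfilter` and the `jp6` / `jp7-r38` labels and their version patterns are assumptions for illustration and may not match the real script.

  ```bash
  #!/usr/bin/env bash
  # Sketch only: names and labels other than "jp7-r39" are assumptions,
  # not copied from scripts/setup-jetson.sh.

  # Map an L4T version string (e.g. "39.1") to a platform label.
  # No ">= 39" early return: R39 gets its own label instead.
  jetpack_label_for() {
    local l4t_version="$1"
    case "$l4t_version" in
      36.*) printf "%s" "jp6" ;;      # assumed label for JetPack 6
      38.*) printf "%s" "jp7-r38" ;;  # assumed label for JetPack 7 / R38
      39.*) printf "%s" "jp7-r39" ;;  # new case from the proposed fix
      *) printf "%s" "" ;;            # unknown release: empty label
    esac
  }

  # Report whether the shared br_netfilter + sysctl setup should run.
  # With the fix, R39 falls through to it just like the other labels.
  needs_bridge_netfilter() {
    case "$1" in
      jp6 | jp7-r38 | jp7-r39) return 0 ;;
      *) return 1 ;;
    esac
  }
  ```

  With this shape, `jetpack_label_for 39.1` yields a non-empty label, so the caller no longer bails out before the common `modprobe` / sysctl lines.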
                                                                                                                                                                                                                                                                                                                                 
  Happy to send a PR with this.                                                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                                                                                 
  ## Workaround (for users hit by this today)                                                                                                                                                                                                                                                                                    
                                                         
  Run once on the host, then rerun `nemoclaw onboard`:                                                                                                                                                                                                                                                                           
                                                         
  ```bash                                                                                                                                                                                                                                                                                                                        
  sudo modprobe br_netfilter                             
  echo "br_netfilter" | sudo tee /etc/modules-load.d/nemoclaw.conf
  sudo tee /etc/sysctl.d/99-nemoclaw.conf <<EOF
  net.bridge.bridge-nf-call-iptables=1                                                                                                                                                                                                                                                                                           
  net.bridge.bridge-nf-call-ip6tables=1
  EOF                                                                                                                                                                                                                                                                                                                            
  sudo sysctl --system                                   
  ```                                                                                                                                                                                                                                                                                                                            
                                                         
  If a partial sandbox from a previous attempt exists, delete it with `openshell sandbox delete <name>` before rerunning onboard.
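
  To confirm the workaround took effect before rerunning onboard, a small check along these lines may help. It is a sketch, not part of NemoClaw: the function name `check_bridge_netfilter` is made up here, and it assumes `br_netfilter` is built as a loadable module (a kernel with it built in would not list it in `/proc/modules`).

  ```bash
  #!/usr/bin/env bash
  # Hypothetical helper: verifies the module is loaded and the sysctl is 1.
  # Both paths can be overridden, which keeps the check easy to exercise.
  check_bridge_netfilter() {
    local modules_file="${1:-/proc/modules}"
    local sysctl_file="${2:-/proc/sys/net/bridge/bridge-nf-call-iptables}"

    # /proc/modules lists one module per line, name first.
    if ! grep -q '^br_netfilter ' "$modules_file"; then
      echo "br_netfilter not loaded"
      return 1
    fi

    # The sysctl file only exists once the module is loaded.
    if [ "$(cat "$sysctl_file" 2>/dev/null)" != "1" ]; then
      echo "bridge-nf-call-iptables is not 1"
      return 1
    fi

    echo "ok"
  }
  ```

  Calling `check_bridge_netfilter` with no arguments reads the live kernel state; it prints `ok` and returns 0 only when both conditions hold.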
   
  ## Related                                                                                                                                                                                                                                                                                                                     
                                                         
  - Discovered while validating #2101 (container-DNS preflight). That PR fixes the build-layer DNS; this bug is a separate, host-layer DNS routing issue that surfaces after the build layer is unblocked. 
