Skip to content

feat(preflight): probe container DNS and surface corp-firewall workaround (#2101)#2349

Merged
ericksoa merged 14 commits into
mainfrom
fix/preflight-dns-probe-2101
Apr 24, 2026
Merged

feat(preflight): probe container DNS and surface corp-firewall workaround (#2101)#2349
ericksoa merged 14 commits into
mainfrom
fix/preflight-dns-probe-2101

Conversation

@yanyunl1991

@yanyunl1991 yanyunl1991 commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Catch the #2101 failure mode (npm ci hangs ~15 min inside the sandbox build then prints the cryptic Exit handler never called) at preflight — before the user ever starts the build.

Root cause: DNS resolution from inside a docker container fails on corp networks that block outbound UDP:53 to public resolvers. systemd-resolved on the host bypasses this via DNS-over-TLS (TCP:853), but the container's /etc/resolv.conf gets docker's fallback nameservers (1.1.1.1 / 8.8.8.8) and queries them
over plain UDP:53 — which the firewall drops. After ~15 min of libuv retries, npm gives up but lingering handles keep the event loop alive past its exit hook, producing the cryptic error.

Preflight now runs a ~5-second probe and, on a confirmed DNS failure, surfaces a targeted error with platform- and runtime-aware workarounds — converting a 15-minute silent hang into a 5-second actionable message.

Related Issue

Fixes #2101.

Changes

  • src/lib/preflight.ts (+ ~130 lines)
    • probeContainerDns() — runs docker run --rm --pull=missing busybox nslookup registry.npmjs.org, parses the output, and returns a structured reason: ok / servers_unreachable / resolution_failed / image_pull_failed / no_output / error. Bounded by Node-level timeout: 20_000 on the runCapture call
      (portable; no dependency on a host timeout binary). Whitespace-only output is treated as empty.
    • getDockerBridgeGatewayIp() — reads docker network inspect bridge and extracts the first IPv4 gateway (handles dual-stack bridges whose {{range .IPAM.Config}}{{.Gateway}}{{end}} template concatenates IPv4+IPv6 with no separator). Each octet is validated 0–255 to reject garbage.
  • src/lib/onboard.ts (+ ~120 lines in preflight())
    • Wires the probe in right after the Docker-reachable check.
    • Only servers_unreachable and resolution_failed are fatal (the probe actually observed DNS failing); image_pull_failed / no_output / error warn and proceed (they're inconclusive — they don't prove container DNS is broken).
    • The fix hint is platform- and runtime-aware:
      • Linux (systemd-resolved): DNSStubListenerExtra=<bridge-ip> in /etc/systemd/resolved.conf.d/, then a jq-safe merge into /etc/docker/daemon.json (backs up the existing file first with a timestamp suffix; falls back to a fresh file only when none existed or jq isn't available).
      • macOS + Colima: colima stop && colima start --dns <corp-dns-ip>.
      • macOS + Docker Desktop: edit ~/.docker/daemon.json via jq, restart Docker via osascript -e 'quit app "Docker"' && open -a Docker.
      • macOS + unknown runtime: both CLI paths plus a pointer for Rancher Desktop / Podman.
      • Windows / WSL: Docker Desktop GUI plus an inlined Linux-native-docker-in-WSL fallback (also using the detected bridge IP).
      • Other (BSD etc.): generic daemon.json hint.
    • Every path ends at a verification step: docker run --rm busybox nslookup registry.npmjs.org.
  • src/lib/preflight.test.ts (+ ~170 lines; 16 new cases)
    • Probe: ok, resolution_failed (NXDOMAIN), servers_unreachable (UDP:53 block), image_pull_failed (two shapes), no_output for empty / null / whitespace-only output, thrown errors, the 400-byte detail truncation, and cross-platform timeout via Node.
    • Bridge-IP parser: happy path, non-default IP, empty / null / garbage / IPv6 / partial output, IPv4-first and IPv6-first dual-stack concatenations, first-of-many selection, and runCapture-throws.

Zero changes to bin/nemoclaw.js — the file is on the ts-migration guard's runtimeMoves list, and the new code deliberately lives outside the CLI runtime path so scripts/check-legacy-migrated-paths.ts passes clean.

What the user sees on failure (Linux, servers_unreachable)

[1/8] Preflight checks                                 
  ✓ Docker is running
  ✗ DNS resolution from inside a docker container failed.
    ;; connection timed out; no servers could be reached                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                               
  The sandbox build runs `npm ci` inside a container and needs to resolve                                                                                                                                                                                                                                                      
  registry.npmjs.org. On networks that block outbound UDP:53 to public DNS                                                                                                                                                                                                                                                     
  (common in corporate environments that force DNS-over-TLS on the host),                                                                                                                                                                                                                                                      
  the build appears to hang for ~15 minutes and then prints the cryptic                                                                                                                                                                                                                                                        
  `npm error Exit handler never called`. See issue #2101.                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                               
  Fix options:                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                               
  1. Make systemd-resolved reachable from containers (recommended):
       sudo mkdir -p /etc/systemd/resolved.conf.d/                                                                                                                                                                                                                                                                             
       printf '[Resolve]\nDNSStubListenerExtra=172.17.0.1\n' | sudo tee /etc/systemd/resolved.conf.d/docker-bridge.conf                                                                                                                                                                                                        
       sudo systemctl restart systemd-resolved                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                               
     Then add the dns key to /etc/docker/daemon.json (safely merges with existing config if jq is installed):                                                                                                                                                                                                                  
       sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak-$(date +%s) 2>/dev/null                                                                                                                                                                                                                                     
       { sudo jq '. + {"dns":["172.17.0.1"]}' /etc/docker/daemon.json 2>/dev/null || echo '{"dns":["172.17.0.1"]}'; } | sudo tee /etc/docker/daemon.json.new >/dev/null                                                                                                                                                        
       sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json                                                                                                                                                                                                                                                             
       sudo systemctl restart docker                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                               
  2. Configure an explicit UDP:53-capable DNS in /etc/docker/daemon.json
     (ask your IT team for an internal DNS server IP).                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                               
  Verify the fix worked:                                                                                                                                                                                                                                                                                                       
    docker run --rm busybox nslookup registry.npmjs.org                                                                                                                                                                                                                                                                        

macOS (Docker Desktop / Colima), Windows/WSL (Docker Desktop + WSL-native-docker fallback), and an "other platforms" generic branch print runtime-appropriate variants of the above — no pointer to a GUI unless there's no CLI alternative.

Why just the preflight probe, not also an opt-in --network=host for the build

An earlier iteration of this branch also added a NEMOCLAW_DOCKER_BUILD_NETWORK=host escape hatch: when set, NemoClaw would docker build --network=host locally and hand openshell the pre-built image via --from <tag> so the npm ci step could bypass the broken UDP:53 path. Testing on a real corp-firewalled Jetson
showed this is a structural dead-end under openshell's current CLI contract:

  • openshell sandbox create --from <Dockerfile> → builds the image and pushes it into the gateway cluster.
  • openshell sandbox create --from <image-tag> → treats the tag as already registry-reachable and skips the push.

A locally-built image reference satisfies neither: the gateway can't pull it from any registry, the sandbox pod fails to start, nemoclaw deletes the not-Ready sandbox, and the user ends up worse off than before (build succeeds but onboard doesn't complete). There's no openshell image push / image load command to
bridge the gap short of docker save | docker exec gateway ctr images import, which reaches into openshell internals and breaks on version bumps. A proper fix belongs upstream in openshell (e.g. a --build-network flag that applies during its own internal build).

The host-side remediation this PR surfaces is also strictly more complete: it covers every docker container on the host (gateway, sandbox, ad-hoc docker runs), not just the single build invocation that an opt-in env-var path would reach.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes — not run locally: prek binary download returns HTTP 400 in my sandbox environment; relying on CI to validate.
  • npm test — 60/60 tests in src/lib/preflight.test.ts pass.
  • Tests added or updated for new or changed behavior.
  • No secrets, API keys, or credentials committed.
  • Docs updated for user-facing behavior changes — the new preflight error text is the user-facing surface; no separate docs page needed.

Additional local checks:

  • npm run build:cli — clean.
  • npm run typecheck / npm run typecheck:cli — clean.
  • npm run lint — clean for the three changed files.
  • npm run ts-migration:guard -- --base origin/main --head HEADNo edits to migrated legacy paths or removed compatibility shims detected.
  • End-to-end reproduction: on a Jetson Orin (JetPack R39 / kernel 6.8.12-1018-tegra / Docker 29.4.0 / Node 22.22.2 / npm 10.9.7) attached to the NVIDIA corp network, nemoclaw onboard on main hangs at RUN npm ci && npm run build for 922 s and fails with Exit handler never called. Live strace on the hung
    libuv worker shows it cycling sendto(..., registry.npmjs.org ..., MSG_NOSIGNAL) / ppoll(timeout=5s) / close / socket / connect(1.1.1.1:53). With this PR, preflight instead prints the error above in ~5 seconds. Applying option 1's snippet lets npm ci complete in ~5 s and the sandbox image build finishes normally.

AI Disclosure

  • AI-assisted — tool: Claude Code

Summary by CodeRabbit

  • New Features

    • Onboarding now performs a container DNS connectivity check, provides targeted diagnostics with platform-aware remediation tips when DNS fails, and halts onboarding immediately for fatal DNS issues.
    • Detects container network gateway to tailor troubleshooting guidance.
  • Tests

    • Added unit tests covering DNS probing, gateway detection, failure classification, command behavior, and diagnostic output trimming.

Signed-off-by: Yanyun Liao yanyunl@nvidia.com

yanyunl1991 and others added 5 commits April 23, 2026 18:25
…de (#2101)

Addresses #2101. When the sandbox build's `npm ci` step runs inside a
docker container on a network that blocks outbound UDP:53 to public
DNS (common in corporate environments using DNS-over-TLS on the host),
npm hangs for ~15 minutes trying to resolve `registry.npmjs.org` and
then prints the cryptic `Exit handler never called`.

Two changes, both additive and off-path for normal users:

1. **Preflight container-DNS probe** (`src/lib/preflight.ts`).
   New `probeContainerDns()` runs `docker run --rm --pull=missing
   busybox nslookup registry.npmjs.org` and parses the output. Called
   from `preflight()` in `onboard.ts` right after the Docker reachable
   check. On failure, prints a targeted error listing three fix
   options (host-networking opt-in env var, persistent host config,
   explicit DNS in docker daemon.json) and exits — 10-second clear
   error instead of a 15-minute cryptic hang. Skipped when the user
   has already set NEMOCLAW_DOCKER_BUILD_NETWORK (signals they know
   about the issue).

2. **`NEMOCLAW_DOCKER_BUILD_NETWORK` env var** (`src/lib/onboard.ts`,
   `createSandbox`). `openshell sandbox create --from <Dockerfile>`
   has no way to pass `--network` through to `docker build`. When the
   env var is set, pre-build the image locally with
   `docker build --network=<value>` and then hand the resulting image
   tag to openshell via `--from <tag>` instead of `--from <Dockerfile>`.
   Verified end-to-end on a Jetson Orin where npm ci hangs in default
   bridge mode but succeeds in ~8 s with `--network=host`.

Tests: 8 new vitest cases in `src/lib/preflight.test.ts` cover the
probe's success / resolution-failed / no-output / command-override /
throw-handling / detail-truncation branches. 44/44 preflight tests
pass; typecheck clean; guard clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…x hint

The preflight DNS error previously hardcoded 172.17.0.1 in the
persistent fix snippet, but docker's default bridge gateway is
configurable (daemon.json `bip`, custom bridges, etc.) and can land on
any RFC1918 address. On hosts where the detected IP differs from the
default, the old hint would silently give the wrong config.

Add `getDockerBridgeGatewayIp()` that runs `docker network inspect
bridge` and parses the gateway. Use it in the preflight error and
handle three cases:
  - matches 172.17.0.1 (default): no note, keep hint concise.
  - differs from default: show `(detected your docker bridge gateway
    at X)` so the user sees what we substituted.
  - detection fails (docker returns nothing, parse fails, etc.): show
    `(could not auto-detect; verify with: docker network inspect
    bridge ...)` so the user knows the default is a guess.

6 additional unit tests cover the happy path, non-default IPs, empty /
garbage / IPv6 / partial-IP output, and thrown errors in runCapture.
50/50 preflight tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…e hatch (#2101)

Testing on a real corp-firewalled Jetson showed the env-var path
(docker build --network=host pre-built locally, then hand the image
tag to openshell via --from <tag>) has a structural flaw: openshell's
sandbox create treats --from <Dockerfile> and --from <image-tag> as
distinct paths per its own CLI contract — the former auto-pushes to
the gateway cluster, the latter does not. Our pre-built image
therefore never reaches the gateway, and the sandbox pod can't pull
it, hanging at not-Ready and getting cleaned up.

We could force-push via `docker save | docker exec gateway ctr
images import` but that's a hack targeting openshell internals and
breaks on version bumps. A proper fix belongs upstream in openshell
(a --build-network flag).

Drop the env-var path for now. The preflight probe + remediation
text retained below already gives users the real fix:

  1. systemd-resolved DNSStubListenerExtra on the docker bridge
  2. explicit UDP:53-capable DNS in /etc/docker/daemon.json

Both are persistent host-side changes that cover every docker flow,
not just our build step — strictly more robust than the build-only
workaround this commit removes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
The heredoc form (`sudo tee ... <<EOF\n...\nEOF`) loses its line
breaks when users copy-paste from terminal output, so the config
files end up empty. Chain the full systemd-resolved + daemon.json
setup into one `&&` one-liner using `printf '...\n'` for the config
body and `>/dev/null` to silence `tee`'s echo.

Also add a short warning that the command overwrites
/etc/docker/daemon.json — users who already have custom docker
daemon config need to merge, not copy blindly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
The &&-chained mega-line was hard to read and scared users.  Go back
to one command per printed line but drop the heredoc — use `printf
'...\n'` for the two config-body writes so each command stays a
valid, self-contained single-line shell command that pastes cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

preflight() now runs an in-container DNS probe immediately after confirming Docker; on probe failure it classifies the reason, optionally prints truncated probe output, computes the Docker bridge gateway IP (fallback to 172.17.0.1), prints platform-specific remediation, and exits with code 1; on success it logs container DNS works and continues.

Changes

Cohort / File(s) Summary
DNS Probe Core
src/lib/preflight.ts
Adds DnsProbeResult and ProbeContainerDnsOpts; implements probeContainerDns() (runs an overridable docker run --rm --pull=missing busybox:latest nslookup registry.npmjs.org, applies a fixed spawn timeout, classifies failures into image_pull_failed, servers_unreachable, resolution_failed, no_output, error, and truncates details) and getDockerBridgeGatewayIp() (runs docker network inspect bridge via injectable capture, parses IPv4 Gateway, returns null on errors/invalid output).
Onboarding Integration
src/lib/onboard.ts
Calls probeContainerDns() in preflight; on failure distinguishes image_pull_failed vs other DNS errors, optionally prints up to the last four non-empty probe lines, derives/uses Docker bridge gateway IP (with default fallback), emits platform-aware remediation (Linux systemd-resolved + daemon.json, macOS, Windows/WSL/container runtime notes), and exits with status 1; on success logs container DNS works and proceeds.
Tests
src/lib/preflight.test.ts
Adds tests for probeContainerDns() and getDockerBridgeGatewayIp() covering success and each failure classification, truncation behavior, command construction (no shell timeout wrappers), spawn timeout use, optional command override, gateway parsing for various outputs, and capture error handling.

Sequence Diagram

sequenceDiagram
    participant Onboard as Onboarding
    participant Preflight as preflight()
    participant Probe as probeContainerDns()
    participant Docker as Docker Engine
    participant DNS as DNS Server

    Onboard->>Preflight: start preflight
    Preflight->>Probe: run container DNS probe
    Probe->>Docker: docker run --rm --pull=missing busybox:latest nslookup registry.npmjs.org
    Docker->>DNS: UDP:53 query
    alt DNS returns Name/Address
        DNS-->>Docker: nslookup output (Name/Address)
        Docker-->>Probe: captured output
        Probe-->>Preflight: { ok: true }
        Preflight-->>Onboard: continue
    else pull/timeout/resolution/error
        DNS-->>Docker: no/invalid response or image pull error
        Docker-->>Probe: empty/error output
        Probe-->>Preflight: { ok: false, reason, details }
        Preflight->>Preflight: getDockerBridgeGatewayIp()
        Preflight-->>Onboard: print diagnostics + platform remediation
        Preflight-->>Onboard: exit 1
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I tunneled into a tiny box so small,
I asked the name the network would recall.
If answers hopped back, I twitched my nose and ran,
If silence lingered, I traced the gateway plan,
Then off I bounded—fix in paw, joy in hand.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(preflight): probe container DNS and surface corp-firewall workaround' directly and clearly describes the main change: adding a DNS probe to the preflight flow and providing firewall-related guidance.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/preflight-dns-probe-2101

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/lib/preflight.ts (1)

839-841: Use strict IPv4 validation in bridge-gateway parsing.

The current regex accepts invalid IPv4 values like 999.999.999.999. Since node:net is already imported, use net.isIPv4() to avoid false positives.

Proposed change
-  // Accept only a dotted-quad IPv4 address. Empty / garbage / IPv6 → null.
-  if (!/^\d{1,3}(?:\.\d{1,3}){3}$/.test(trimmed)) return null;
+  // Accept only a valid IPv4 address. Empty / garbage / IPv6 → null.
+  if (!net.isIPv4(trimmed)) return null;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/preflight.ts` around lines 839 - 841, Replace the loose dotted-quad
regex in the bridge-gateway parsing with a strict IPv4 check using the existing
net import: instead of testing /^\d{1,3}(?:\.\d{1,3}){3}$/.test(trimmed), call
net.isIPv4(trimmed) and return trimmed only when that returns true (otherwise
return null); update the logic where 'trimmed' is validated so invalid addresses
like "999.999.999.999" are rejected.
src/lib/preflight.test.ts (1)

609-622: Also assert the timeout guard in command-shape coverage.

This command-shape test is close, but it doesn’t verify the timeout token. Adding that assertion would better lock in the anti-hang behavior this PR is targeting.

Suggested assertion update
     expect(captured[0]).toContain("docker run");
     expect(captured[0]).toContain("busybox");
     expect(captured[0]).toContain("registry.npmjs.org");
+    expect(captured[0]).toMatch(/\btimeout\b/);
+    expect(captured[0]).toMatch(/\b5s?\b/);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/preflight.test.ts` around lines 609 - 622, The test for
probeContainerDns needs to also assert the presence of the timeout guard in the
spawned command: update the test that sets runCaptureImpl (the captured array
and BUSYBOX_SUCCESS usage remain) to include an assertion that captured[0]
contains the timeout guard substring (e.g., "timeout" or the concrete timeout
flag/token your probeContainerDns implementation emits) so the test verifies the
anti-hang timeout token is present in the command shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2642-2696: The current branch treats any !dns.ok as fatal and
calls process.exit(1); change the logic so only a confirmed DNS failure triggers
process.exit(1) while inconclusive probe results (e.g., dns.status === 'error'
or dns.status === 'no_output' or missing dns.details) produce warnings but do
not abort onboarding. Concretely, in the block around dns.ok / dns.details and
getDockerBridgeGatewayIp(), keep the existing explanatory messages, but branch:
if (dns.ok) => success; else if (dns.status === 'error' || dns.status ===
'no_output' || !dns.details) => log the warning text (without the final
process.exit(1)) and continue; else (confirmed failure case) => print fix
options and call process.exit(1). Ensure you reference dns.ok, dns.details,
dns.status, getDockerBridgeGatewayIp(), and remove/guard the process.exit(1) so
transient or inconclusive probe outcomes don't block onboarding.

In `@src/lib/preflight.ts`:
- Around line 851-864: The DNS probe can hang because probeContainerDns() calls
runCaptureImpl(command, { ignoreError: true }) with no timeout; update
probeContainerDns() to pass a bounded timeout (e.g., 5000–10000 ms) into the
execution path by calling runCaptureImpl(command, { ignoreError: true, timeout:
5000 }), and ensure the runCapture implementation (the runCapture function and
any wrapper used as opts.runCaptureImpl) accepts and forwards a timeout option
to the underlying exec/spawn call so stalled docker runs are killed; reference
the command variable, runCaptureImpl, runCapture, and probeContainerDns to
locate the changes.

---

Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 609-622: The test for probeContainerDns needs to also assert the
presence of the timeout guard in the spawned command: update the test that sets
runCaptureImpl (the captured array and BUSYBOX_SUCCESS usage remain) to include
an assertion that captured[0] contains the timeout guard substring (e.g.,
"timeout" or the concrete timeout flag/token your probeContainerDns
implementation emits) so the test verifies the anti-hang timeout token is
present in the command shape.

In `@src/lib/preflight.ts`:
- Around line 839-841: Replace the loose dotted-quad regex in the bridge-gateway
parsing with a strict IPv4 check using the existing net import: instead of
testing /^\d{1,3}(?:\.\d{1,3}){3}$/.test(trimmed), call net.isIPv4(trimmed) and
return trimmed only when that returns true (otherwise return null); update the
logic where 'trimmed' is validated so invalid addresses like "999.999.999.999"
are rejected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5bc9cd8a-ae5e-4e35-8a8c-f8386dc5db2e

📥 Commits

Reviewing files that changed from the base of the PR and between b1afc50 and ad58918.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts

Comment thread src/lib/onboard.ts
Comment thread src/lib/preflight.ts
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this issue that proposes a new feature to improve the preflight checks in NemoClaw.


Related open issues:

ericksoa
ericksoa previously approved these changes Apr 24, 2026

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid preflight addition. Turns a 15-minute cryptic npm hang into a 5-second targeted error with copy-pasteable remediation commands.

Looks good

  • probeContainerDns() — busybox nslookup in a throwaway container is the right probe. Lightweight, no side effects, catches the exact failure mode (UDP:53 blocked to public resolvers).
  • getDockerBridgeGatewayIp() — dynamically detects the bridge gateway instead of hardcoding 172.17.0.1. Handles non-default IPs, detection failure, and garbage output with clear user messaging for each case.
  • Remediation text is well-crafted — warning about daemon.json overwrite, one-command-per-line for clean copy-paste, and the "ask your IT team" fallback for option 2. The iterative commits (heredoc → mega-line → individual printf commands) show the UX was tested.
  • Decision to drop NEMOCLAW_DOCKER_BUILD_NETWORK is correct — the PR body explains the structural dead-end with openshell's --from contract clearly. Host-side DNS fix is strictly more robust.
  • Test coverage is thorough — 14 new tests covering success, resolution failure, no output, null output, command override, error handling, detail truncation, and all the bridge IP edge cases (non-default, empty, garbage, IPv6, partial, thrown errors).
  • All deps are injectablerunCaptureImpl, outputOverride, command override. Clean for testing.

Minor (non-blocking)

The busybox pull (--pull=missing) on a network where DNS is broken will itself fail/hang before the nslookup runs. The docker run timeout behavior depends on whether busybox is already cached locally. If it's not cached and DNS is broken, the probe may take longer than expected. Consider adding a --timeout or wrapping in a shell timeout to bound the probe duration. Not blocking since the common case (busybox cached from a prior Docker install) works fine.

LGTM.

@yanyunl1991

Copy link
Copy Markdown
Contributor Author

Hi @ericksoa , please don't merge this PR now, I still need more enhancements. Thanks

yanyunl1991 and others added 2 commits April 24, 2026 10:54
The systemd-resolved + /etc/docker/daemon.json snippet the preflight
hint used to print is Linux-specific — on macOS, Windows, or WSL
backed by Docker Desktop, those commands either don't exist or don't
touch the right daemon. Printing them anyway was misleading.

Gate the Linux-specific hint on host.platform === "linux" &&
!host.isWsl && host.systemctlAvailable (all already resolved by
assessHost). For other platforms print a minimal, correct pointer
instead:

  - macOS:            Docker Desktop → Settings → Docker Engine JSON.
  - Windows / WSL:    same, via Docker Desktop for Windows; with a
                      note for the rarer "native docker inside WSL"
                      setup that points back at the Linux fix.
  - fallback (bsd…):  generic "add dns to daemon.json and restart".

The probe itself is unchanged and stays platform-neutral
(`docker run busybox nslookup ...` works everywhere docker does).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
Four robustness improvements after review of the initial implementation:

1. Distinguish probe failure reasons. A failed `docker run busybox
   nslookup ...` could mean (a) DNS inside the container is blocked,
   (b) the DNS servers themselves are unreachable, (c) the busybox
   image couldn't be pulled, or (d) some other resolver error. The
   previous code lumped all of these under `resolution_failed` and
   printed a DNS-specific hint that was misleading for (c). Split into
   four reasons: `servers_unreachable`, `image_pull_failed`,
   `resolution_failed`, and `no_output`, with onboard's hint adapting
   its headline + hiding the #2101-specific advice when the problem is
   an image pull failure.

2. Bound the probe with `timeout 20` on Linux so preflight can't hang
   forever if docker itself is wedged. busybox nslookup already has a
   ~15 s internal budget; on macOS (no coreutils timeout by default)
   we rely on that alone rather than break portability.

3. Safer /etc/docker/daemon.json handling. The old hint overwrote the
   file, silently clobbering any existing config (registry-mirrors,
   log-driver, etc.). New hint backs up to daemon.json.bak-<timestamp>,
   then uses jq to merge the dns key into the existing JSON (falls
   back to writing a fresh file when none exists or jq isn't
   available).

4. Add a verification step at the end of every platform branch:
     docker run --rm busybox nslookup registry.npmjs.org
   so the user can confirm the fix before rerunning `nemoclaw onboard`.

Also expands the WSL branch to inline the Linux fix for the
"native docker inside WSL" case (previously just pointed at #2101
comments, which don't have the steps).

Tests: 6 new vitest cases cover servers_unreachable,
image_pull_failed (two variants), resolution_failed for NXDOMAIN,
and the Linux/darwin platform-prefix branching. 56/56 preflight
tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2742-2755: The Windows/WSL branch hard-codes "172.17.0.1" when
calling printLinuxFix, which breaks native WSL Docker installs that use a
non-default bridge; replace the literal with the same detected bridge gateway
variable used in the Linux branch (the bridge/bridgeGateway/bridgeIp variable
computed earlier) so printLinuxFix(hostBridgeIp, null) uses the actual detected
bridge IP; update the call in the host.platform === "win32" || host.isWsl branch
to pass that detected variable instead of the hard-coded string.
- Around line 2648-2769: The DNS probe currently treats dns.reason ===
"image_pull_failed" as a definitive DNS failure and calls process.exit(1), but
that only proves busybox:latest couldn't be pulled and can incorrectly block the
later pullAndResolveBaseImageDigest() for SANDBOX_BASE_IMAGE; modify the logic
in the DNS failure handling (the branch that checks dns.reason and the final
process.exit(1) call) to either retry the probe using the actual
SANDBOX_BASE_IMAGE or treat image_pull_failed as inconclusive by logging a
warning and continuing (i.e., avoid exiting) so the normal
pullAndResolveBaseImageDigest() can run; keep the existing DNS diagnostics
(dns.details, printLinuxFix, getDockerBridgeGatewayIp) intact but do not abort
the onboarding flow on image_pull_failed.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 312dc159-ad5d-4268-beba-61de2cffc78f

📥 Commits

Reviewing files that changed from the base of the PR and between 87dcc5d and 8ffb798.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/preflight.ts
  • src/lib/preflight.test.ts

Comment thread src/lib/onboard.ts
Comment thread src/lib/onboard.ts Outdated
…2101)

Docker Desktop's "Settings → Docker Engine" JSON editor actually reads
and writes ~/.docker/daemon.json, so a pure-CLI path exists and should
be preferred over asking users to click through the GUI.

Branch the macOS hint on the already-detected host.runtime:

- Colima:         `colima stop && colima start --dns <corp-dns-ip>`
- Docker Desktop: edit ~/.docker/daemon.json with jq (backup first),
                  then `osascript -e 'quit app "Docker"' && open -a Docker`
                  to restart.
- Unknown / podman / rancher: show both CLI options plus a pointer
                              for the less-common runtimes.

All three paths end at the same verification step
(`docker run --rm busybox nslookup registry.npmjs.org`) that the
earlier commit added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
@ericksoa ericksoa dismissed their stale review April 24, 2026 03:42

Withdrawing approval — need to reconsider.

… binary

The previous commit used a `timeout 20` prefix on the probe command on
Linux only, skipping macOS (no GNU coreutils `timeout` by default) and
Windows (has a same-named but unrelated `timeout.exe`). That left
non-Linux users without any protection if the docker daemon wedged.

Switch to Node's spawn-level `timeout: 20_000` option on the runCapture
call. Node kills the child with SIGTERM when the timer fires; under
`ignoreError: true` the result is simply an empty string, so the probe
falls through cleanly to its `no_output` branch. `--rm` on the docker
run cleans up any container that was started before the kill.

Benefits: identical behavior on Linux / macOS / Windows / WSL, no
host-side binary dependency, simpler code (drop the platform override
from ProbeContainerDnsOpts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/preflight.ts`:
- Around line 907-913: The probe result check treats whitespace-only output as
valid; modify the output validation where the variable output is inspected (the
block returning {ok:false, reason:"no_output", details:...}) to also consider
strings that are only whitespace as empty — i.e., check output.trim() (or
output?.trim()) and return the same no_output response when the trimmed length
is zero so whitespace/newline-only runner output is handled as no_output instead
of falling through.
- Around line 846-857: The current logic in src/lib/preflight.ts assumes the
docker inspect output (captured by runCaptureImpl) is a single dotted-quad and
uses /^\d{1,3}(?:\.\d{1,3}){3}$/ to validate trimmed, but dual-stack bridge
outputs can concatenate IPv4+IPv6 (e.g., "172.17.0.1db8::1") and fail; change
the validation to extract the first IPv4 from the output instead of requiring an
exact match — e.g., after obtaining raw/trimmed, run a regex search for the
first IPv4 pattern (/\b\d{1,3}(?:\.\d{1,3}){3}\b/) and return that match if
present (return null otherwise), or alternatively alter the docker format to
emit a separator between gateways before trimming; update the logic around
runCaptureImpl/raw/trimmed to use the extracted IPv4.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4d617d04-7ff8-4961-9fe2-c43a2060b820

📥 Commits

Reviewing files that changed from the base of the PR and between fcaa98d and 25b6e32.

📒 Files selected for processing (2)
  • src/lib/preflight.test.ts
  • src/lib/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/preflight.test.ts

Comment thread src/lib/preflight.ts Outdated
Comment thread src/lib/preflight.ts Outdated
yanyunl1991 and others added 5 commits April 24, 2026 12:31
…tput

Two robustness fixes from CodeRabbit review:

1. getDockerBridgeGatewayIp: docker's `{{range .IPAM.Config}}{{.Gateway}}
   {{end}}` template concatenates gateways without a separator. A
   dual-stack bridge emits "172.17.0.1fd00:abcd::1" (or the reverse
   order). The previous anchored regex
   `/^\d{1,3}(?:\.\d{1,3}){3}$/` rejected that outright, and word-
   boundary anchors don't help because "1" and "f" are both \w, and an
   IPv6 gateway ending in a digit ("...::1") eats the would-be start
   boundary of the IPv4. Search for the first dotted-quad anywhere in
   the output instead, then sanity-check each octet is 0-255 to reject
   garbage like "999.999.999.999".

2. probeContainerDns: the empty-output branch only tripped on null /
   "". A killed child or docker leftover newline ("  \n\n") would pass
   the falsy check and fall through to the regex ladder, misreporting
   as `resolution_failed`. Treat whitespace-only output the same as
   empty.

4 new vitest cases cover the IPv4-first and IPv6-first dual-stack
concatenations, the multi-IPv4 ordering, and the whitespace-only
probe output. 60/60 preflight tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
The WSL branch's option 2 ("if you run docker natively inside WSL,
apply the Linux fix") previously hard-coded 172.17.0.1 in the
printf/daemon.json commands. Native docker inside WSL can have a
custom bridge subnet (via daemon.json `bip` or an explicit network
create) just like any other Linux host, so the printed snippet would
be wrong for those users.

Call getDockerBridgeGatewayIp() in the WSL branch too and plumb the
same detected / default / verify-with-this-command hint the Linux
branch already uses. Falls back to 172.17.0.1 when detection returns
null (e.g., Docker Desktop reflection that points elsewhere).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…ures

Per CodeRabbit review. The previous code exited(1) on every !dns.ok
case, which over-blocks:

- image_pull_failed only proves docker couldn't pull busybox. It does
  not prove DNS-from-container is broken. A host whose daemon can't
  reach Docker Hub right now but has cached images / pulls from
  ghcr.io (like sandbox-base) would be incorrectly stopped short of
  pullAndResolveBaseImageDigest() later in the flow.
- no_output / error signal the probe itself couldn't run — either the
  runCapture call was killed by our timeout, docker daemon was
  momentarily wedged, or something else transient. Not a reliable
  signal that DNS is broken.

Only servers_unreachable and resolution_failed actually observed a DNS
failure: the first is the clean #2101 UDP:53-block signature, the
second means the resolver answered but gave a non-answer. Both imply
the sandbox build's `npm ci` will hit the same outcome (the 15-min
hang this PR is trying to prevent), so they stay fatal with the full
remediation hints.

Everything else downgrades to a warning and preflight continues:

  ⚠ Container DNS probe inconclusive (reason: <...>).
    <details>
    Proceeding. If the sandbox build later hangs at `npm ci`,
    see issue #2101.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid preflight addition. Turns a 15-minute cryptic npm hang into a 5-second targeted error with copy-pasteable remediation commands.

All feedback from the previous review round has been addressed:

  • Timeout — 20s Node spawn-level timeout bounds the probe on every platform (no dependency on a host timeout binary). Inconclusive results (timeout, image pull failure, errors) warn and proceed rather than blocking.
  • Probe classification — only servers_unreachable and resolution_failed are fatal; image_pull_failed / no_output / error are correctly treated as inconclusive.
  • Bridge IP detectiongetDockerBridgeGatewayIp() handles dual-stack bridge output (concatenated IPv4+IPv6 with no separator) by regex-extracting the first valid dotted-quad. WSL path now uses the detected IP instead of hardcoding 172.17.0.1.
  • Test coverage — 16 new cases cover every classification path, bridge IP edge cases (non-default, empty, garbage, IPv6, dual-stack both orderings), command shape, and timeout propagation.

Platform-aware remediation text is well-crafted — the systemd-resolved / Colima / Docker Desktop / WSL branching gives users commands they can actually paste.

LGTM.

@ericksoa ericksoa merged commit 565f50e into main Apr 24, 2026
15 checks passed
@cv cv added the v0.0.25 label Apr 24, 2026
ericksoa pushed a commit that referenced this pull request Apr 25, 2026
## Summary
Catches up the user-facing docs for changes merged since v0.0.24 and
bumps `docs/versions1.json` so the next docs build publishes as 0.0.25.

## Changes
- `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T
39.x) entry to reflect the new conditional `br_netfilter` auto-load on
R39 hosts that lack it (from #2419), and added a new "DNS resolution
from inside docker fails (corporate firewall)" section covering the new
container-DNS preflight probe and platform-specific remediation paths
(from #2349).
- `docs/get-started/quickstart.md`: added a "Review the Configuration
Before the Sandbox Build" subsection documenting the new `[Y/n]` review
gate that fires after the sandbox-name prompt (from #2221).
- `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted
0.0.24, and trimmed to ten entries to match what
`scripts/bump-version.ts` writes. `docs/project.json` regenerates from
`versions1.json` at build time and now reports `0.0.25`.
- `.agents/skills/nemoclaw-user-get-started/` and
`.agents/skills/nemoclaw-user-reference/`: regenerated via
`scripts/docs-to-skills.py`.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Onboarding now shows a configuration review and requires explicit
confirmation before building a sandbox; non-interactive mode prints the
summary without prompting.

* **Documentation**
* Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and
conditional br_netfilter guidance.
* Added corporate-firewall DNS troubleshooting with platform-specific
remediation and a verification nslookup step.
  * Quickstart updated to document the new confirmation flow.

* **Chores**
  * Documentation version index updated.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
…ound (NVIDIA#2101) (NVIDIA#2349)

## Summary
   
Catch the NVIDIA#2101 failure mode (`npm ci` hangs ~15 min inside the sandbox
build then prints the cryptic `Exit handler never called`) at preflight
— before the user ever starts the build.
                                                         
Root cause: DNS resolution from inside a docker container fails on corp
networks that block outbound UDP:53 to public resolvers.
`systemd-resolved` on the host bypasses this via DNS-over-TLS (TCP:853),
but the container's `/etc/resolv.conf` gets docker's fallback
nameservers (`1.1.1.1` / `8.8.8.8`) and queries them
over plain UDP:53 — which the firewall drops. After ~15 min of libuv
retries, npm gives up but lingering handles keep the event loop alive
past its exit hook, producing the cryptic error.
Preflight now runs a ~5-second probe and, on a confirmed DNS failure,
surfaces a targeted error with platform- and runtime-aware workarounds —
converting a 15-minute silent hang into a 5-second actionable message.
   
## Related Issue
                                                         
  Fixes NVIDIA#2101.

## Changes
   
- **`src/lib/preflight.ts`** (+ ~130 lines)
- `probeContainerDns()` — runs `docker run --rm --pull=missing busybox
nslookup registry.npmjs.org`, parses the output, and returns a
structured reason: `ok` / `servers_unreachable` / `resolution_failed` /
`image_pull_failed` / `no_output` / `error`. Bounded by Node-level
`timeout: 20_000` on the runCapture call
(portable; no dependency on a host `timeout` binary). Whitespace-only
output is treated as empty.
- `getDockerBridgeGatewayIp()` — reads `docker network inspect bridge`
and extracts the first IPv4 gateway (handles dual-stack bridges whose
`{{range .IPAM.Config}}{{.Gateway}}{{end}}` template concatenates
IPv4+IPv6 with no separator). Each octet is validated 0–255 to reject
garbage.
- **`src/lib/onboard.ts`** (+ ~120 lines in `preflight()`)
    - Wires the probe in right after the Docker-reachable check.
- Only `servers_unreachable` and `resolution_failed` are fatal (the
probe actually observed DNS failing); `image_pull_failed` / `no_output`
/ `error` warn and proceed (they're inconclusive — they don't prove
container DNS is broken).
- The fix hint is platform- and runtime-aware:
- **Linux (systemd-resolved)**: `DNSStubListenerExtra=<bridge-ip>` in
`/etc/systemd/resolved.conf.d/`, then a `jq`-safe merge into
`/etc/docker/daemon.json` (backs up the existing file first with a
timestamp suffix; falls back to a fresh file only when none existed or
`jq` isn't available).
- **macOS + Colima**: `colima stop && colima start --dns <corp-dns-ip>`.
- **macOS + Docker Desktop**: edit `~/.docker/daemon.json` via `jq`,
restart Docker via `osascript -e 'quit app "Docker"' && open -a Docker`.
- **macOS + unknown runtime**: both CLI paths plus a pointer for Rancher
Desktop / Podman.
- **Windows / WSL**: Docker Desktop GUI plus an inlined
Linux-native-docker-in-WSL fallback (also using the detected bridge IP).
- **Other (BSD etc.)**: generic `daemon.json` hint.
- Every path ends at a verification step: `docker run --rm busybox
nslookup registry.npmjs.org`.
- **`src/lib/preflight.test.ts`** (+ ~170 lines; 16 new cases)
- Probe: `ok`, `resolution_failed` (NXDOMAIN), `servers_unreachable`
(UDP:53 block), `image_pull_failed` (two shapes), `no_output` for empty
/ null / whitespace-only output, thrown errors, the 400-byte detail
truncation, and cross-platform timeout via Node.
- Bridge-IP parser: happy path, non-default IP, empty / null / garbage /
IPv6 / partial output, IPv4-first and IPv6-first dual-stack
concatenations, first-of-many selection, and runCapture-throws.
**Zero changes to `bin/nemoclaw.js`** — the file is on the ts-migration
guard's `runtimeMoves` list, and the new code deliberately lives outside
the CLI runtime path so `scripts/check-legacy-migrated-paths.ts` passes
clean.
## What the user sees on failure (Linux, `servers_unreachable`)
                                                         
```
  [1/8] Preflight checks                                 
    ✓ Docker is running
    ✗ DNS resolution from inside a docker container failed.
;; connection timed out; no servers could be reached
The sandbox build runs `npm ci` inside a container and needs to resolve
registry.npmjs.org. On networks that block outbound UDP:53 to public DNS
(common in corporate environments that force DNS-over-TLS on the host),
the build appears to hang for ~15 minutes and then prints the cryptic
`npm error Exit handler never called`. See issue NVIDIA#2101.
Fix options:
    1. Make systemd-resolved reachable from containers (recommended):
sudo mkdir -p /etc/systemd/resolved.conf.d/
printf '[Resolve]\nDNSStubListenerExtra=172.17.0.1\n' | sudo tee
/etc/systemd/resolved.conf.d/docker-bridge.conf
sudo systemctl restart systemd-resolved
Then add the dns key to /etc/docker/daemon.json (safely merges with
existing config if jq is installed):
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak-$(date +%s)
2>/dev/null
{ sudo jq '. + {"dns":["172.17.0.1"]}' /etc/docker/daemon.json
2>/dev/null || echo '{"dns":["172.17.0.1"]}'; } | sudo tee
/etc/docker/daemon.json.new >/dev/null
sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json
sudo systemctl restart docker
2. Configure an explicit UDP:53-capable DNS in /etc/docker/daemon.json
(ask your IT team for an internal DNS server IP).
Verify the fix worked:
docker run --rm busybox nslookup registry.npmjs.org
  ```                                                    
macOS (Docker Desktop / Colima), Windows/WSL (Docker Desktop +
WSL-native-docker fallback), and an "other platforms" generic branch
print runtime-appropriate variants of the above — no pointer to a GUI
unless there's no CLI alternative.
## Why just the preflight probe, not also an opt-in `--network=host` for
the build
An earlier iteration of this branch also added a
`NEMOCLAW_DOCKER_BUILD_NETWORK=host` escape hatch: when set, NemoClaw
would `docker build --network=host` locally and hand openshell the
pre-built image via `--from <tag>` so the `npm ci` step could bypass the
broken UDP:53 path. Testing on a real corp-firewalled Jetson
showed this is a structural dead-end under openshell's current CLI
contract:
   
- `openshell sandbox create --from <Dockerfile>` → builds the image
**and pushes it into the gateway cluster**.
- `openshell sandbox create --from <image-tag>` → treats the tag as
already registry-reachable and **skips the push**.
A locally-built image reference satisfies neither: the gateway can't
pull it from any registry, the sandbox pod fails to start, nemoclaw
deletes the not-Ready sandbox, and the user ends up worse off than
before (build succeeds but onboard doesn't complete). There's no
`openshell image push` / `image load` command to
bridge the gap short of `docker save | docker exec gateway ctr images
import`, which reaches into openshell internals and breaks on version
bumps. A proper fix belongs upstream in openshell (e.g. a
`--build-network` flag that applies during its own internal build).
The host-side remediation this PR surfaces is also strictly more
complete: it covers **every** docker container on the host (gateway,
sandbox, ad-hoc `docker run`s), not just the single build invocation
that an opt-in env-var path would reach.
   
## Type of Change
                                                         
- [x] Code change (feature, bug fix, or refactor)
  - [ ] Code change with doc updates                     
  - [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)
   
## Verification
                                                         
- [ ] `npx prek run --all-files` passes — *not run locally: prek binary
download returns HTTP 400 in my sandbox environment; relying on CI to
validate.*
  - [x] `npm test` — 60/60 tests in `src/lib/preflight.test.ts` pass.
- [x] Tests added or updated for new or changed behavior.
  - [x] No secrets, API keys, or credentials committed.  
- [x] Docs updated for user-facing behavior changes — *the new preflight
error text is the user-facing surface; no separate docs page needed.*
**Additional local checks:**
  - `npm run build:cli` — clean.                         
  - `npm run typecheck` / `npm run typecheck:cli` — clean.
- `npm run lint` — clean for the three changed files.
- `npm run ts-migration:guard -- --base origin/main --head HEAD` — `No
edits to migrated legacy paths or removed compatibility shims detected.`
- **End-to-end reproduction**: on a Jetson Orin (JetPack R39 / kernel
6.8.12-1018-tegra / Docker 29.4.0 / Node 22.22.2 / npm 10.9.7) attached
to the NVIDIA corp network, `nemoclaw onboard` on `main` hangs at `RUN
npm ci && npm run build` for 922 s and fails with `Exit handler never
called`. Live `strace` on the hung
libuv worker shows it cycling `sendto(..., registry.npmjs.org ...,
MSG_NOSIGNAL) / ppoll(timeout=5s) / close / socket /
connect(1.1.1.1:53)`. With this PR, preflight instead prints the error
above in ~5 seconds. Applying option 1's snippet lets `npm ci` complete
in ~5 s and the sandbox image build finishes normally.
## AI Disclosure
                                                         
- [x] AI-assisted — tool: Claude Code

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Onboarding now performs a container DNS connectivity check, provides
targeted diagnostics with platform-aware remediation tips when DNS
fails, and halts onboarding immediately for fatal DNS issues.
* Detects container network gateway to tailor troubleshooting guidance.

* **Tests**
* Added unit tests covering DNS probing, gateway detection, failure
classification, command behavior, and diagnostic output trimming.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->


<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>

---------

Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
DemianHeyGen pushed a commit to DemianHeyGen/NemoClaw that referenced this pull request Apr 30, 2026
## Summary
Catches up the user-facing docs for changes merged since v0.0.24 and
bumps `docs/versions1.json` so the next docs build publishes as 0.0.25.

## Changes
- `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T
39.x) entry to reflect the new conditional `br_netfilter` auto-load on
R39 hosts that lack it (from NVIDIA#2419), and added a new "DNS resolution
from inside docker fails (corporate firewall)" section covering the new
container-DNS preflight probe and platform-specific remediation paths
(from NVIDIA#2349).
- `docs/get-started/quickstart.md`: added a "Review the Configuration
Before the Sandbox Build" subsection documenting the new `[Y/n]` review
gate that fires after the sandbox-name prompt (from NVIDIA#2221).
- `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted
0.0.24, and trimmed to ten entries to match what
`scripts/bump-version.ts` writes. `docs/project.json` regenerates from
`versions1.json` at build time and now reports `0.0.25`.
- `.agents/skills/nemoclaw-user-get-started/` and
`.agents/skills/nemoclaw-user-reference/`: regenerated via
`scripts/docs-to-skills.py`.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Onboarding now shows a configuration review and requires explicit
confirmation before building a sandbox; non-interactive mode prints the
summary without prompting.

* **Documentation**
* Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and
conditional br_netfilter guidance.
* Added corporate-firewall DNS troubleshooting with platform-specific
remediation and a verification nslookup step.
  * Quickstart updated to document the new confirmation flow.

* **Chores**
  * Documentation version index updated.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ksapru pushed a commit to ksapru/NemoClaw that referenced this pull request May 12, 2026
## Summary
Catches up the user-facing docs for changes merged since v0.0.24 and
bumps `docs/versions1.json` so the next docs build publishes as 0.0.25.

## Changes
- `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T
39.x) entry to reflect the new conditional `br_netfilter` auto-load on
R39 hosts that lack it (from NVIDIA#2419), and added a new "DNS resolution
from inside docker fails (corporate firewall)" section covering the new
container-DNS preflight probe and platform-specific remediation paths
(from NVIDIA#2349).
- `docs/get-started/quickstart.md`: added a "Review the Configuration
Before the Sandbox Build" subsection documenting the new `[Y/n]` review
gate that fires after the sandbox-name prompt (from NVIDIA#2221).
- `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted
0.0.24, and trimmed to ten entries to match what
`scripts/bump-version.ts` writes. `docs/project.json` regenerates from
`versions1.json` at build time and now reports `0.0.25`.
- `.agents/skills/nemoclaw-user-get-started/` and
`.agents/skills/nemoclaw-user-reference/`: regenerated via
`scripts/docs-to-skills.py`.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Onboarding now shows a configuration review and requires explicit
confirmation before building a sandbox; non-interactive mode prints the
summary without prompting.

* **Documentation**
* Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and
conditional br_netfilter guidance.
* Added corporate-firewall DNS troubleshooting with platform-specific
remediation and a verification nslookup step.
  * Quickstart updated to document the new confirmation flow.

* **Chores**
  * Documentation version index updated.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wscurran wscurran added area: packaging Packages, images, registries, installers, or distribution feature PR adds or expands user-visible functionality platform: container Affects Docker, containerd, Podman, or images and removed area: packaging Packages, images, registries, installers, or distribution Docker labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature PR adds or expands user-visible functionality platform: container Affects Docker, containerd, Podman, or images

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Jetson][Orin]Sandbox image build fails on RUN “npm ci && npm run build: npm” error Exit handler never called

4 participants