feat(preflight): probe container DNS and surface corp-firewall workaround (#2101)#2349
Conversation
…de (#2101) Addresses #2101. When the sandbox build's `npm ci` step runs inside a docker container on a network that blocks outbound UDP:53 to public DNS (common in corporate environments using DNS-over-TLS on the host), npm hangs for ~15 minutes trying to resolve `registry.npmjs.org` and then prints the cryptic `Exit handler never called`. Two changes, both additive and off-path for normal users: 1. **Preflight container-DNS probe** (`src/lib/preflight.ts`). New `probeContainerDns()` runs `docker run --rm --pull=missing busybox nslookup registry.npmjs.org` and parses the output. Called from `preflight()` in `onboard.ts` right after the Docker reachable check. On failure, prints a targeted error listing three fix options (host-networking opt-in env var, persistent host config, explicit DNS in docker daemon.json) and exits — 10-second clear error instead of a 15-minute cryptic hang. Skipped when the user has already set NEMOCLAW_DOCKER_BUILD_NETWORK (signals they know about the issue). 2. **`NEMOCLAW_DOCKER_BUILD_NETWORK` env var** (`src/lib/onboard.ts`, `createSandbox`). `openshell sandbox create --from <Dockerfile>` has no way to pass `--network` through to `docker build`. When the env var is set, pre-build the image locally with `docker build --network=<value>` and then hand the resulting image tag to openshell via `--from <tag>` instead of `--from <Dockerfile>`. Verified end-to-end on a Jetson Orin where npm ci hangs in default bridge mode but succeeds in ~8 s with `--network=host`. Tests: 8 new vitest cases in `src/lib/preflight.test.ts` cover the probe's success / resolution-failed / no-output / command-override / throw-handling / detail-truncation branches. 44/44 preflight tests pass; typecheck clean; guard clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…x hint
The preflight DNS error previously hardcoded 172.17.0.1 in the
persistent fix snippet, but docker's default bridge gateway is
configurable (daemon.json `bip`, custom bridges, etc.) and can land on
any RFC1918 address. On hosts where the detected IP differs from the
default, the old hint would silently give the wrong config.
Add `getDockerBridgeGatewayIp()` that runs `docker network inspect
bridge` and parses the gateway. Use it in the preflight error and
handle three cases:
- matches 172.17.0.1 (default): no note, keep hint concise.
- differs from default: show `(detected your docker bridge gateway
at X)` so the user sees what we substituted.
- detection fails (docker returns nothing, parse fails, etc.): show
`(could not auto-detect; verify with: docker network inspect
bridge ...)` so the user knows the default is a guess.
6 additional unit tests cover the happy path, non-default IPs, empty /
garbage / IPv6 / partial-IP output, and thrown errors in runCapture.
50/50 preflight tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…e hatch (#2101) Testing on a real corp-firewalled Jetson showed the env-var path (docker build --network=host pre-built locally, then hand the image tag to openshell via --from <tag>) has a structural flaw: openshell's sandbox create treats --from <Dockerfile> and --from <image-tag> as distinct paths per its own CLI contract — the former auto-pushes to the gateway cluster, the latter does not. Our pre-built image therefore never reaches the gateway, and the sandbox pod can't pull it, hanging at not-Ready and getting cleaned up. We could force-push via `docker save | docker exec gateway ctr images import` but that's a hack targeting openshell internals and breaks on version bumps. A proper fix belongs upstream in openshell (a --build-network flag). Drop the env-var path for now. The preflight probe + remediation text retained below already gives users the real fix: 1. systemd-resolved DNSStubListenerExtra on the docker bridge 2. explicit UDP:53-capable DNS in /etc/docker/daemon.json Both are persistent host-side changes that cover every docker flow, not just our build step — strictly more robust than the build-only workaround this commit removes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
The heredoc form (`sudo tee ... <<EOF\n...\nEOF`) loses its line breaks when users copy-paste from terminal output, so the config files end up empty. Chain the full systemd-resolved + daemon.json setup into one `&&` one-liner using `printf '...\n'` for the config body and `>/dev/null` to silence `tee`'s echo. Also add a short warning that the command overwrites /etc/docker/daemon.json — users who already have custom docker daemon config need to merge, not copy blindly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
The &&-chained mega-line was hard to read and scared users. Go back to one command per printed line but drop the heredoc — use `printf '...\n'` for the two config-body writes so each command stays a valid, self-contained single-line shell command that pastes cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughpreflight() now runs an in-container DNS probe immediately after confirming Docker; on probe failure it classifies the reason, optionally prints truncated probe output, computes the Docker bridge gateway IP (fallback to 172.17.0.1), prints platform-specific remediation, and exits with code 1; on success it logs container DNS works and continues. Changes
Sequence DiagramsequenceDiagram
participant Onboard as Onboarding
participant Preflight as preflight()
participant Probe as probeContainerDns()
participant Docker as Docker Engine
participant DNS as DNS Server
Onboard->>Preflight: start preflight
Preflight->>Probe: run container DNS probe
Probe->>Docker: docker run --rm --pull=missing busybox:latest nslookup registry.npmjs.org
Docker->>DNS: UDP:53 query
alt DNS returns Name/Address
DNS-->>Docker: nslookup output (Name/Address)
Docker-->>Probe: captured output
Probe-->>Preflight: { ok: true }
Preflight-->>Onboard: continue
else pull/timeout/resolution/error
DNS-->>Docker: no/invalid response or image pull error
Docker-->>Probe: empty/error output
Probe-->>Preflight: { ok: false, reason, details }
Preflight->>Preflight: getDockerBridgeGatewayIp()
Preflight-->>Onboard: print diagnostics + platform remediation
Preflight-->>Onboard: exit 1
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lib/preflight.ts (1)
839-841: Use strict IPv4 validation in bridge-gateway parsing.The current regex accepts invalid IPv4 values like
999.999.999.999. Sincenode:netis already imported, usenet.isIPv4()to avoid false positives.Proposed change
- // Accept only a dotted-quad IPv4 address. Empty / garbage / IPv6 → null. - if (!/^\d{1,3}(?:\.\d{1,3}){3}$/.test(trimmed)) return null; + // Accept only a valid IPv4 address. Empty / garbage / IPv6 → null. + if (!net.isIPv4(trimmed)) return null;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/preflight.ts` around lines 839 - 841, Replace the loose dotted-quad regex in the bridge-gateway parsing with a strict IPv4 check using the existing net import: instead of testing /^\d{1,3}(?:\.\d{1,3}){3}$/.test(trimmed), call net.isIPv4(trimmed) and return trimmed only when that returns true (otherwise return null); update the logic where 'trimmed' is validated so invalid addresses like "999.999.999.999" are rejected.src/lib/preflight.test.ts (1)
609-622: Also assert the timeout guard in command-shape coverage.This command-shape test is close, but it doesn’t verify the timeout token. Adding that assertion would better lock in the anti-hang behavior this PR is targeting.
Suggested assertion update
expect(captured[0]).toContain("docker run"); expect(captured[0]).toContain("busybox"); expect(captured[0]).toContain("registry.npmjs.org"); + expect(captured[0]).toMatch(/\btimeout\b/); + expect(captured[0]).toMatch(/\b5s?\b/);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/preflight.test.ts` around lines 609 - 622, The test for probeContainerDns needs to also assert the presence of the timeout guard in the spawned command: update the test that sets runCaptureImpl (the captured array and BUSYBOX_SUCCESS usage remain) to include an assertion that captured[0] contains the timeout guard substring (e.g., "timeout" or the concrete timeout flag/token your probeContainerDns implementation emits) so the test verifies the anti-hang timeout token is present in the command shape.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2642-2696: The current branch treats any !dns.ok as fatal and
calls process.exit(1); change the logic so only a confirmed DNS failure triggers
process.exit(1) while inconclusive probe results (e.g., dns.status === 'error'
or dns.status === 'no_output' or missing dns.details) produce warnings but do
not abort onboarding. Concretely, in the block around dns.ok / dns.details and
getDockerBridgeGatewayIp(), keep the existing explanatory messages, but branch:
if (dns.ok) => success; else if (dns.status === 'error' || dns.status ===
'no_output' || !dns.details) => log the warning text (without the final
process.exit(1)) and continue; else (confirmed failure case) => print fix
options and call process.exit(1). Ensure you reference dns.ok, dns.details,
dns.status, getDockerBridgeGatewayIp(), and remove/guard the process.exit(1) so
transient or inconclusive probe outcomes don't block onboarding.
In `@src/lib/preflight.ts`:
- Around line 851-864: The DNS probe can hang because probeContainerDns() calls
runCaptureImpl(command, { ignoreError: true }) with no timeout; update
probeContainerDns() to pass a bounded timeout (e.g., 5000–10000 ms) into the
execution path by calling runCaptureImpl(command, { ignoreError: true, timeout:
5000 }), and ensure the runCapture implementation (the runCapture function and
any wrapper used as opts.runCaptureImpl) accepts and forwards a timeout option
to the underlying exec/spawn call so stalled docker runs are killed; reference
the command variable, runCaptureImpl, runCapture, and probeContainerDns to
locate the changes.
---
Nitpick comments:
In `@src/lib/preflight.test.ts`:
- Around line 609-622: The test for probeContainerDns needs to also assert the
presence of the timeout guard in the spawned command: update the test that sets
runCaptureImpl (the captured array and BUSYBOX_SUCCESS usage remain) to include
an assertion that captured[0] contains the timeout guard substring (e.g.,
"timeout" or the concrete timeout flag/token your probeContainerDns
implementation emits) so the test verifies the anti-hang timeout token is
present in the command shape.
In `@src/lib/preflight.ts`:
- Around line 839-841: Replace the loose dotted-quad regex in the bridge-gateway
parsing with a strict IPv4 check using the existing net import: instead of
testing /^\d{1,3}(?:\.\d{1,3}){3}$/.test(trimmed), call net.isIPv4(trimmed) and
return trimmed only when that returns true (otherwise return null); update the
logic where 'trimmed' is validated so invalid addresses like "999.999.999.999"
are rejected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5bc9cd8a-ae5e-4e35-8a8c-f8386dc5db2e
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/preflight.test.tssrc/lib/preflight.ts
|
✨ Thanks for submitting this issue that proposes a new feature to improve the preflight checks in NemoClaw. Related open issues: |
ericksoa
left a comment
There was a problem hiding this comment.
Solid preflight addition. Turns a 15-minute cryptic npm hang into a 5-second targeted error with copy-pasteable remediation commands.
Looks good
probeContainerDns()— busybox nslookup in a throwaway container is the right probe. Lightweight, no side effects, catches the exact failure mode (UDP:53 blocked to public resolvers).getDockerBridgeGatewayIp()— dynamically detects the bridge gateway instead of hardcoding 172.17.0.1. Handles non-default IPs, detection failure, and garbage output with clear user messaging for each case.- Remediation text is well-crafted — warning about daemon.json overwrite, one-command-per-line for clean copy-paste, and the "ask your IT team" fallback for option 2. The iterative commits (heredoc → mega-line → individual printf commands) show the UX was tested.
- Decision to drop
NEMOCLAW_DOCKER_BUILD_NETWORKis correct — the PR body explains the structural dead-end with openshell's--fromcontract clearly. Host-side DNS fix is strictly more robust. - Test coverage is thorough — 14 new tests covering success, resolution failure, no output, null output, command override, error handling, detail truncation, and all the bridge IP edge cases (non-default, empty, garbage, IPv6, partial, thrown errors).
- All deps are injectable —
runCaptureImpl,outputOverride,commandoverride. Clean for testing.
Minor (non-blocking)
The busybox pull (--pull=missing) on a network where DNS is broken will itself fail/hang before the nslookup runs. The docker run timeout behavior depends on whether busybox is already cached locally. If it's not cached and DNS is broken, the probe may take longer than expected. Consider adding a --timeout or wrapping in a shell timeout to bound the probe duration. Not blocking since the common case (busybox cached from a prior Docker install) works fine.
LGTM.
|
Hi @ericksoa , please don't merge this PR now, I still need more enhancements. Thanks |
The systemd-resolved + /etc/docker/daemon.json snippet the preflight
hint used to print is Linux-specific — on macOS, Windows, or WSL
backed by Docker Desktop, those commands either don't exist or don't
touch the right daemon. Printing them anyway was misleading.
Gate the Linux-specific hint on host.platform === "linux" &&
!host.isWsl && host.systemctlAvailable (all already resolved by
assessHost). For other platforms print a minimal, correct pointer
instead:
- macOS: Docker Desktop → Settings → Docker Engine JSON.
- Windows / WSL: same, via Docker Desktop for Windows; with a
note for the rarer "native docker inside WSL"
setup that points back at the Linux fix.
- fallback (bsd…): generic "add dns to daemon.json and restart".
The probe itself is unchanged and stays platform-neutral
(`docker run busybox nslookup ...` works everywhere docker does).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
Four robustness improvements after review of the initial implementation: 1. Distinguish probe failure reasons. A failed `docker run busybox nslookup ...` could mean (a) DNS inside the container is blocked, (b) the DNS servers themselves are unreachable, (c) the busybox image couldn't be pulled, or (d) some other resolver error. The previous code lumped all of these under `resolution_failed` and printed a DNS-specific hint that was misleading for (c). Split into four reasons: `servers_unreachable`, `image_pull_failed`, `resolution_failed`, and `no_output`, with onboard's hint adapting its headline + hiding the #2101-specific advice when the problem is an image pull failure. 2. Bound the probe with `timeout 20` on Linux so preflight can't hang forever if docker itself is wedged. busybox nslookup already has a ~15 s internal budget; on macOS (no coreutils timeout by default) we rely on that alone rather than break portability. 3. Safer /etc/docker/daemon.json handling. The old hint overwrote the file, silently clobbering any existing config (registry-mirrors, log-driver, etc.). New hint backs up to daemon.json.bak-<timestamp>, then uses jq to merge the dns key into the existing JSON (falls back to writing a fresh file when none exists or jq isn't available). 4. Add a verification step at the end of every platform branch: docker run --rm busybox nslookup registry.npmjs.org so the user can confirm the fix before rerunning `nemoclaw onboard`. Also expands the WSL branch to inline the Linux fix for the "native docker inside WSL" case (previously just pointed at #2101 comments, which don't have the steps). Tests: 6 new vitest cases cover servers_unreachable, image_pull_failed (two variants), resolution_failed for NXDOMAIN, and the Linux/darwin platform-prefix branching. 56/56 preflight tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2742-2755: The Windows/WSL branch hard-codes "172.17.0.1" when
calling printLinuxFix, which breaks native WSL Docker installs that use a
non-default bridge; replace the literal with the same detected bridge gateway
variable used in the Linux branch (the bridge/bridgeGateway/bridgeIp variable
computed earlier) so printLinuxFix(hostBridgeIp, null) uses the actual detected
bridge IP; update the call in the host.platform === "win32" || host.isWsl branch
to pass that detected variable instead of the hard-coded string.
- Around line 2648-2769: The DNS probe currently treats dns.reason ===
"image_pull_failed" as a definitive DNS failure and calls process.exit(1), but
that only proves busybox:latest couldn't be pulled and can incorrectly block the
later pullAndResolveBaseImageDigest() for SANDBOX_BASE_IMAGE; modify the logic
in the DNS failure handling (the branch that checks dns.reason and the final
process.exit(1) call) to either retry the probe using the actual
SANDBOX_BASE_IMAGE or treat image_pull_failed as inconclusive by logging a
warning and continuing (i.e., avoid exiting) so the normal
pullAndResolveBaseImageDigest() can run; keep the existing DNS diagnostics
(dns.details, printLinuxFix, getDockerBridgeGatewayIp) intact but do not abort
the onboarding flow on image_pull_failed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 312dc159-ad5d-4268-beba-61de2cffc78f
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/preflight.test.tssrc/lib/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lib/preflight.ts
- src/lib/preflight.test.ts
…2101) Docker Desktop's "Settings → Docker Engine" JSON editor actually reads and writes ~/.docker/daemon.json, so a pure-CLI path exists and should be preferred over asking users to click through the GUI. Branch the macOS hint on the already-detected host.runtime: - Colima: `colima stop && colima start --dns <corp-dns-ip>` - Docker Desktop: edit ~/.docker/daemon.json with jq (backup first), then `osascript -e 'quit app "Docker"' && open -a Docker` to restart. - Unknown / podman / rancher: show both CLI options plus a pointer for the less-common runtimes. All three paths end at the same verification step (`docker run --rm busybox nslookup registry.npmjs.org`) that the earlier commit added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
Withdrawing approval — need to reconsider.
… binary The previous commit used a `timeout 20` prefix on the probe command on Linux only, skipping macOS (no GNU coreutils `timeout` by default) and Windows (has a same-named but unrelated `timeout.exe`). That left non-Linux users without any protection if the docker daemon wedged. Switch to Node's spawn-level `timeout: 20_000` option on the runCapture call. Node kills the child with SIGTERM when the timer fires; under `ignoreError: true` the result is simply an empty string, so the probe falls through cleanly to its `no_output` branch. `--rm` on the docker run cleans up any container that was started before the kill. Benefits: identical behavior on Linux / macOS / Windows / WSL, no host-side binary dependency, simpler code (drop the platform override from ProbeContainerDnsOpts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/preflight.ts`:
- Around line 907-913: The probe result check treats whitespace-only output as
valid; modify the output validation where the variable output is inspected (the
block returning {ok:false, reason:"no_output", details:...}) to also consider
strings that are only whitespace as empty — i.e., check output.trim() (or
output?.trim()) and return the same no_output response when the trimmed length
is zero so whitespace/newline-only runner output is handled as no_output instead
of falling through.
- Around line 846-857: The current logic in src/lib/preflight.ts assumes the
docker inspect output (captured by runCaptureImpl) is a single dotted-quad and
uses /^\d{1,3}(?:\.\d{1,3}){3}$/ to validate trimmed, but dual-stack bridge
outputs can concatenate IPv4+IPv6 (e.g., "172.17.0.1db8::1") and fail; change
the validation to extract the first IPv4 from the output instead of requiring an
exact match — e.g., after obtaining raw/trimmed, run a regex search for the
first IPv4 pattern (/\b\d{1,3}(?:\.\d{1,3}){3}\b/) and return that match if
present (return null otherwise), or alternatively alter the docker format to
emit a separator between gateways before trimming; update the logic around
runCaptureImpl/raw/trimmed to use the extracted IPv4.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4d617d04-7ff8-4961-9fe2-c43a2060b820
📒 Files selected for processing (2)
src/lib/preflight.test.tssrc/lib/preflight.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/preflight.test.ts
…tput
Two robustness fixes from CodeRabbit review:
1. getDockerBridgeGatewayIp: docker's `{{range .IPAM.Config}}{{.Gateway}}
{{end}}` template concatenates gateways without a separator. A
dual-stack bridge emits "172.17.0.1fd00:abcd::1" (or the reverse
order). The previous anchored regex
`/^\d{1,3}(?:\.\d{1,3}){3}$/` rejected that outright, and word-
boundary anchors don't help because "1" and "f" are both \w, and an
IPv6 gateway ending in a digit ("...::1") eats the would-be start
boundary of the IPv4. Search for the first dotted-quad anywhere in
the output instead, then sanity-check each octet is 0-255 to reject
garbage like "999.999.999.999".
2. probeContainerDns: the empty-output branch only tripped on null /
"". A killed child or docker leftover newline (" \n\n") would pass
the falsy check and fall through to the regex ladder, misreporting
as `resolution_failed`. Treat whitespace-only output the same as
empty.
4 new vitest cases cover the IPv4-first and IPv6-first dual-stack
concatenations, the multi-IPv4 ordering, and the whitespace-only
probe output. 60/60 preflight tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
The WSL branch's option 2 ("if you run docker natively inside WSL,
apply the Linux fix") previously hard-coded 172.17.0.1 in the
printf/daemon.json commands. Native docker inside WSL can have a
custom bridge subnet (via daemon.json `bip` or an explicit network
create) just like any other Linux host, so the printed snippet would
be wrong for those users.
Call getDockerBridgeGatewayIp() in the WSL branch too and plumb the
same detected / default / verify-with-this-command hint the Linux
branch already uses. Falls back to 172.17.0.1 when detection returns
null (e.g., Docker Desktop reflection that points elsewhere).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
…ures Per CodeRabbit review. The previous code exited(1) on every !dns.ok case, which over-blocks: - image_pull_failed only proves docker couldn't pull busybox. It does not prove DNS-from-container is broken. A host whose daemon can't reach Docker Hub right now but has cached images / pulls from ghcr.io (like sandbox-base) would be incorrectly stopped short of pullAndResolveBaseImageDigest() later in the flow. - no_output / error signal the probe itself couldn't run — either the runCapture call was killed by our timeout, docker daemon was momentarily wedged, or something else transient. Not a reliable signal that DNS is broken. Only servers_unreachable and resolution_failed actually observed a DNS failure: the first is the clean #2101 UDP:53-block signature, the second means the resolver answered but gave a non-answer. Both imply the sandbox build's `npm ci` will hit the same outcome (the 15-min hang this PR is trying to prevent), so they stay fatal with the full remediation hints. Everything else downgrades to a warning and preflight continues: ⚠ Container DNS probe inconclusive (reason: <...>). <details> Proceeding. If the sandbox build later hangs at `npm ci`, see issue #2101. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
ericksoa
left a comment
There was a problem hiding this comment.
Solid preflight addition. Turns a 15-minute cryptic npm hang into a 5-second targeted error with copy-pasteable remediation commands.
All feedback from the previous review round has been addressed:
- Timeout — 20s Node spawn-level timeout bounds the probe on every platform (no dependency on a host
timeoutbinary). Inconclusive results (timeout, image pull failure, errors) warn and proceed rather than blocking. - Probe classification — only
servers_unreachableandresolution_failedare fatal;image_pull_failed/no_output/errorare correctly treated as inconclusive. - Bridge IP detection —
getDockerBridgeGatewayIp()handles dual-stack bridge output (concatenated IPv4+IPv6 with no separator) by regex-extracting the first valid dotted-quad. WSL path now uses the detected IP instead of hardcoding172.17.0.1. - Test coverage — 16 new cases cover every classification path, bridge IP edge cases (non-default, empty, garbage, IPv6, dual-stack both orderings), command shape, and timeout propagation.
Platform-aware remediation text is well-crafted — the systemd-resolved / Colima / Docker Desktop / WSL branching gives users commands they can actually paste.
LGTM.
## Summary Catches up the user-facing docs for changes merged since v0.0.24 and bumps `docs/versions1.json` so the next docs build publishes as 0.0.25. ## Changes - `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T 39.x) entry to reflect the new conditional `br_netfilter` auto-load on R39 hosts that lack it (from #2419), and added a new "DNS resolution from inside docker fails (corporate firewall)" section covering the new container-DNS preflight probe and platform-specific remediation paths (from #2349). - `docs/get-started/quickstart.md`: added a "Review the Configuration Before the Sandbox Build" subsection documenting the new `[Y/n]` review gate that fires after the sandbox-name prompt (from #2221). - `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted 0.0.24, and trimmed to ten entries to match what `scripts/bump-version.ts` writes. `docs/project.json` regenerates from `versions1.json` at build time and now reports `0.0.25`. - `.agents/skills/nemoclaw-user-get-started/` and `.agents/skills/nemoclaw-user-reference/`: regenerated via `scripts/docs-to-skills.py`. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding now shows a configuration review and requires explicit confirmation before building a sandbox; non-interactive mode prints the summary without prompting. * **Documentation** * Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and conditional br_netfilter guidance. * Added corporate-firewall DNS troubleshooting with platform-specific remediation and a verification nslookup step. * Quickstart updated to document the new confirmation flow. * **Chores** * Documentation version index updated. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ound (NVIDIA#2101) (NVIDIA#2349) ## Summary Catch the NVIDIA#2101 failure mode (`npm ci` hangs ~15 min inside the sandbox build then prints the cryptic `Exit handler never called`) at preflight — before the user ever starts the build. Root cause: DNS resolution from inside a docker container fails on corp networks that block outbound UDP:53 to public resolvers. `systemd-resolved` on the host bypasses this via DNS-over-TLS (TCP:853), but the container's `/etc/resolv.conf` gets docker's fallback nameservers (`1.1.1.1` / `8.8.8.8`) and queries them over plain UDP:53 — which the firewall drops. After ~15 min of libuv retries, npm gives up but lingering handles keep the event loop alive past its exit hook, producing the cryptic error. Preflight now runs a ~5-second probe and, on a confirmed DNS failure, surfaces a targeted error with platform- and runtime-aware workarounds — converting a 15-minute silent hang into a 5-second actionable message. ## Related Issue Fixes NVIDIA#2101. ## Changes - **`src/lib/preflight.ts`** (+ ~130 lines) - `probeContainerDns()` — runs `docker run --rm --pull=missing busybox nslookup registry.npmjs.org`, parses the output, and returns a structured reason: `ok` / `servers_unreachable` / `resolution_failed` / `image_pull_failed` / `no_output` / `error`. Bounded by Node-level `timeout: 20_000` on the runCapture call (portable; no dependency on a host `timeout` binary). Whitespace-only output is treated as empty. - `getDockerBridgeGatewayIp()` — reads `docker network inspect bridge` and extracts the first IPv4 gateway (handles dual-stack bridges whose `{{range .IPAM.Config}}{{.Gateway}}{{end}}` template concatenates IPv4+IPv6 with no separator). Each octet is validated 0–255 to reject garbage. - **`src/lib/onboard.ts`** (+ ~120 lines in `preflight()`) - Wires the probe in right after the Docker-reachable check. - Only `servers_unreachable` and `resolution_failed` are fatal (the probe actually observed DNS failing); `image_pull_failed` / `no_output` / `error` warn and proceed (they're inconclusive — they don't prove container DNS is broken). - The fix hint is platform- and runtime-aware: - **Linux (systemd-resolved)**: `DNSStubListenerExtra=<bridge-ip>` in `/etc/systemd/resolved.conf.d/`, then a `jq`-safe merge into `/etc/docker/daemon.json` (backs up the existing file first with a timestamp suffix; falls back to a fresh file only when none existed or `jq` isn't available). - **macOS + Colima**: `colima stop && colima start --dns <corp-dns-ip>`. - **macOS + Docker Desktop**: edit `~/.docker/daemon.json` via `jq`, restart Docker via `osascript -e 'quit app "Docker"' && open -a Docker`. - **macOS + unknown runtime**: both CLI paths plus a pointer for Rancher Desktop / Podman. - **Windows / WSL**: Docker Desktop GUI plus an inlined Linux-native-docker-in-WSL fallback (also using the detected bridge IP). - **Other (BSD etc.)**: generic `daemon.json` hint. - Every path ends at a verification step: `docker run --rm busybox nslookup registry.npmjs.org`. - **`src/lib/preflight.test.ts`** (+ ~170 lines; 16 new cases) - Probe: `ok`, `resolution_failed` (NXDOMAIN), `servers_unreachable` (UDP:53 block), `image_pull_failed` (two shapes), `no_output` for empty / null / whitespace-only output, thrown errors, the 400-byte detail truncation, and cross-platform timeout via Node. - Bridge-IP parser: happy path, non-default IP, empty / null / garbage / IPv6 / partial output, IPv4-first and IPv6-first dual-stack concatenations, first-of-many selection, and runCapture-throws. **Zero changes to `bin/nemoclaw.js`** — the file is on the ts-migration guard's `runtimeMoves` list, and the new code deliberately lives outside the CLI runtime path so `scripts/check-legacy-migrated-paths.ts` passes clean. ## What the user sees on failure (Linux, `servers_unreachable`) ``` [1/8] Preflight checks ✓ Docker is running ✗ DNS resolution from inside a docker container failed. ;; connection timed out; no servers could be reached The sandbox build runs `npm ci` inside a container and needs to resolve registry.npmjs.org. On networks that block outbound UDP:53 to public DNS (common in corporate environments that force DNS-over-TLS on the host), the build appears to hang for ~15 minutes and then prints the cryptic `npm error Exit handler never called`. See issue NVIDIA#2101. Fix options: 1. Make systemd-resolved reachable from containers (recommended): sudo mkdir -p /etc/systemd/resolved.conf.d/ printf '[Resolve]\nDNSStubListenerExtra=172.17.0.1\n' | sudo tee /etc/systemd/resolved.conf.d/docker-bridge.conf sudo systemctl restart systemd-resolved Then add the dns key to /etc/docker/daemon.json (safely merges with existing config if jq is installed): sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak-$(date +%s) 2>/dev/null { sudo jq '. + {"dns":["172.17.0.1"]}' /etc/docker/daemon.json 2>/dev/null || echo '{"dns":["172.17.0.1"]}'; } | sudo tee /etc/docker/daemon.json.new >/dev/null sudo mv /etc/docker/daemon.json.new /etc/docker/daemon.json sudo systemctl restart docker 2. Configure an explicit UDP:53-capable DNS in /etc/docker/daemon.json (ask your IT team for an internal DNS server IP). Verify the fix worked: docker run --rm busybox nslookup registry.npmjs.org ``` macOS (Docker Desktop / Colima), Windows/WSL (Docker Desktop + WSL-native-docker fallback), and an "other platforms" generic branch print runtime-appropriate variants of the above — no pointer to a GUI unless there's no CLI alternative. ## Why just the preflight probe, not also an opt-in `--network=host` for the build An earlier iteration of this branch also added a `NEMOCLAW_DOCKER_BUILD_NETWORK=host` escape hatch: when set, NemoClaw would `docker build --network=host` locally and hand openshell the pre-built image via `--from <tag>` so the `npm ci` step could bypass the broken UDP:53 path. Testing on a real corp-firewalled Jetson showed this is a structural dead-end under openshell's current CLI contract: - `openshell sandbox create --from <Dockerfile>` → builds the image **and pushes it into the gateway cluster**. - `openshell sandbox create --from <image-tag>` → treats the tag as already registry-reachable and **skips the push**. A locally-built image reference satisfies neither: the gateway can't pull it from any registry, the sandbox pod fails to start, nemoclaw deletes the not-Ready sandbox, and the user ends up worse off than before (build succeeds but onboard doesn't complete). There's no `openshell image push` / `image load` command to bridge the gap short of `docker save | docker exec gateway ctr images import`, which reaches into openshell internals and breaks on version bumps. A proper fix belongs upstream in openshell (e.g. a `--build-network` flag that applies during its own internal build). The host-side remediation this PR surfaces is also strictly more complete: it covers **every** docker container on the host (gateway, sandbox, ad-hoc `docker run`s), not just the single build invocation that an opt-in env-var path would reach. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes — *not run locally: prek binary download returns HTTP 400 in my sandbox environment; relying on CI to validate.* - [x] `npm test` — 60/60 tests in `src/lib/preflight.test.ts` pass. - [x] Tests added or updated for new or changed behavior. - [x] No secrets, API keys, or credentials committed. - [x] Docs updated for user-facing behavior changes — *the new preflight error text is the user-facing surface; no separate docs page needed.* **Additional local checks:** - `npm run build:cli` — clean. - `npm run typecheck` / `npm run typecheck:cli` — clean. - `npm run lint` — clean for the three changed files. - `npm run ts-migration:guard -- --base origin/main --head HEAD` — `No edits to migrated legacy paths or removed compatibility shims detected.` - **End-to-end reproduction**: on a Jetson Orin (JetPack R39 / kernel 6.8.12-1018-tegra / Docker 29.4.0 / Node 22.22.2 / npm 10.9.7) attached to the NVIDIA corp network, `nemoclaw onboard` on `main` hangs at `RUN npm ci && npm run build` for 922 s and fails with `Exit handler never called`. Live `strace` on the hung libuv worker shows it cycling `sendto(..., registry.npmjs.org ..., MSG_NOSIGNAL) / ppoll(timeout=5s) / close / socket / connect(1.1.1.1:53)`. With this PR, preflight instead prints the error above in ~5 seconds. Applying option 1's snippet lets `npm ci` complete in ~5 s and the sandbox image build finishes normally. ## AI Disclosure - [x] AI-assisted — tool: Claude Code <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding now performs a container DNS connectivity check, provides targeted diagnostics with platform-aware remediation tips when DNS fails, and halts onboarding immediately for fatal DNS issues. * Detects container network gateway to tailor troubleshooting guidance. * **Tests** * Added unit tests covering DNS probing, gateway detection, failure classification, command behavior, and diagnostic output trimming. <!-- end of auto-generated comment: release notes by coderabbit.ai --> <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com> --------- Signed-off-by: Yanyun Liao <yanyunl@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
## Summary Catches up the user-facing docs for changes merged since v0.0.24 and bumps `docs/versions1.json` so the next docs build publishes as 0.0.25. ## Changes - `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T 39.x) entry to reflect the new conditional `br_netfilter` auto-load on R39 hosts that lack it (from NVIDIA#2419), and added a new "DNS resolution from inside docker fails (corporate firewall)" section covering the new container-DNS preflight probe and platform-specific remediation paths (from NVIDIA#2349). - `docs/get-started/quickstart.md`: added a "Review the Configuration Before the Sandbox Build" subsection documenting the new `[Y/n]` review gate that fires after the sandbox-name prompt (from NVIDIA#2221). - `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted 0.0.24, and trimmed to ten entries to match what `scripts/bump-version.ts` writes. `docs/project.json` regenerates from `versions1.json` at build time and now reports `0.0.25`. - `.agents/skills/nemoclaw-user-get-started/` and `.agents/skills/nemoclaw-user-reference/`: regenerated via `scripts/docs-to-skills.py`. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding now shows a configuration review and requires explicit confirmation before building a sandbox; non-interactive mode prints the summary without prompting. * **Documentation** * Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and conditional br_netfilter guidance. * Added corporate-firewall DNS troubleshooting with platform-specific remediation and a verification nslookup step. * Quickstart updated to document the new confirmation flow. * **Chores** * Documentation version index updated. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary Catches up the user-facing docs for changes merged since v0.0.24 and bumps `docs/versions1.json` so the next docs build publishes as 0.0.25. ## Changes - `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T 39.x) entry to reflect the new conditional `br_netfilter` auto-load on R39 hosts that lack it (from NVIDIA#2419), and added a new "DNS resolution from inside docker fails (corporate firewall)" section covering the new container-DNS preflight probe and platform-specific remediation paths (from NVIDIA#2349). - `docs/get-started/quickstart.md`: added a "Review the Configuration Before the Sandbox Build" subsection documenting the new `[Y/n]` review gate that fires after the sandbox-name prompt (from NVIDIA#2221). - `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted 0.0.24, and trimmed to ten entries to match what `scripts/bump-version.ts` writes. `docs/project.json` regenerates from `versions1.json` at build time and now reports `0.0.25`. - `.agents/skills/nemoclaw-user-get-started/` and `.agents/skills/nemoclaw-user-reference/`: regenerated via `scripts/docs-to-skills.py`. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding now shows a configuration review and requires explicit confirmation before building a sandbox; non-interactive mode prints the summary without prompting. * **Documentation** * Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and conditional br_netfilter guidance. * Added corporate-firewall DNS troubleshooting with platform-specific remediation and a verification nslookup step. * Quickstart updated to document the new confirmation flow. * **Chores** * Documentation version index updated. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Catch the #2101 failure mode (
npm cihangs ~15 min inside the sandbox build then prints the crypticExit handler never called) at preflight — before the user ever starts the build.Root cause: DNS resolution from inside a docker container fails on corp networks that block outbound UDP:53 to public resolvers.
systemd-resolvedon the host bypasses this via DNS-over-TLS (TCP:853), but the container's/etc/resolv.confgets docker's fallback nameservers (1.1.1.1/8.8.8.8) and queries themover plain UDP:53 — which the firewall drops. After ~15 min of libuv retries, npm gives up but lingering handles keep the event loop alive past its exit hook, producing the cryptic error.
Preflight now runs a ~5-second probe and, on a confirmed DNS failure, surfaces a targeted error with platform- and runtime-aware workarounds — converting a 15-minute silent hang into a 5-second actionable message.
Related Issue
Fixes #2101.
Changes
src/lib/preflight.ts(+ ~130 lines)probeContainerDns()— runsdocker run --rm --pull=missing busybox nslookup registry.npmjs.org, parses the output, and returns a structured reason:ok/servers_unreachable/resolution_failed/image_pull_failed/no_output/error. Bounded by Node-leveltimeout: 20_000on the runCapture call(portable; no dependency on a host
timeoutbinary). Whitespace-only output is treated as empty.getDockerBridgeGatewayIp()— readsdocker network inspect bridgeand extracts the first IPv4 gateway (handles dual-stack bridges whose{{range .IPAM.Config}}{{.Gateway}}{{end}}template concatenates IPv4+IPv6 with no separator). Each octet is validated 0–255 to reject garbage.src/lib/onboard.ts(+ ~120 lines inpreflight())servers_unreachableandresolution_failedare fatal (the probe actually observed DNS failing);image_pull_failed/no_output/errorwarn and proceed (they're inconclusive — they don't prove container DNS is broken).DNSStubListenerExtra=<bridge-ip>in/etc/systemd/resolved.conf.d/, then ajq-safe merge into/etc/docker/daemon.json(backs up the existing file first with a timestamp suffix; falls back to a fresh file only when none existed orjqisn't available).colima stop && colima start --dns <corp-dns-ip>.~/.docker/daemon.jsonviajq, restart Docker viaosascript -e 'quit app "Docker"' && open -a Docker.daemon.jsonhint.docker run --rm busybox nslookup registry.npmjs.org.src/lib/preflight.test.ts(+ ~170 lines; 16 new cases)ok,resolution_failed(NXDOMAIN),servers_unreachable(UDP:53 block),image_pull_failed(two shapes),no_outputfor empty / null / whitespace-only output, thrown errors, the 400-byte detail truncation, and cross-platform timeout via Node.Zero changes to
bin/nemoclaw.js— the file is on the ts-migration guard'sruntimeMoveslist, and the new code deliberately lives outside the CLI runtime path soscripts/check-legacy-migrated-paths.tspasses clean.What the user sees on failure (Linux,
servers_unreachable)macOS (Docker Desktop / Colima), Windows/WSL (Docker Desktop + WSL-native-docker fallback), and an "other platforms" generic branch print runtime-appropriate variants of the above — no pointer to a GUI unless there's no CLI alternative.
Why just the preflight probe, not also an opt-in
--network=hostfor the buildAn earlier iteration of this branch also added a
NEMOCLAW_DOCKER_BUILD_NETWORK=hostescape hatch: when set, NemoClaw woulddocker build --network=hostlocally and hand openshell the pre-built image via--from <tag>so thenpm cistep could bypass the broken UDP:53 path. Testing on a real corp-firewalled Jetsonshowed this is a structural dead-end under openshell's current CLI contract:
openshell sandbox create --from <Dockerfile>→ builds the image and pushes it into the gateway cluster.openshell sandbox create --from <image-tag>→ treats the tag as already registry-reachable and skips the push.A locally-built image reference satisfies neither: the gateway can't pull it from any registry, the sandbox pod fails to start, nemoclaw deletes the not-Ready sandbox, and the user ends up worse off than before (build succeeds but onboard doesn't complete). There's no
openshell image push/image loadcommand tobridge the gap short of
docker save | docker exec gateway ctr images import, which reaches into openshell internals and breaks on version bumps. A proper fix belongs upstream in openshell (e.g. a--build-networkflag that applies during its own internal build).The host-side remediation this PR surfaces is also strictly more complete: it covers every docker container on the host (gateway, sandbox, ad-hoc
docker runs), not just the single build invocation that an opt-in env-var path would reach.Type of Change
Verification
npx prek run --all-filespasses — not run locally: prek binary download returns HTTP 400 in my sandbox environment; relying on CI to validate.npm test— 60/60 tests insrc/lib/preflight.test.tspass.Additional local checks:
npm run build:cli— clean.npm run typecheck/npm run typecheck:cli— clean.npm run lint— clean for the three changed files.npm run ts-migration:guard -- --base origin/main --head HEAD—No edits to migrated legacy paths or removed compatibility shims detected.nemoclaw onboardonmainhangs atRUN npm ci && npm run buildfor 922 s and fails withExit handler never called. Livestraceon the hunglibuv worker shows it cycling
sendto(..., registry.npmjs.org ..., MSG_NOSIGNAL) / ppoll(timeout=5s) / close / socket / connect(1.1.1.1:53). With this PR, preflight instead prints the error above in ~5 seconds. Applying option 1's snippet letsnpm cicomplete in ~5 s and the sandbox image build finishes normally.AI Disclosure
Summary by CodeRabbit
New Features
Tests
Signed-off-by: Yanyun Liao yanyunl@nvidia.com