fix(preflight): use uncached .invalid probe for container DNS check (#3630) #3651
…VIDIA#3630)

Closes NVIDIA#3630.

The container DNS preflight queried `registry.npmjs.org`, whose answer Docker's embedded DNS resolver (and most upstream resolvers) caches aggressively. On a host where outbound UDP/TCP port 53 was blocked at the iptables OUTPUT chain, the cached resolution still made the probe report `✓ Container DNS resolution works` and onboard continued past preflight, even though the sandbox would later fail to pull npm packages or anything else needing fresh DNS.

Replace the probe target with a random subdomain of the RFC 6761 reserved `.invalid` TLD. Properties:

- Never cached anywhere: each call uses a fresh `nemoclaw-dns-probe-<16-hex>.invalid` name, so the query always round-trips through the upstream resolver.
- Always NXDOMAIN: every compliant resolver returns NXDOMAIN immediately for any `.invalid` name (RFC 6761 §6.4). NXDOMAIN proves the resolver was reached even though the name does not resolve, which is the only invariant the probe needs.
- Bypasses the cache hit that previously masked host-side egress blocks.

The success regex is widened to accept both shapes busybox emits:

- `Name:` + `Address:` lines for a real A-record resolution (kept for back-compat with custom probe names that resolve to a real IP).
- `Server:` header + `** server can't find <name>: NXDOMAIN` for the `.invalid` probe sent by default.

The resolver-identification block (`Server: <ip> / Address: <ip>:53`) appears in every busybox response, including malformed ones, so the new regex requires either a real `Name:` line OR an NXDOMAIN response body before declaring success. A regression test pins the malformed case so this guard cannot quietly weaken.

Tests: `src/lib/onboard/preflight.test.ts` adds the NVIDIA#3630 NXDOMAIN success case, the malformed-response failure case, the random probe-name property, and a pinned-name override seam. 18/18 probeContainerDns tests pass. Existing success/servers_unreachable/image_pull_failed/no_output/error cases pass unchanged.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
## Summary

Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs.

## Changes

- Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates.
- Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation.
- Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs.
- Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets.
- Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes.
- Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior.
- Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes.
- Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled.
- Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall.
- Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe.
- Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs.
- Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`.
- Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs.
- Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks.
- Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked.

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

## Summary by CodeRabbit

* **Documentation**
  * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts.
  * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps.
  * Added documentation for overriding managed vLLM model selection.
  * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements.
* **Version Bump**
  * Released v0.0.45.
Summary
Closes #3630. The `[1/8] Preflight checks` step ran `docker run --rm --pull=missing busybox:latest nslookup registry.npmjs.org` and treated `Name: registry.npmjs.org` + `Address: <ip>` as proof that container DNS works. On a host where outbound UDP/TCP port 53 was blocked at the iptables `OUTPUT` chain, that probe still passed because Docker's embedded DNS resolver (and the upstream resolver before it) had cached the `registry.npmjs.org` answer from earlier successful lookups. Onboarding then continued past preflight and only hit the real DNS failure deeper in the sandbox build, where the error is much less actionable.

Replace the probe target with a random subdomain of the RFC 6761 reserved `.invalid` TLD, e.g., `nemoclaw-dns-probe-9c4f02a8b1d3e6f7.invalid`. Three properties make this resilient to the cache hit that masked #3630:

- Never cached: each call uses a fresh `nemoclaw-dns-probe-<16-hex>.invalid` name, so the query always round-trips through the upstream resolver.
- Always NXDOMAIN: every compliant resolver returns NXDOMAIN immediately for any `.invalid` name (RFC 6761 §6.4), so the probe succeeds quickly when DNS works without depending on any specific A record being reachable from the container.
- Blocked egress is detected: when the resolver cannot be reached, busybox prints `;; connection timed out; no servers could be reached` (the existing `servers_unreachable` signature) and the probe fails fast.

The success regex is widened to accept both shapes busybox emits:

- `Server: ... / Name: <name> / Address: <ip>` for a real A-record resolution. Kept for back-compat with callers that pass a custom probe name resolving to a real IP.
- `Server: ... / ** server can't find <name>: NXDOMAIN` for the `.invalid` probe sent by default.

The resolver-identification block (`Server: <ip> / Address: <ip>:53`) appears at the top of every busybox response, including malformed ones, so the new regex requires either a real `Name:` line OR an NXDOMAIN response body before declaring success. A regression test pins the malformed case (`;; reply from unexpected source: ...`) so this guard cannot quietly weaken.

Related Issue
Closes #3630
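For readers who want to see the probe-name idea from the Summary in concrete form, here is a minimal sketch. The names `dnsProbeNameSketch` and `probeCommandSketch` are hypothetical and are not the PR's actual code; the sketch assumes Node's `crypto.randomBytes` as the randomness source, which the PR text does not confirm. Only the name shape and the `docker run` command line are taken from the description.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical sketch: 8 random bytes encode to the 16 hex characters in
// nemoclaw-dns-probe-<16-hex>.invalid. The real dnsProbeName() may differ.
function dnsProbeNameSketch(): string {
  return `nemoclaw-dns-probe-${randomBytes(8).toString("hex")}.invalid`;
}

// Hypothetical sketch: builds the probe command quoted in the PR description.
// A caller (for example a test) can pin the name; otherwise a fresh one is used.
function probeCommandSketch(probeName: string = dnsProbeNameSketch()): string {
  return `docker run --rm --pull=missing busybox:latest nslookup ${probeName} 2>&1`;
}

console.log(probeCommandSketch());
```

Because every call draws a new name, no resolver along the path can have a cached answer for it, which is the property the fix relies on.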
Changes
- `src/lib/onboard/preflight.ts`:
  - New `dnsProbeName()` helper returns a random `nemoclaw-dns-probe-<16-hex>.invalid` name (exported for the test seam; production callers never override).
  - `ProbeContainerDnsOpts` gains an optional `probeName?: string` to support pinned-name assertions in tests.
  - The probe command is now `docker run --rm --pull=missing busybox:latest nslookup ${probeName} 2>&1`.
  - The success regex requires `^Server:` AND either `Name:` + `Address:` or `server can't find ... NXDOMAIN`.
- `src/lib/onboard/preflight.test.ts`:
  - NXDOMAIN success case for #3630.
  - Malformed-response failure case (`;; reply from unexpected source: ...`).
  - Random `.invalid` probe-name property (every call returns a distinct name matching the `nemoclaw-dns-probe-<16-hex>.invalid` shape).
  - Pinned `probeName` override seam.
  - … `nslookup nemoclaw-dns-probe-<...>.invalid` prefix.

Type of Change
Verification
- `npx vitest run src/lib/onboard/preflight.test.ts -t "probeContainerDns"`: 18/18 pass
  - NXDOMAIN response returns `ok: true`
  - `;; reply from unexpected source` returns `resolution_failed`
  - `dnsProbeName()` returns distinct random `.invalid` names per call
  - `probeName` override is wired into the spawned script
- Probe of a `.invalid` name: `Server: 192.168.65.7 / Address: 192.168.65.7:53 / ** server can't find <name>: NXDOMAIN`
- Unreachable resolver (`--dns 240.0.0.0` to simulate the iptables DROP path): `;; connection timed out; no servers could be reached`
- Existing `Name: registry.npmjs.org` probe-output tests pass unchanged: the broader success regex still accepts the old shape for back-compat with custom probe names.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>
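The output shapes listed under Verification can be told apart with a classifier along the following lines. This is a sketch under stated assumptions, not the regex shipped in `preflight.ts`: the function name `classifyNslookupSketch`, the result type, and the exact patterns are hypothetical. Only the `ok: true`, `servers_unreachable`, `no_output`, and `resolution_failed` labels and the busybox output fragments come from the PR text.

```typescript
// Hypothetical result shape; the real probe may carry more fields and reasons.
type DnsProbeResult =
  | { ok: true }
  | { ok: false; reason: "servers_unreachable" | "no_output" | "resolution_failed" };

function classifyNslookupSketch(output: string): DnsProbeResult {
  if (output.trim() === "") return { ok: false, reason: "no_output" };
  if (/no servers could be reached/.test(output)) {
    return { ok: false, reason: "servers_unreachable" };
  }
  // The Server:/Address: header appears in every busybox response, so it is
  // necessary but not sufficient for success.
  const sawServer = /^Server:/m.test(output);
  // Shape 1: a real answer, i.e. a Name: line followed later by an Address: line.
  const resolved = /^Name:\s*\S+[\s\S]*?^Address(?: \d+)?:\s*\S+/m.test(output);
  // Shape 2: the resolver answered NXDOMAIN, which proves it was reached.
  const nxdomain = /server can't find \S+: NXDOMAIN/.test(output);
  if (sawServer && (resolved || nxdomain)) return { ok: true };
  return { ok: false, reason: "resolution_failed" };
}

const header = "Server:\t\t192.168.65.7\nAddress:\t192.168.65.7:53\n\n";
console.log(classifyNslookupSketch(header + "** server can't find x.invalid: NXDOMAIN\n"));
console.log(classifyNslookupSketch(header + ";; reply from unexpected source: 10.0.0.1#53\n"));
```

Requiring the header plus one of the two answer shapes is what keeps a malformed response, which still carries the `Server:` block, from being reported as success.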