Skip to content

fix(preflight): use uncached .invalid probe for container DNS check (#3630)#3651

Merged
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/preflight-dns-uncached-probe
May 17, 2026
Merged

fix(preflight): use uncached .invalid probe for container DNS check (#3630)#3651
cv merged 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/preflight-dns-uncached-probe

Conversation

@latenighthackathon

@latenighthackathon latenighthackathon commented May 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #3630. The [1/8] Preflight checks step ran docker run --rm --pull=missing busybox:latest nslookup registry.npmjs.org and treated Name: registry.npmjs.org + Address: <ip> as proof that container DNS works. On a host where outbound UDP/TCP port 53 was blocked at the iptables OUTPUT chain, that probe still passed because Docker's embedded DNS resolver (and the upstream resolver before it) had cached the registry.npmjs.org answer from earlier successful lookups. Onboarding then continued past preflight and only hit the real DNS failure deeper in the sandbox build, where the error is much less actionable.

Replace the probe target with a random subdomain of the RFC 6761 reserved .invalid TLD, e.g., nemoclaw-dns-probe-9c4f02a8b1d3e6f7.invalid. Three properties make this resilient to the cache hit that masked #3630:

  • The random suffix is never cached anywhere — Docker's embedded resolver, the host's stub resolver, and any upstream caching layer see a brand-new name on every probe and have to forward the query.
  • Every compliant resolver returns NXDOMAIN immediately for any .invalid name (RFC 6761 §6.4), so the probe succeeds quickly when DNS works without depending on any specific A record being reachable from the container.
  • NXDOMAIN is genuine round-trip evidence: the resolver received the query, decided it does not exist, and answered. If host egress is blocked, busybox prints ;; connection timed out; no servers could be reached (the existing servers_unreachable signature) and the probe fails fast.

The success regex is widened to accept both shapes busybox emits:

  1. Server: ... / Name: <name> / Address: <ip> for a real A-record resolution. Kept for back-compat with callers that pass a custom probe name resolving to a real IP.
  2. Server: ... / ** server can't find <name>: NXDOMAIN for the .invalid probe sent by default.

The resolver-identification block (Server: <ip> / Address: <ip>:53) appears at the top of every busybox response, including malformed ones, so the new regex requires either a real Name: line OR an NXDOMAIN response body before declaring success. A regression test pins the malformed case (;; reply from unexpected source: ...) so this guard cannot quietly weaken.

Related Issue

Closes #3630

Changes

  • src/lib/onboard/preflight.ts:
    • New dnsProbeName() helper returns a random nemoclaw-dns-probe-<16-hex>.invalid name (exported for the test seam; production callers never override).
    • ProbeContainerDnsOpts gains an optional probeName?: string to support pinned-name assertions in tests.
    • Probe command is now templated as docker run --rm --pull=missing busybox:latest nslookup ${probeName} 2>&1.
    • Success-path regex requires ^Server: AND either Name:+Address: or server can't find ... NXDOMAIN.
  • src/lib/onboard/preflight.test.ts:
    • The previous "NXDOMAIN → resolution_failed" assertion is replaced by "NXDOMAIN → ok" (the new contract).
    • New regression test for the malformed-response failure case (;; reply from unexpected source: ...).
    • New test for the random .invalid probe-name property (every call returns a distinct name matching the nemoclaw-dns-probe-<16-hex>.invalid shape).
    • New test for the probeName override seam.
    • Existing tests for the spawned command updated to assert the new nslookup nemoclaw-dns-probe-<...>.invalid prefix.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx vitest run src/lib/onboard/preflight.test.ts -t "probeContainerDns" — 18/18 pass
  • Verified busybox output shapes on a working host:
    • DNS works + .invalid name: Server: 192.168.65.7 / Address: 192.168.65.7:53 / ** server can't find <name>: NXDOMAIN
    • DNS broken (via --dns 240.0.0.0 to simulate the iptables DROP path): ;; connection timed out; no servers could be reached
  • No regression in existing Name: registry.npmjs.org probe-output tests — the broader success regex still accepts the old shape for back-compat with custom probe names.

Signed-off-by: latenighthackathon support@latenighthackathon.com

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced DNS preflight probe validation to more reliably detect container resolver connectivity and distinguish between different resolution failure modes, improving overall container health checks.
  • Tests

    • Updated preflight DNS probe test suite with comprehensive coverage of enhanced validation logic and configuration scenarios.

Review Change Stack

…VIDIA#3630)

Closes NVIDIA#3630. The container DNS preflight queried `registry.npmjs.org`,
whose answer Docker's embedded DNS resolver (and most upstream resolvers)
caches aggressively. On a host where outbound UDP/TCP port 53 was
blocked at the iptables OUTPUT chain, the cached resolution still made
the probe report `✓ Container DNS resolution works` and onboard
continued past preflight, even though the sandbox would later fail to
pull npm packages or anything else needing fresh DNS.

Replace the probe target with a random subdomain of the RFC 6761
reserved `.invalid` TLD. Properties:

- Never cached anywhere: each call uses a fresh
  `nemoclaw-dns-probe-<16-hex>.invalid` name, so the query always
  round-trips through the upstream resolver.
- Always NXDOMAIN: every compliant resolver returns NXDOMAIN
  immediately for any `.invalid` name (RFC 6761 §6.4). NXDOMAIN
  proves the resolver was reached even though the name does not
  resolve, which is the only invariant the probe needs.
- Bypasses the cache hit that previously masked host-side egress
  blocks.

The success regex is widened to accept both shapes busybox emits:
- `Name:` + `Address:` lines for a real A-record resolution (kept for
  back-compat with custom probe names that resolve to a real IP).
- `Server:` header + `** server can't find <name>: NXDOMAIN` for the
  `.invalid` probe sent by default.

The resolver-identification block (`Server: <ip> / Address: <ip>:53`)
appears in every busybox response, including malformed ones, so the
new regex requires either a real `Name:` line OR an NXDOMAIN response
body before declaring success. A regression test pins the malformed
case so this guard cannot quietly weaken.

Tests: `src/lib/onboard/preflight.test.ts` adds the NVIDIA#3630 NXDOMAIN
success case, the malformed-response failure case, the random
probe-name property, and a pinned-name override seam. 18/18
probeContainerDns tests pass. Existing success/servers_unreachable/
image_pull_failed/no_output/error cases pass unchanged.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented May 16, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 16, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ad9eab20-d163-42ab-9959-610d3e7f2fdc

📥 Commits

Reviewing files that changed from the base of the PR and between 17965a7 and 0488130.

📒 Files selected for processing (2)
  • src/lib/onboard/preflight.test.ts
  • src/lib/onboard/preflight.ts

📝 Walkthrough

Walkthrough

This PR updates the container DNS preflight probe to generate randomized .invalid subdomain names instead of querying registry.npmjs.org. A new dnsProbeName() helper generates the probes, ProbeContainerDnsOpts gains an optional probeName override, and success detection is broadened to accept NXDOMAIN responses as proof of resolver reachability. Tests are extended to cover randomization, override behavior, and new response patterns.

Changes

DNS Probe Randomization

Layer / File(s) Summary
DNS probe name generation and options extension
src/lib/onboard/preflight.ts
Import randomBytes and add exported dnsProbeName() function that returns a randomized nemoclaw-dns-probe-<hex>.invalid name; extend ProbeContainerDnsOpts interface with optional probeName field.
Probe execution and success detection
src/lib/onboard/preflight.ts
Update probeContainerDns default command to run nslookup against the generated probe name instead of registry.npmjs.org; revise DNS success detection to recognize NXDOMAIN or standard Name:/Address: responses as resolver reachability.
Test coverage for probe randomization and success detection
src/lib/onboard/preflight.test.ts
Import dnsProbeName, add NXDOMAIN-based success and malformed-response failure test cases, update script-capture assertions, and add tests validating random probe name generation and probeName override behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

fix, Docker, NemoClaw CLI, Getting Started, Sandbox, bug

Suggested reviewers

  • cv

Poem

🐰 A probe hops through the .invalid domain,
Random hashes dancing, never the same—
No registry.npmjs to rely upon,
Just NXDOMAIN whispers when names are gone,
Preflight checks now cached-proof and free! 🎯

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: introducing an uncached .invalid probe for the container DNS preflight check, which directly addresses the issue (#3630) described in the PR objectives.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@cv cv merged commit 7fbf47e into NVIDIA:main May 17, 2026
16 checks passed
@latenighthackathon latenighthackathon deleted the fix/preflight-dns-uncached-probe branch May 18, 2026 05:10
ericksoa pushed a commit that referenced this pull request May 18, 2026
## Summary
Updates the NemoClaw documentation for the v0.0.45 release by
summarizing the user-facing changes merged since v0.0.44 and bumping the
docs version metadata.
Refreshes generated user skills so agent-facing references match the
source docs.

## Changes
- Added v0.0.45 release notes covering onboarding recovery, local
inference, channel cleanup, share mount diagnostics, uninstall cleanup,
and security redaction updates.
- Updated command and troubleshooting docs for sandbox name limits, GPU
gateway reuse, DNS preflight behavior, channel removal cleanup, and
share mount path validation.
- Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user
skills from the docs.
- Source summary: #3672 -> `docs/reference/commands.md`: documented
channel removal detaching bridge providers and un-applying channel
policy presets.
- Source summary: #3678 -> `docs/about/release-notes.md`: documented
Ollama streamed usage accounting in the release notes.
- Source summary: #3670 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`: documented safe GPU gateway
replacement behavior.
- Source summary: #3664 -> `docs/about/release-notes.md`: documented
blueprint permission normalization in the release notes.
- Source summary: #3181 -> `docs/reference/troubleshooting.md`:
documented GPU toolkit guidance when host drivers work but passthrough
is disabled.
- Source summary: #3554 -> `docs/about/release-notes.md`: documented
host `openshell-gateway` cleanup during uninstall.
- Source summary: #3651 -> `docs/reference/troubleshooting.md`:
documented the uncached `.invalid` DNS preflight probe.
- Source summary: #3643 -> `docs/reference/commands.md`: included
existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated
docs.
- Source summary: #3647 -> `docs/reference/commands.md`: documented
remote sandbox path verification for `share mount`.
- Source summary: #3646 -> `docs/reference/commands.md`: included
existing local writable mount target guidance in generated docs.
- Source summary: #3642 -> `docs/inference/use-local-inference.md`,
`docs/reference/commands.md`: documented managed-vLLM model override and
gated-model token checks.
- Source summary: #3639 -> `docs/reference/commands.md`: documented the
63-character sandbox name limit.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

Commit hooks passed for the staged files. A standalone `npx prek run
--all-files` attempt was blocked by sandbox access to
`/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left
unchecked.

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Enhanced CLI command reference documentation with clearer guidance on
onboarding, GPU passthrough, inference configuration, channel removal,
and shared mounts.
* Improved troubleshooting sections with better DNS resolution and GPU
passthrough remediation steps.
  * Added documentation for overriding managed vLLM model selection.
* Updated release notes for v0.0.45 reflecting infrastructure and
workflow improvements.

* **Version Bump**
  * Released v0.0.45.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3755?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Nemoclaw][All Platforms]Preflight DNS check passes as “Container DNS resolution works” even when host outbound DNS is fully blocked via iptables

3 participants