Skip to content

fix(net): allowlist DNS resilience by retrying, fail closed and allow longest-suffix wins#645

Open
ldarren wants to merge 3 commits into
boxlite-ai:mainfrom
ldarren:fix/gvproxy-allownet-dns-resilience
Open

fix(net): allowlist DNS resilience by retrying, fail closed and allow longest-suffix wins#645
ldarren wants to merge 3 commits into
boxlite-ai:mainfrom
ldarren:fix/gvproxy-allownet-dns-resilience

Conversation

@ldarren

@ldarren ldarren commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Issue

When allow_net is set, allow-listed hostnames sometimes resolve to 0.0.0.0 inside the guest, even though the host is reachable and box.create reports success. The failure is intermittent and silent — the only signal is allow_zones=N-1 total_zones=N in the gvproxy log.

# inside the guest
$ getent hosts iapi.example.com
0.0.0.0   iapi.example.com

These are the root causes and fixes:

  1. Host-side resolution had no retry or timeout — one transient hiccup (VPN flap, slow corp DNS) silently dropped the zone.
  2. When resolution definitively failed, box.create proceeded with a half-baked sinkhole instead of failing loudly.
  3. Zones were emitted in Go map-iteration order. gvisor-tap-vsock matches first-suffix-wins, so an subdomain.domain.com. query could land on a com. zone (created for github.com) before the domain.com. zone and return that zone's DefaultIP=0.0.0.0.

Changes

  • Retry host-side LookupIPAddr with exponential backoff (4 attempts: 100/300/900/2700 ms) and a 2 s per-attempt deadline
  • Fail-closed: propagate resolver errors through buildAllowNetDNSZonesbuildTapConfiggvproxy_create, surfaced via the new errOut channel from fix(net): surface gvproxy bind errors through FFI #612
  • Sort zones longest-name-first so the most-specific suffix wins; catch-all root zone stays last

Summary by CodeRabbit

  • New Features

    • DNS lookups during box creation now use automatic retries with exponential backoff and per-attempt timeouts.
    • DNS resolution is signal-aware and cancels promptly on SIGINT/SIGTERM.
  • Bug Fixes

    • Deterministic longest-suffix DNS zone ordering to avoid incorrectly sinkholing allow-listed hosts.
    • Box creation fails explicitly if allow-list hostnames cannot be resolved.
  • Documentation

    • Added troubleshooting and reference guidance for allow_net DNS behavior and recovery.
  • Tests

    • Expanded tests covering retries, timeouts, cancellation, and ordering.
  • Chores

    • Log build identity once per process.

@DorianZheng DorianZheng requested review from DorianZheng and G4614 June 3, 2026 05:23
@DorianZheng

Copy link
Copy Markdown
Member

Small scope nit on logBridgeBuild / bridgeBuildID: nice debugging aid, but it's unrelated to the allow_net fix and the hand-bumped "...allownet-dns-resilience-2026-06" constant will go stale (nobody will remember to bump it). Consider either splitting it into its own PR, or dropping the constant and logging just the auto-derived vcs.revision/vcs.modified/vcs.time fields so it never lies. The retry + fail-closed + longest-suffix-sort changes all look great. 👍

G4614 added a commit to G4614/boxlite that referenced this pull request Jun 3, 2026
…erage

Fixes the broken `LongestSuffixWinsBeforeRoot` test (it called the real
resolver and died with NXDOMAIN on every machine outside Merck corp DNS,
so the test failed for the wrong reason and never actually exercised
the sort logic). Adds three more tests to cover the gaps an audit
flagged in boxlite-ai#645:

  - LongestSuffixWinsBeforeRoot now drives buildAllowNetDNSZonesWith()
    with a fixedIPResolver. Same assertions on zone ordering, but
    hermetic — no network dependency, no public DNS lookup.

  - CatchAllRootZoneAlwaysLast pins the invariant the dns_filter.go
    comments state ("Catch-all root zone: sinkhole everything not
    explicitly allowed") but no test ever asserted. Runs across 0/1/4
    host configurations as sub-tests.

  - BuildTapConfig_DNSResolutionFailure_PropagatesErrorNotSilentConfig
    is the end-to-end leg covering buildAllowNetDNSZones →
    buildTapConfig. Drives the chain with an RFC 6761 `.invalid` host
    (guaranteed NXDOMAIN), overrides dnsLookupAttemptTimeoutVar to
    keep the test fast, asserts (err != nil) AND (tapCfg == nil) AND
    err mentions the failing hostname.

  - BuildTapConfig_NoAllowNet_HappyPath_NoDNSLookup is the negative
    control — if a regression makes buildTapConfig hit the resolver
    even when allow_net is empty, the test trips a 50ms upper bound
    on elapsed time instead of silently passing.

  - BuildTapConfig_ProductionSymptomReproducer is the operator-shaped
    reproducer: N=3 allow_net hosts where 2 resolve and 1 doesn't.
    Failure message names the exact production symptom shape
    ("silent half-baked config with allow_zones=N-1 total_zones=N")
    so a future regression review immediately knows what was lost.

Two-sided verified by toggling production lines:
  - Toggle `sort.Slice(... > ...)` to `<`: LongestSuffixWinsBeforeRoot
    flips red ("github.com. must precede com.")
  - Toggle `return nil, lastErr` to `return nil, nil` in lookupWithRetry:
    ProductionSymptomReproducer flips red with the literal "REGRESSION —
    fix gone: ... allow_zones=3 total_zones=4" message it was designed
    to surface.
  - Toggle the catch-all root zone from `append` (end) to prepend
    (start): CatchAllRootZoneAlwaysLast flips red on the 3 non-empty
    sub-tests (the 0_hosts sub-test stays green because there's no
    other zone to displace it).

Coverage delta:
  +149 lines of new tests + 1 mock resolver (fixedIPResolver, ~5 lines).
  All 14 tests now green on a public-internet host. PR boxlite-ai#645's original
  4 tests (RetriesTransientResolverFailure, FailsClosedAfterRetries,
  HonorsPerAttemptTimeout, PerHostFailureFailsAggregate) unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@G4614 G4614 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also better to add tests to confirm the difference between pre and post fix

Have you meet DNS timeout I/O reported by mistake in this situation?

backoff := dnsLookupInitialBackoffVar

for attempt := 1; attempt <= dnsLookupAttempts; attempt++ {
ctx, cancel := context.WithTimeout(context.Background(), dnsLookupAttemptTimeoutVar)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to catch the parent cox so that Ctrl-C can end the retry timeout, thx

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, in the retry loop now takes a parentCtx parameter threaded down from gvproxy_create to handle interrupt signals

@ldarren

ldarren commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Also better to add tests to confirm the difference between pre and post fix

Have you meet DNS timeout I/O reported by mistake in this situation?

Sure, here are two snippets that demo before/after — easier to grok than test code:

Pre-fix (map-iteration order, what gvisor saw):

// gvisor matches first-suffix-wins, no most-specific preference.
// services/dns/dns.go:51 — strings.HasSuffix(q.Name, "."+zone.Name)
 
zones := []Zone{
  {Name: "com.",       DefaultIP: net.IPv4(0,0,0,0),
  Records: []Record{{Name: "github", IP: net.IPv4(20,205,243,166)}}},
  {Name: "sample.com.", DefaultIP: net.IPv4(0,0,0,0),
  Records: []Record{{Name: "iapi",   IP: net.IPv4(100,64,1,1)}}},
}
// query "iapi.sample.com." → matches com. first → no record "iapi.sample"
//                        → returns DefaultIP 0.0.0.0 

Post-fix (longest-name-first, what we now emit):

zones := []Zone{
  {Name: "sample.com.", DefaultIP: net.IPv4(0,0,0,0),
  Records: []Record{{Name: "iapi",   IP: net.IPv4(100,64,1,1)}}},
  {Name: "com.",       DefaultIP: net.IPv4(0,0,0,0),
  Records: []Record{{Name: "github", IP: net.IPv4(20,205,243,166)}}},
}
// query "iapi.sample.com." → matches sample.com. first → record "iapi"
//                        → returns 100.64.1.1 

I'd rather not encode this as a unit test — it would have to copy gvisor's matcher into our tests and would mislead future contributors. TestBuildAllowNetDNSZones_LongestSuffixWinsBeforeRoot already pins the contract on our side.

On DNS timeouts: doesn't happen as misreported — it just happens, on corporate networks with TLS scanners and proxy chains. Slow but legitimate resolution. The original code's context.Background() lookup didn't handle that error path at all; the new retry/fail-closed code does.

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR makes allow_net hostname resolution fail-closed and context-aware: hostnames are resolved at box creation with per-attempt timeouts, exponential backoff retries, and cancellation support; unresolved hosts cause build failure; zone ordering, tests, and docs updated accordingly.

Changes

DNS Allow-Net Resolution with Fail-Closed Semantics

Layer / File(s) Summary
DNS resolution retry logic and error handling
src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter.go
Adds context-aware hostname resolution infrastructure: buildAllowNetDNSZonesWith entry point accepts injected resolver; lookupWithRetry performs DNS lookups with configurable retries, per-attempt timeout, exponential backoff, and context cancellation support; resolveAndAddRecords calls lookupWithRetry and propagates errors instead of silently logging; zone construction is ordered deterministically by DNS name length (longest first).
Comprehensive DNS resolution and retry testing
src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter_test.go
Adds test doubles (flaky, always-failing, hanging, mixed resolvers) and extensive test cases covering: retry success/failure semantics, per-attempt timeout enforcement, context cancellation interrupting retries and backoff, aggregate error reporting for multi-host resolution failures, and deterministic zone ordering preventing zone bypass.
Shim entry point with signal-aware context and build logging
src/deps/libgvproxy-sys/gvproxy-bridge/main.go
main.go updates tap configuration building to be context-aware and fail-closed: buildDNSZones and buildTapConfig now accept context and return errors; gvproxy_create creates a signal-cancelable context (SIGINT/SIGTERM) that interrupts DNS resolution and backoff; build identity metadata is logged exactly once per process; on tap config failure, error is logged and gvproxy_create returns -1.
Dependent test suite updates
src/deps/libgvproxy-sys/gvproxy-bridge/config_build_test.go, src/deps/libgvproxy-sys/gvproxy-bridge/gvproxy_create_latency_test.go
Updates test files that call buildTapConfig and buildDNSZones to pass context.Background() and handle returned errors; three tap-config tests and warmup/per-iteration config building updated with explicit error handling.
User-facing documentation
docs/reference/README.md, docs/faq.md
Reference docs describe host-side resolution, IP pinning, exponential backoff retries, and hard failure on persistent resolution errors; FAQ adds troubleshooting guidance for 0.0.0.0 symptoms, log inspection, host-side verification, and box recreation after DNS recovery.

🎯 3 (Moderate) | ⏱️ ~22 minutes

🐰 A rabbit hops through DNS zones so fine,
With retries and backoff in perfect design,
No more silent sinkholing at zero-zero-zero,
Just fail-closed semantics—our networking hero!
Context cancels quick when signals appear,
Box creation stays honest, crystal clear.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 48.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately summarizes the main changes: retry logic, fail-closed behavior, and longest-suffix DNS zone ordering for allow_net resilience.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@cla-assistant

cla-assistant Bot commented Jun 3, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@ldarren

ldarren commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Small scope nit on logBridgeBuild / bridgeBuildID: nice debugging aid, but it's unrelated to the allow_net fix and the hand-bumped "...allownet-dns-resilience-2026-06" constant will go stale (nobody will remember to bump it). Consider either splitting it into its own PR, or dropping the constant and logging just the auto-derived vcs.revision/vcs.modified/vcs.time fields so it never lies. The retry + fail-closed + longest-suffix-sort changes all look great. 👍

agreed, constant removed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/faq.md`:
- Around line 288-305: The fenced log examples are missing language tags which
triggers markdownlint MD040; update each triple-backtick fence surrounding the
INFO/WARN/ERROR gvproxy examples to include a language identifier such as `text`
or `log` (e.g., replace ``` with ```text) so the blocks containing lines like
"INFO gvproxy: allowNet: DNS sinkhole..." "WARN gvproxy: allowNet: DNS
resolution failed..." and "ERROR gvproxy: allowNet: DNS resolution failed after
retries..." are properly annotated.

In `@src/deps/libgvproxy-sys/gvproxy-bridge/config_build_test.go`:
- Around line 64-67: The test currently relies on real DNS when calling
buildTapConfig(context.Background(), config, types.QemuProtocol); replace that
with a deterministic stubbed resolver for the allow-net path: create a test-only
net.Resolver (or a resolver hook provided by your config) that returns a fixed
IP for "example.com" and inject it (via the config struct or the context passed
to buildTapConfig) so the call to buildTapConfig uses the fake resolver instead
of the host resolver; update the test to construct that stub resolver and pass
it in before invoking buildTapConfig (referencing buildTapConfig, tapConfig,
config, and types.QemuProtocol to locate where to inject).

In `@src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter_test.go`:
- Around line 14-21: Tests call buildAllowNetDNSZones with real hostnames
causing network-dependent failures; change them to use the deterministic helper
buildAllowNetDNSZonesWith and a fake resolver (already present in this file) so
DNS lookups are mocked. Replace calls to buildAllowNetDNSZones([...
"api.openai.com", "*.anthropic.com" ...]) with buildAllowNetDNSZonesWith(ctx,
fakeResolver, [...]) (or the existing fake resolver helper), update any expected
zones/assertions to match the fake resolver responses, and apply the same change
to the other referenced blocks (lines around 38-41 and 69-78) to remove reliance
on public DNS.

In `@src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter.go`:
- Around line 189-215: The current loop drops IPv6 answers and returns nil even
when no IPv4 addresses are present, violating the fail-closed guarantee; modify
the allow-net resolution block (the code that builds zoneRecords with
types.Record and tracks v4Count/v4Strs after calling net.LookupIPAddr via
lookupWithRetry/LookupIPAddr) to check if v4Count == 0 and, in that case, log
the condition and return a non-nil error so lookupWithRetry treats the lookup as
failed rather than succeeded; keep the existing logging of v4_count/v4_ips and
only append types.Record entries for IPv4, but ensure the function returns an
error when no IPv4 addresses were found.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0c2ea517-89ea-44e6-8454-fcf012d70907

📥 Commits

Reviewing files that changed from the base of the PR and between a438d6b and 32fd74a.

📒 Files selected for processing (7)
  • docs/faq.md
  • docs/reference/README.md
  • src/deps/libgvproxy-sys/gvproxy-bridge/config_build_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/gvproxy_create_latency_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/main.go

Comment thread docs/faq.md
Comment on lines +288 to +305
```
INFO gvproxy: allowNet: DNS sinkhole configured allow_zones=5 total_zones=6
INFO gvproxy: allowNet TCP: filter initialized hostnames=5 ...
```

When a host couldn't be resolved you'll see retry warnings followed by
either a recovery (newer builds — the box still comes up cleanly):

```
WARN gvproxy: allowNet: DNS resolution failed, will retry attempt=1 hostname=iapi.example.com next_delay=100ms
INFO gvproxy: allowNet: DNS resolution succeeded after retry attempts=2 hostname=iapi.example.com
```

…or a hard failure, which now aborts `box.create`:

```
ERROR gvproxy: allowNet: DNS resolution failed after retries attempts=4 hostname=iapi.example.com error="..."
ERROR gvproxy: Failed to build gvproxy tap config error="allowNet: resolve \"iapi.example.com\": ..."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to these fenced blocks.

These new sample log/code fences are missing a language, and markdownlint is already flagging them (MD040). Adding text or log here will clear the warning.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 288-288: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 296-296: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 303-303: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/faq.md` around lines 288 - 305, The fenced log examples are missing
language tags which triggers markdownlint MD040; update each triple-backtick
fence surrounding the INFO/WARN/ERROR gvproxy examples to include a language
identifier such as `text` or `log` (e.g., replace ``` with ```text) so the
blocks containing lines like "INFO gvproxy: allowNet: DNS sinkhole..." "WARN
gvproxy: allowNet: DNS resolution failed..." and "ERROR gvproxy: allowNet: DNS
resolution failed after retries..." are properly annotated.

Comment on lines +64 to +67
tapConfig, err := buildTapConfig(context.Background(), config, types.QemuProtocol)
if err != nil {
t.Fatalf("unexpected error: %v", err)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid live DNS in this unit test.

This assertion now depends on the runner being able to resolve example.com, so the test can fail for network reasons unrelated to the code under test. Please stub the allow-net resolution path here instead of calling the real host resolver.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/deps/libgvproxy-sys/gvproxy-bridge/config_build_test.go` around lines 64
- 67, The test currently relies on real DNS when calling
buildTapConfig(context.Background(), config, types.QemuProtocol); replace that
with a deterministic stubbed resolver for the allow-net path: create a test-only
net.Resolver (or a resolver hook provided by your config) that returns a fixed
IP for "example.com" and inject it (via the config struct or the context passed
to buildTapConfig) so the call to buildTapConfig uses the fake resolver instead
of the host resolver; update the test to construct that stub resolver and pass
it in before invoking buildTapConfig (referencing buildTapConfig, tapConfig,
config, and types.QemuProtocol to locate where to inject).

Comment thread src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter_test.go
Comment on lines +189 to +215
v4Count := 0
v4Strs := make([]string, 0, len(ips))
for _, ip := range ips {
if ip.IP.To4() == nil {
continue // Skip IPv6 for now
}
v4Count++
v4Strs = append(v4Strs, ip.IP.String())
zoneRecords[zoneName] = append(zoneRecords[zoneName], types.Record{
Name: trimmed,
IP: ip.IP.To4(),
})
logrus.WithFields(logrus.Fields{
"hostname": hostname,
"ip": ip.IP,
"zone": zoneName,
"label": trimmed,
}).Debug("allowNet: resolved and added DNS record")
}

// One Info line per allow-listed host. Without this it's invisible
// whether a hostname's lookup succeeded but yielded only IPv6 (which
// we drop), which would silently sinkhole that host even though no
// retry/error fired. v4_count=0 is the smoking gun for that case.
logrus.WithFields(logrus.Fields{
"hostname": hostname,
"zone": zoneName,
"label": trimmed,
"v4_count": v4Count,
"v4_ips": v4Strs,
}).Info("allowNet: host resolved")

return nil

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when resolution yields only IPv6 answers.

lookupWithRetry treats any non-empty LookupIPAddr result as success, but this block drops every non-IPv4 address and still returns nil. For an IPv6-only allow-listed host, box creation will still succeed and the guest will sinkhole it to 0.0.0.0, which breaks the new fail-closed guarantee.

💡 Suggested fix
 	for _, ip := range ips {
 		if ip.IP.To4() == nil {
 			continue // Skip IPv6 for now
 		}
 		v4Count++
 		v4Strs = append(v4Strs, ip.IP.String())
 		zoneRecords[zoneName] = append(zoneRecords[zoneName], types.Record{
 			Name: trimmed,
 			IP:   ip.IP.To4(),
 		})
 	}
+
+	if v4Count == 0 {
+		return fmt.Errorf("allowNet: no IPv4 records returned for %q", hostname)
+	}
 
 	// One Info line per allow-listed host. Without this it's invisible
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter.go` around lines 189 - 215,
The current loop drops IPv6 answers and returns nil even when no IPv4 addresses
are present, violating the fail-closed guarantee; modify the allow-net
resolution block (the code that builds zoneRecords with types.Record and tracks
v4Count/v4Strs after calling net.LookupIPAddr via lookupWithRetry/LookupIPAddr)
to check if v4Count == 0 and, in that case, log the condition and return a
non-nil error so lookupWithRetry treats the lookup as failed rather than
succeeded; keep the existing logging of v4_count/v4_ips and only append
types.Record entries for IPv4, but ensure the function returns an error when no
IPv4 addresses were found.

ldarren added 2 commits June 7, 2026 15:22
…ix wins

`buildAllowNetDNSZones` had three independent bugs that combined to
silently sinkhole allow-listed hostnames inside the guest, returning
0.0.0.0 for the lifetime of the box:

1. **No retry/timeout on host-side resolution.** Each allow-listed
   hostname was resolved exactly once with `context.Background()` and
   no timeout. A momentary host hiccup (VPN flap, slow corp resolver,
   mDNSResponder churn) silently dropped the zone, and the symptom
   `allow_zones=N-1 total_zones=N` was the only signal — a warning
   most callers never noticed. Now: 4 attempts with exponential
   backoff (100/300/900/2700 ms) and a 2 s per-attempt deadline.
   Successful retries log `INFO allowNet: DNS resolution succeeded
   after retry`.

2. **Silent half-baked sinkhole.** When resolution definitively
   failed, the zone was just dropped and box.create proceeded as if
   nothing was wrong — producing a sinkhole that would 0.0.0.0
   exactly the host the caller said they needed. Now: the resolver
   error propagates through `buildAllowNetDNSZones` →
   `buildDNSZones` → `buildTapConfig` → `gvproxy_create`, surfaced
   to the Rust caller via the new `errOut` channel from boxlite-ai#612 so the
   user-visible BoxliteError says
   `allowNet: resolve "iapi.example.com": ...` instead of
   "krun_start_enter returned: -22".

3. **Map-iteration zone order vs first-suffix-wins matcher.**
   gvisor-tap-vsock's DNS handler matches zones first-suffix-wins
   with no most-specific-match preference (services/dns/dns.go:51).
   We were emitting zones in Go map iteration order — random per
   process. With an allow-list of `[github.com, api.github.com,
   iapi.merck.com]`, an `iapi.merck.com.` query would sometimes hit
   the `com.` zone first (because `github.com` lives there), find no
   matching record name, and return that zone's
   `DefaultIP=0.0.0.0`. The same query hit `merck.com.` first on
   other runs and resolved correctly — explaining the "intermittent"
   reports. Now: zones are sorted longest-name-first before the
   slice is built, so the most-specific suffix always wins, and the
   catch-all root zone stays last.

Also adds two pieces of operability that made this debuggable:

- A build-identity log line on the first `gvproxy_create` of a shim
  process: `INFO gvproxy-bridge: build identity build_id=… vcs_revision=…
  vcs_modified=… go_version=…`. Lets a user confirm whether a freshly
  built shim is actually what's loaded — without it, "did the rebuild
  take effect?" is answered by guesswork.
- An `INFO allowNet: host resolved` line per allow-listed hostname,
  with the resolved IP list and v4 count. Visible at INFO so
  zero-IPv4 / IPv6-only hosts no longer hide.

Tests added (in `dns_filter_test.go`):

- `RetriesTransientResolverFailure` — flaky resolver that fails the
  first 2 attempts then succeeds; asserts retry recovers and bakes
  the host's records.
- `FailsClosedAfterRetries` — always-fail resolver; asserts the
  N-attempts-then-error contract (no zone with `Default=0.0.0.0`
  silently shipped).
- `HonorsPerAttemptTimeout` — hanging resolver that blocks until
  ctx cancellation; asserts the per-attempt deadline fires (without
  this, an unresponsive corp DNS server hangs box.create
  indefinitely).
- `PerHostFailureFailsAggregate` — multi-host allow-list where one
  host permafails; asserts the whole sinkhole build aborts and the
  failing host is named in the error.
- `LongestSuffixWinsBeforeRoot` — pins the bug-3 fix: with
  `[github.com, api.github.com, raw.githubusercontent.com,
  codeload.github.com, iapi.merck.com]`, asserts every per-TLD zone
  precedes the `com.` zone in the output slice, so the
  first-match-wins matcher sees `merck.com.` before `com.`.

The retry-loop timing constants are exposed as package-level `var`s
so tests can override them without slowing the suite (the timing
contract is what's pinned, not the specific durations).

Two-side verification: with the production changes reverted but the
new tests kept, the four reproducer tests fail with the expected
signals (`expected retry to recover`, `expected 4 attempts got 1`,
hangs to the 60s suite timeout, and `LongestSuffixWinsBeforeRoot`'s
ordering assertion); restore the changes and all pass.

Docs updated:

- `docs/faq.md` — new Networking entry
  "An allow_net host resolves to 0.0.0.0 inside the box" with
  diagnostic steps (where the gvproxy log lives, what
  `allow_zones=N-1 total_zones=N` looks like vs the new retry/fail
  warnings, why the sinkhole is baked once at create time and
  recreating the box — not restarting it — is what propagates a
  fix).
- `docs/reference/README.md` — bullet under the `network` notes
  describing host-side pre-resolution, retries, and fail-closed
  behavior so callers know `allow_net` entries must be
  resolvable on the host before `box.create`.
@ldarren ldarren force-pushed the fix/gvproxy-allownet-dns-resilience branch from 32fd74a to c1a3814 Compare June 7, 2026 07:23

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
docs/faq.md (1)

288-288: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language tags to fenced code blocks.

The fenced log examples are missing language identifiers, which triggers markdownlint MD040. Add text or log to each opening fence (e.g., ```text) for the blocks containing the INFO/WARN/ERROR gvproxy examples.

Also applies to: 296-296, 303-303

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/faq.md` at line 288, The fenced log example blocks in the FAQ contain
plaintext gvproxy INFO/WARN/ERROR logs and lack language tags (tripping
markdownlint MD040); update each opening triple-backtick fence for those gvproxy
examples to include a language identifier such as text or log (e.g., ```text) so
the INFO/WARN/ERROR blocks are correctly marked as plain text; target the fenced
code blocks that show the gvproxy log examples and add the language tag to each
opening fence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@docs/faq.md`:
- Line 288: The fenced log example blocks in the FAQ contain plaintext gvproxy
INFO/WARN/ERROR logs and lack language tags (tripping markdownlint MD040);
update each opening triple-backtick fence for those gvproxy examples to include
a language identifier such as text or log (e.g., ```text) so the INFO/WARN/ERROR
blocks are correctly marked as plain text; target the fenced code blocks that
show the gvproxy log examples and add the language tag to each opening fence.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ab6968a4-eb2d-46be-a7f7-502953d623fb

📥 Commits

Reviewing files that changed from the base of the PR and between 32fd74a and c1a3814.

📒 Files selected for processing (7)
  • docs/faq.md
  • docs/reference/README.md
  • src/deps/libgvproxy-sys/gvproxy-bridge/config_build_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/gvproxy_create_latency_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/main.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • docs/reference/README.md
  • src/deps/libgvproxy-sys/gvproxy-bridge/gvproxy_create_latency_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/config_build_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter_test.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/dns_filter.go
  • src/deps/libgvproxy-sys/gvproxy-bridge/main.go

@DorianZheng

Copy link
Copy Markdown
Member

solid fix — retry, fail-closed, and longest-suffix ordering each map cleanly onto the three root causes, and the injected-resolver tests are the right shape. two things i'd want resolved before merge, plus some cleanups.

signal handler inside the FFI export (main.go:985)

signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM) runs inside gvproxy_create, which is a //exported symbol called from the rust runtime embedded in the python/node/go/c SDKs — a host process that owns its own signal disposition. in c-shared mode every box.create transiently diverts the host's SIGINT/SIGTERM into go's signal machinery. a host app with its own SIGTERM handler (graceful shutdown, supervisor) could see different behavior during any create. the cancellation goal is right; a library grabbing process signals to achieve it is the part i'd push back on. can the cancellation be driven by a context the rust caller already controls across the FFI boundary instead? and was this exercised with an embedding process that installs its own handlers?

fail-closed doesn't cover ipv6-only hosts (dns_filter.go:299-329)

lookupWithRetry treats len(ips) > 0 as success, but resolveAndAddRecords then drops every non-v4 address (if ip.IP.To4() == nil { continue }). an allow-listed host that resolves AAAA-only gets a zone with zero A records, falls through to that zone's DefaultIP=0.0.0.0, and silently sinkholes — the exact symptom this PR sets out to kill. it's logged (v4_count=0) but not failed. meanwhile docs/reference/README.md:83-90 states create "fails — by design, so a half-baked sinkhole never silently sinkholes an allow-listed host to 0.0.0.0", which is only true for resolution errors, not v4-less results. the code and the doc guarantee disagree; one of them needs to move.

minor

  • config_build comment above const dnsLookupAttempts says "exposed as var (not const) so tests can override timing" — it's describing the var (...) block below it but sits on the const. move it so the const isn't covered by an "exposed as var" sentence.
  • dns_filter_test.go hand-rolls contains/indexOf instead of importing strings and calling strings.Contains — more code, more to get wrong, just import it.
  • flakyResolver.successIPs and attemptHook are read but never set by any test — dead scaffolding, drop until something needs them.
  • the zone sort uses sort.Slice (unstable) but the comment claims "deterministic" ordering — equal-length zone names get an unspecified order. doesn't affect correctness (equal length can't shadow), and the test only uses distinct-length names so it wouldn't catch a regression; use SliceStable or a lexical tiebreak if you want the comment to hold literally.
  • dnsLookupInitialBackoffVar/dnsLookupAttemptTimeoutVar carry a Var suffix that dnsLookupBackoffFactor (also a var) doesn't — inconsistent, and it leaks "is a variable" into the name with no const counterpart to disambiguate from.

worth a note, not a blocker

  • worst-case create latency on the failure path is sequential per host: ~4×2s timeouts + ~4s backoff ≈ 12s per unreachable host, so a 10-entry allow_net of dead hosts is ~2min before create fails (interruptible, modulo the signal issue above). a line in docs/reference would save someone the surprise.
  • the longest-suffix test asserts slice order as a proxy for gvisor-tap-vsock's first-suffix-wins matching — reasonable since you can't easily drive the real matcher, but the whole fix rests on that upstream contract. a link to the gvisor-tap-vsock matcher code would harden it against an upstream behavior change the test can't see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants