Skip to content

test(net): hermetic + e2e coverage for #645 allowlist DNS resilience#640

Draft
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:feat/runtime-guard-and-df
Draft

test(net): hermetic + e2e coverage for #645 allowlist DNS resilience#640
G4614 wants to merge 2 commits into
boxlite-ai:mainfrom
G4614:feat/runtime-guard-and-df

Conversation

@G4614

@G4614 G4614 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Paired with #645 — adds the test coverage audit flagged as missing on that PR (PR slot feat/runtime-guard-and-df reused per request; original disk-guard content superseded by #647).

Two-sided verification

Toggled each of the 3 fixes in dns_filter.go and observed the corresponding new test flip red → green:

toggle (revert the fix) which test goes red red message
sort.Slice(... > ...)< (shortest-first instead of longest-first) LongestSuffixWinsBeforeRoot github.com. must precede com.
return nil, lastErrreturn nil, nil in lookupWithRetry (silent failure on exhausted retries) ProductionSymptomReproducer REGRESSION — fix gone: buildTapConfig returned a silent half-baked config with allow_zones=3 total_zones=4 while one allow_net host failed to resolve. This is the exact symptom users reported.
Catch-all root zone append(zones, ...) → prepend (move sinkhole to front) CatchAllRootZoneAlwaysLast (3/4 sub-tests; 0_hosts stays green because there's no other zone to displace it) last zone must be root catch-all (Name==""), got "net." at index 4/5

Each toggle was applied in isolation, the affected test was run, the message above captured, then the file restored and the full suite re-run to confirm all 14 tests green again. This is the verification mode #645 itself didn't have — its tests asserted invariants but none was confirmed to flip red when the corresponding production line is removed.

Test plan

cd src/deps/libgvproxy-sys/gvproxy-bridge
go test -run 'TestBuildAllowNetDNSZones_|TestBuildTapConfig_' -v -timeout 60s

All 14 tests green. The 4 original tests from #645 are unchanged.

What's added (brief)

  • LongestSuffixWinsBeforeRoot rewritten to use fixedIPResolver instead of calling the real system resolver against iapi.merck.com (Merck-internal — NXDOMAINs on every public CI host, so the test was failing for the wrong reason and never exercised the sort logic).
  • CatchAllRootZoneAlwaysLast (sub-tests 0/1/4 hosts) — pins the invariant the source comments state but no test asserted.
  • BuildTapConfig_DNSResolutionFailure_PropagatesErrorNotSilentConfig — end-to-end through buildAllowNetDNSZones → buildTapConfig with an RFC 6761 .invalid host.
  • BuildTapConfig_NoAllowNet_HappyPath_NoDNSLookup — negative control with 50ms upper-bound timer.
  • BuildTapConfig_ProductionSymptomReproducer — operator-shaped reproducer, message names the exact allow_zones=N-1 total_zones=N shape.

Delta: dns_filter_test.go +222 / -2. One small mock helper (fixedIPResolver, ~5 lines).

Out of scope

  • True FFI-level e2e through gvproxy_create (cgo //export, needs socket + VM stub).
  • Rust-side box.create integration test.
  • The 4 other silent goroutines in main.go (OverrideTCPHandler, VFKit/Qemu Accept, protocol handlers) — same shape as fix(net): surface gvproxy bind errors through FFI #612's race, separate PR worth of work.

🤖 Generated with Claude Code

G4614 added a commit to G4614/boxlite that referenced this pull request Jun 1, 2026
…icy walls

Replaces the entire previous boxlite-ai#618 (admission guard + recovery budget +
auto-GC + per-command statvfs check, ~1500 LOC) with a structural
fallocate-based reserve (~200 LOC). The kernel now enforces the floor
at every write(2); boxlite owns only the reserve file's lifecycle and
the operator's recovery affordance.

What lands:

  - boxlite::util::reserve — `ensure_reserve(home)` preallocates 64 MiB
    into `$BOXLITE_HOME/.reserve` via fallocate(mode=0). Idempotent +
    self-healing: top-up if size dropped, recreate if the file was
    removed. fallback to a 64 MiB zero-write when fallocate returns
    EOPNOTSUPP (tmpfs / some FUSE backends).

  - RuntimeImpl::new calls ensure_reserve right after layout.prepare().
    From that moment on the host filesystem's f_bavail is 64 MiB lower
    for every writer — boxlite, the operator's other tools, anything
    else. No per-command statvfs poll, no policy table to maintain.

  - `boxlite reserve-release` CLI command — emergency: unlink the
    reserve so the operator can run gc / rm on a full host. unlink(2)
    is metadata-only on ext4/xfs/btrfs, so it works at 0 free. The
    next runtime construction will lay the reserve back down
    automatically.

  - CLI dispatcher catches ENOSPC chains and prints a one-line hint
    pointing at `boxlite reserve-release`. Substring + raw_os_error
    match so it works whether the error came from std::io::Error,
    BoxliteError::Storage(String), or a wrapped reqwest body upload.

What goes away from the original boxlite-ai#618 scope (deliberately):

  - DiskSpaceTask init task + classify() three-tier thresholds
  - enforce_recovery_budget calls in 6 CLI commands + 6 REST handlers
  - periodic_recovery_budget_monitor for serve
  - boxlite gc + collect_garbage + sweep_orphan_disk_images + auto-GC
    self-heal (these were `prevention via reactive recovery`; with the
    structural reserve the equivalent recovery is one operator-driven
    `boxlite reserve-release` + manual cleanup, which is more
    predictable than auto-GC and surfaces ENOSPC cleanly to scripts)

boxlite-ai#639 (GC scope expansion) and boxlite-ai#640 (RuntimeBackend wiring + boxlite df)
will need to be restructured as standalone follow-ups since their base
in this branch is now gone — addressed in a separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ix wins

`buildAllowNetDNSZones` had three independent bugs that combined to
silently sinkhole allow-listed hostnames inside the guest, returning
0.0.0.0 for the lifetime of the box:

1. **No retry/timeout on host-side resolution.** Each allow-listed
   hostname was resolved exactly once with `context.Background()` and
   no timeout. A momentary host hiccup (VPN flap, slow corp resolver,
   mDNSResponder churn) silently dropped the zone, and the symptom
   `allow_zones=N-1 total_zones=N` was the only signal — a warning
   most callers never noticed. Now: 4 attempts with exponential
   backoff (100/300/900/2700 ms) and a 2 s per-attempt deadline.
   Successful retries log `INFO allowNet: DNS resolution succeeded
   after retry`.

2. **Silent half-baked sinkhole.** When resolution definitively
   failed, the zone was just dropped and box.create proceeded as if
   nothing was wrong — producing a sinkhole that would 0.0.0.0
   exactly the host the caller said they needed. Now: the resolver
   error propagates through `buildAllowNetDNSZones` →
   `buildDNSZones` → `buildTapConfig` → `gvproxy_create`, surfaced
   to the Rust caller via the new `errOut` channel from boxlite-ai#612 so the
   user-visible BoxliteError says
   `allowNet: resolve "iapi.example.com": ...` instead of
   "krun_start_enter returned: -22".

3. **Map-iteration zone order vs first-suffix-wins matcher.**
   gvisor-tap-vsock's DNS handler matches zones first-suffix-wins
   with no most-specific-match preference (services/dns/dns.go:51).
   We were emitting zones in Go map iteration order — random per
   process. With an allow-list of `[github.com, api.github.com,
   iapi.merck.com]`, an `iapi.merck.com.` query would sometimes hit
   the `com.` zone first (because `github.com` lives there), find no
   matching record name, and return that zone's
   `DefaultIP=0.0.0.0`. The same query hit `merck.com.` first on
   other runs and resolved correctly — explaining the "intermittent"
   reports. Now: zones are sorted longest-name-first before the
   slice is built, so the most-specific suffix always wins, and the
   catch-all root zone stays last.

Also adds two pieces of operability that made this debuggable:

- A build-identity log line on the first `gvproxy_create` of a shim
  process: `INFO gvproxy-bridge: build identity build_id=… vcs_revision=…
  vcs_modified=… go_version=…`. Lets a user confirm whether a freshly
  built shim is actually what's loaded — without it, "did the rebuild
  take effect?" is answered by guesswork.
- An `INFO allowNet: host resolved` line per allow-listed hostname,
  with the resolved IP list and v4 count. Visible at INFO so
  zero-IPv4 / IPv6-only hosts no longer hide.

Tests added (in `dns_filter_test.go`):

- `RetriesTransientResolverFailure` — flaky resolver that fails the
  first 2 attempts then succeeds; asserts retry recovers and bakes
  the host's records.
- `FailsClosedAfterRetries` — always-fail resolver; asserts the
  N-attempts-then-error contract (no zone with `Default=0.0.0.0`
  silently shipped).
- `HonorsPerAttemptTimeout` — hanging resolver that blocks until
  ctx cancellation; asserts the per-attempt deadline fires (without
  this, an unresponsive corp DNS server hangs box.create
  indefinitely).
- `PerHostFailureFailsAggregate` — multi-host allow-list where one
  host permafails; asserts the whole sinkhole build aborts and the
  failing host is named in the error.
- `LongestSuffixWinsBeforeRoot` — pins the bug-3 fix: with
  `[github.com, api.github.com, raw.githubusercontent.com,
  codeload.github.com, iapi.merck.com]`, asserts every per-TLD zone
  precedes the `com.` zone in the output slice, so the
  first-match-wins matcher sees `merck.com.` before `com.`.

The retry-loop timing constants are exposed as package-level `var`s
so tests can override them without slowing the suite (the timing
contract is what's pinned, not the specific durations).

Two-side verification: with the production changes reverted but the
new tests kept, the four reproducer tests fail with the expected
signals (`expected retry to recover`, `expected 4 attempts got 1`,
hangs to the 60s suite timeout, and `LongestSuffixWinsBeforeRoot`'s
ordering assertion); restore the changes and all pass.

Docs updated:

- `docs/faq.md` — new Networking entry
  "An allow_net host resolves to 0.0.0.0 inside the box" with
  diagnostic steps (where the gvproxy log lives, what
  `allow_zones=N-1 total_zones=N` looks like vs the new retry/fail
  warnings, why the sinkhole is baked once at create time and
  recreating the box — not restarting it — is what propagates a
  fix).
- `docs/reference/README.md` — bullet under the `network` notes
  describing host-side pre-resolution, retries, and fail-closed
  behavior so callers know `allow_net` entries must be
  resolvable on the host before `box.create`.
…erage

Fixes the broken `LongestSuffixWinsBeforeRoot` test (it called the real
resolver and died with NXDOMAIN on every machine outside Merck corp DNS,
so the test failed for the wrong reason and never actually exercised
the sort logic). Adds three more tests to cover the gaps an audit
flagged in boxlite-ai#645:

  - LongestSuffixWinsBeforeRoot now drives buildAllowNetDNSZonesWith()
    with a fixedIPResolver. Same assertions on zone ordering, but
    hermetic — no network dependency, no public DNS lookup.

  - CatchAllRootZoneAlwaysLast pins the invariant the dns_filter.go
    comments state ("Catch-all root zone: sinkhole everything not
    explicitly allowed") but no test ever asserted. Runs across 0/1/4
    host configurations as sub-tests.

  - BuildTapConfig_DNSResolutionFailure_PropagatesErrorNotSilentConfig
    is the end-to-end leg covering buildAllowNetDNSZones →
    buildTapConfig. Drives the chain with an RFC 6761 `.invalid` host
    (guaranteed NXDOMAIN), overrides dnsLookupAttemptTimeoutVar to
    keep the test fast, asserts (err != nil) AND (tapCfg == nil) AND
    err mentions the failing hostname.

  - BuildTapConfig_NoAllowNet_HappyPath_NoDNSLookup is the negative
    control — if a regression makes buildTapConfig hit the resolver
    even when allow_net is empty, the test trips a 50ms upper bound
    on elapsed time instead of silently passing.

  - BuildTapConfig_ProductionSymptomReproducer is the operator-shaped
    reproducer: N=3 allow_net hosts where 2 resolve and 1 doesn't.
    Failure message names the exact production symptom shape
    ("silent half-baked config with allow_zones=N-1 total_zones=N")
    so a future regression review immediately knows what was lost.

Two-sided verified by toggling production lines:
  - Toggle `sort.Slice(... > ...)` to `<`: LongestSuffixWinsBeforeRoot
    flips red ("github.com. must precede com.")
  - Toggle `return nil, lastErr` to `return nil, nil` in lookupWithRetry:
    ProductionSymptomReproducer flips red with the literal "REGRESSION —
    fix gone: ... allow_zones=3 total_zones=4" message it was designed
    to surface.
  - Toggle the catch-all root zone from `append` (end) to prepend
    (start): CatchAllRootZoneAlwaysLast flips red on the 3 non-empty
    sub-tests (the 0_hosts sub-test stays green because there's no
    other zone to displace it).

Coverage delta:
  +149 lines of new tests + 1 mock resolver (fixedIPResolver, ~5 lines).
  All 14 tests now green on a public-internet host. PR boxlite-ai#645's original
  4 tests (RetriesTransientResolverFailure, FailsClosedAfterRetries,
  HonorsPerAttemptTimeout, PerHostFailureFailsAggregate) unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the feat/runtime-guard-and-df branch from 73827d2 to 2391183 Compare June 3, 2026 09:28
@G4614 G4614 changed the title feat(disk_guard,df): sink recovery budget into RuntimeBackend + boxlite df test(net): hermetic + e2e coverage for #645 allowlist DNS resilience Jun 3, 2026
@cla-assistant

cla-assistant Bot commented Jun 3, 2026

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ G4614
❌ Darren Liew


Darren Liew seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant