test(net): hermetic + e2e coverage for #645 allowlist DNS resilience#640
Draft
G4614 wants to merge 2 commits into
Draft
test(net): hermetic + e2e coverage for #645 allowlist DNS resilience#640G4614 wants to merge 2 commits into
G4614 wants to merge 2 commits into
Conversation
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
Jun 1, 2026
…icy walls Replaces the entire previous boxlite-ai#618 (admission guard + recovery budget + auto-GC + per-command statvfs check, ~1500 LOC) with a structural fallocate-based reserve (~200 LOC). The kernel now enforces the floor at every write(2); boxlite owns only the reserve file's lifecycle and the operator's recovery affordance. What lands: - boxlite::util::reserve — `ensure_reserve(home)` preallocates 64 MiB into `$BOXLITE_HOME/.reserve` via fallocate(mode=0). Idempotent + self-healing: top-up if size dropped, recreate if the file was removed. fallback to a 64 MiB zero-write when fallocate returns EOPNOTSUPP (tmpfs / some FUSE backends). - RuntimeImpl::new calls ensure_reserve right after layout.prepare(). From that moment on the host filesystem's f_bavail is 64 MiB lower for every writer — boxlite, the operator's other tools, anything else. No per-command statvfs poll, no policy table to maintain. - `boxlite reserve-release` CLI command — emergency: unlink the reserve so the operator can run gc / rm on a full host. unlink(2) is metadata-only on ext4/xfs/btrfs, so it works at 0 free. The next runtime construction will lay the reserve back down automatically. - CLI dispatcher catches ENOSPC chains and prints a one-line hint pointing at `boxlite reserve-release`. Substring + raw_os_error match so it works whether the error came from std::io::Error, BoxliteError::Storage(String), or a wrapped reqwest body upload. What goes away from the original boxlite-ai#618 scope (deliberately): - DiskSpaceTask init task + classify() three-tier thresholds - enforce_recovery_budget calls in 6 CLI commands + 6 REST handlers - periodic_recovery_budget_monitor for serve - boxlite gc + collect_garbage + sweep_orphan_disk_images + auto-GC self-heal (these were `prevention via reactive recovery`; with the structural reserve the equivalent recovery is one operator-driven `boxlite reserve-release` + manual cleanup, which is more predictable than auto-GC and surfaces ENOSPC cleanly to scripts) boxlite-ai#639 (GC scope expansion) and boxlite-ai#640 (RuntimeBackend wiring + boxlite df) will need to be restructured as standalone follow-ups since their base in this branch is now gone — addressed in a separate change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ix wins `buildAllowNetDNSZones` had three independent bugs that combined to silently sinkhole allow-listed hostnames inside the guest, returning 0.0.0.0 for the lifetime of the box: 1. **No retry/timeout on host-side resolution.** Each allow-listed hostname was resolved exactly once with `context.Background()` and no timeout. A momentary host hiccup (VPN flap, slow corp resolver, mDNSResponder churn) silently dropped the zone, and the symptom `allow_zones=N-1 total_zones=N` was the only signal — a warning most callers never noticed. Now: 4 attempts with exponential backoff (100/300/900/2700 ms) and a 2 s per-attempt deadline. Successful retries log `INFO allowNet: DNS resolution succeeded after retry`. 2. **Silent half-baked sinkhole.** When resolution definitively failed, the zone was just dropped and box.create proceeded as if nothing was wrong — producing a sinkhole that would 0.0.0.0 exactly the host the caller said they needed. Now: the resolver error propagates through `buildAllowNetDNSZones` → `buildDNSZones` → `buildTapConfig` → `gvproxy_create`, surfaced to the Rust caller via the new `errOut` channel from boxlite-ai#612 so the user-visible BoxliteError says `allowNet: resolve "iapi.example.com": ...` instead of "krun_start_enter returned: -22". 3. **Map-iteration zone order vs first-suffix-wins matcher.** gvisor-tap-vsock's DNS handler matches zones first-suffix-wins with no most-specific-match preference (services/dns/dns.go:51). We were emitting zones in Go map iteration order — random per process. With an allow-list of `[github.com, api.github.com, iapi.merck.com]`, an `iapi.merck.com.` query would sometimes hit the `com.` zone first (because `github.com` lives there), find no matching record name, and return that zone's `DefaultIP=0.0.0.0`. The same query hit `merck.com.` first on other runs and resolved correctly — explaining the "intermittent" reports. Now: zones are sorted longest-name-first before the slice is built, so the most-specific suffix always wins, and the catch-all root zone stays last. Also adds two pieces of operability that made this debuggable: - A build-identity log line on the first `gvproxy_create` of a shim process: `INFO gvproxy-bridge: build identity build_id=… vcs_revision=… vcs_modified=… go_version=…`. Lets a user confirm whether a freshly built shim is actually what's loaded — without it, "did the rebuild take effect?" is answered by guesswork. - An `INFO allowNet: host resolved` line per allow-listed hostname, with the resolved IP list and v4 count. Visible at INFO so zero-IPv4 / IPv6-only hosts no longer hide. Tests added (in `dns_filter_test.go`): - `RetriesTransientResolverFailure` — flaky resolver that fails the first 2 attempts then succeeds; asserts retry recovers and bakes the host's records. - `FailsClosedAfterRetries` — always-fail resolver; asserts the N-attempts-then-error contract (no zone with `Default=0.0.0.0` silently shipped). - `HonorsPerAttemptTimeout` — hanging resolver that blocks until ctx cancellation; asserts the per-attempt deadline fires (without this, an unresponsive corp DNS server hangs box.create indefinitely). - `PerHostFailureFailsAggregate` — multi-host allow-list where one host permafails; asserts the whole sinkhole build aborts and the failing host is named in the error. - `LongestSuffixWinsBeforeRoot` — pins the bug-3 fix: with `[github.com, api.github.com, raw.githubusercontent.com, codeload.github.com, iapi.merck.com]`, asserts every per-TLD zone precedes the `com.` zone in the output slice, so the first-match-wins matcher sees `merck.com.` before `com.`. The retry-loop timing constants are exposed as package-level `var`s so tests can override them without slowing the suite (the timing contract is what's pinned, not the specific durations). Two-side verification: with the production changes reverted but the new tests kept, the four reproducer tests fail with the expected signals (`expected retry to recover`, `expected 4 attempts got 1`, hangs to the 60s suite timeout, and `LongestSuffixWinsBeforeRoot`'s ordering assertion); restore the changes and all pass. Docs updated: - `docs/faq.md` — new Networking entry "An allow_net host resolves to 0.0.0.0 inside the box" with diagnostic steps (where the gvproxy log lives, what `allow_zones=N-1 total_zones=N` looks like vs the new retry/fail warnings, why the sinkhole is baked once at create time and recreating the box — not restarting it — is what propagates a fix). - `docs/reference/README.md` — bullet under the `network` notes describing host-side pre-resolution, retries, and fail-closed behavior so callers know `allow_net` entries must be resolvable on the host before `box.create`.
…erage Fixes the broken `LongestSuffixWinsBeforeRoot` test (it called the real resolver and died with NXDOMAIN on every machine outside Merck corp DNS, so the test failed for the wrong reason and never actually exercised the sort logic). Adds three more tests to cover the gaps an audit flagged in boxlite-ai#645: - LongestSuffixWinsBeforeRoot now drives buildAllowNetDNSZonesWith() with a fixedIPResolver. Same assertions on zone ordering, but hermetic — no network dependency, no public DNS lookup. - CatchAllRootZoneAlwaysLast pins the invariant the dns_filter.go comments state ("Catch-all root zone: sinkhole everything not explicitly allowed") but no test ever asserted. Runs across 0/1/4 host configurations as sub-tests. - BuildTapConfig_DNSResolutionFailure_PropagatesErrorNotSilentConfig is the end-to-end leg covering buildAllowNetDNSZones → buildTapConfig. Drives the chain with an RFC 6761 `.invalid` host (guaranteed NXDOMAIN), overrides dnsLookupAttemptTimeoutVar to keep the test fast, asserts (err != nil) AND (tapCfg == nil) AND err mentions the failing hostname. - BuildTapConfig_NoAllowNet_HappyPath_NoDNSLookup is the negative control — if a regression makes buildTapConfig hit the resolver even when allow_net is empty, the test trips a 50ms upper bound on elapsed time instead of silently passing. - BuildTapConfig_ProductionSymptomReproducer is the operator-shaped reproducer: N=3 allow_net hosts where 2 resolve and 1 doesn't. Failure message names the exact production symptom shape ("silent half-baked config with allow_zones=N-1 total_zones=N") so a future regression review immediately knows what was lost. Two-sided verified by toggling production lines: - Toggle `sort.Slice(... > ...)` to `<`: LongestSuffixWinsBeforeRoot flips red ("github.com. must precede com.") - Toggle `return nil, lastErr` to `return nil, nil` in lookupWithRetry: ProductionSymptomReproducer flips red with the literal "REGRESSION — fix gone: ... allow_zones=3 total_zones=4" message it was designed to surface. - Toggle the catch-all root zone from `append` (end) to prepend (start): CatchAllRootZoneAlwaysLast flips red on the 3 non-empty sub-tests (the 0_hosts sub-test stays green because there's no other zone to displace it). Coverage delta: +149 lines of new tests + 1 mock resolver (fixedIPResolver, ~5 lines). All 14 tests now green on a public-internet host. PR boxlite-ai#645's original 4 tests (RetriesTransientResolverFailure, FailsClosedAfterRetries, HonorsPerAttemptTimeout, PerHostFailureFailsAggregate) unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
73827d2 to
2391183
Compare
|
Darren Liew seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Paired with #645 — adds the test coverage audit flagged as missing on that PR (PR slot
feat/runtime-guard-and-dfreused per request; original disk-guard content superseded by #647).Two-sided verification
Toggled each of the 3 fixes in
dns_filter.goand observed the corresponding new test flip red → green:sort.Slice(... > ...)→<(shortest-first instead of longest-first)LongestSuffixWinsBeforeRootgithub.com. must precede com.return nil, lastErr→return nil, nilinlookupWithRetry(silent failure on exhausted retries)ProductionSymptomReproducerREGRESSION — fix gone: buildTapConfig returned a silent half-baked config with allow_zones=3 total_zones=4 while one allow_net host failed to resolve. This is the exact symptom users reported.append(zones, ...)→ prepend (move sinkhole to front)CatchAllRootZoneAlwaysLast(3/4 sub-tests;0_hostsstays green because there's no other zone to displace it)last zone must be root catch-all (Name==""), got "net." at index 4/5Each toggle was applied in isolation, the affected test was run, the message above captured, then the file restored and the full suite re-run to confirm all 14 tests green again. This is the verification mode #645 itself didn't have — its tests asserted invariants but none was confirmed to flip red when the corresponding production line is removed.
Test plan
All 14 tests green. The 4 original tests from #645 are unchanged.
What's added (brief)
LongestSuffixWinsBeforeRootrewritten to usefixedIPResolverinstead of calling the real system resolver againstiapi.merck.com(Merck-internal — NXDOMAINs on every public CI host, so the test was failing for the wrong reason and never exercised the sort logic).CatchAllRootZoneAlwaysLast(sub-tests 0/1/4 hosts) — pins the invariant the source comments state but no test asserted.BuildTapConfig_DNSResolutionFailure_PropagatesErrorNotSilentConfig— end-to-end throughbuildAllowNetDNSZones → buildTapConfigwith an RFC 6761.invalidhost.BuildTapConfig_NoAllowNet_HappyPath_NoDNSLookup— negative control with 50ms upper-bound timer.BuildTapConfig_ProductionSymptomReproducer— operator-shaped reproducer, message names the exactallow_zones=N-1 total_zones=Nshape.Delta:
dns_filter_test.go+222 / -2. One small mock helper (fixedIPResolver, ~5 lines).Out of scope
gvproxy_create(cgo//export, needs socket + VM stub).box.createintegration test.main.go(OverrideTCPHandler, VFKit/QemuAccept, protocol handlers) — same shape as fix(net): surface gvproxy bind errors through FFI #612's race, separate PR worth of work.🤖 Generated with Claude Code